How to parse PDF docs for RAG

,
Sep 29, 2024
Open in Github

This notebook shows how to leverage GPT-4o to turn rich PDF documents such as slide decks or exports from web pages into usable content for your RAG application.

This technique can be used if you have a lot of unstructured data containing valuable information that you want to be able to retrieve as part of your RAG pipeline.

For example, you could build a Knowledge Assistant that could answer user queries about your company or product based on information contained in PDF documents.

The example documents used in this notebook are located at data/example_pdfs. They are related to OpenAI's APIs and various techniques that can be used as part of LLM projects.

Data preparation

In this section, we will process our input data to prepare it for retrieval.

We will do this in 2 ways:

  1. Extracting text with pdfminer
  2. Converting the PDF pages to images to analyze them with GPT-4o

You can skip the 1st method if you want to only use the content inferred from the image analysis.

Setup

We need to install a few libraries to convert the PDF to images and extract the text (optional).

Note: You need to install poppler on your machine for the pdf2image library to work. You can follow the instructions to install it here.

%pip install pdf2image -q
%pip install pdfminer -q
%pip install pdfminer.six -q
%pip install openai -q
%pip install scikit-learn -q
%pip install rich -q
%pip install tqdm -q
%pip install pandas -q
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
# Imports
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
from pdfminer.high_level import extract_text
import base64
import io
import os
import concurrent.futures
from tqdm import tqdm
from openai import OpenAI
import re
import pandas as pd 
from sklearn.metrics.pairwise import cosine_similarity
import json
import numpy as np
from rich import print
from ast import literal_eval
def convert_doc_to_images(path):
    images = convert_from_path(path)
    return images

def extract_text_from_doc(path):
    text = extract_text(path)
    return text
file_path = "data/example_pdfs/fine-tuning-deck.pdf"

images = convert_doc_to_images(file_path)
text = extract_text_from_doc(file_path)
for img in images:
    display(img)
image generated by notebookimage generated by notebookimage generated by notebookimage generated by notebookimage generated by notebook