Generating PDF Summaries Using ChatGPT with Python

In today’s data-driven world, extracting valuable insights from a multitude of PDF documents is a common challenge. Fortunately, with the power of Python and AI, you can automate the process of summarizing PDFs using ChatGPT. In this blog, we’ll walk you through the steps to achieve this task efficiently.

How can I use ChatGPT to create a summary of a PDF document?

Please make sure to install the following dependencies: Flask, azure-cognitiveservices-vision-computervision, PyMuPDF, langchain, and openai version 0.28.1.
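Assuming a fresh virtual environment, the packages can be installed from PyPI. The code in the later steps also imports natsort, python-dotenv, and tiktoken, so those are included here as well:

```shell
pip install Flask azure-cognitiveservices-vision-computervision PyMuPDF langchain "openai==0.28.1" natsort python-dotenv tiktoken
```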

 

Step 1: Uploading PDFs via Flask

We begin by setting up a Python application using Flask to create an API for PDF upload. Users can conveniently send their PDF documents through this interface, making the process user-friendly.

import os

from flask import Flask, request

app = Flask(__name__)

@app.route('/upload_pdf', methods=['POST'])
def main():
    files = request.files['pdf']
    return convert_pdf_to_jpg(files, os.path.splitext(files.filename)[0])

if __name__ == '__main__':
    app.run(debug=True)

This Flask app exposes a '/upload_pdf' route for POST requests. It receives the uploaded PDF and hands it, along with its file name stripped of the extension, to the convert_pdf_to_jpg function, which converts the pages to JPG images.
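Once the app is running (by default on http://127.0.0.1:5000), a client can exercise the endpoint with curl. The form field name pdf matches the route above; the sample file name is an assumption:

```shell
curl -X POST -F "pdf=@sample.pdf" http://127.0.0.1:5000/upload_pdf
```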

 

Step 2: Converting PDF Pages to JPG

To work with the content of PDFs, we utilize the fitz library (PyMuPDF) to convert each page of the PDF into a JPG image. This step puts every page, including scanned ones, into a format the OCR service can process in the next step.

import os

import fitz  # PyMuPDF

def convert_pdf_to_jpg(pdf_file, name_without_extension):
    # Open the uploaded PDF from its in-memory stream
    pdf_document = fitz.open(stream=pdf_file.read(), filetype="pdf")
    pdf_file.seek(0)  # rewind so the upload can still be saved later

    # Create a directory to save the images
    output_dir = "Extraction_images/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Loop through the pages and convert to images
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        # Render at 300 DPI; adjust the resolution as needed
        pixmap = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        pixmap.save(f"{output_dir}{name_without_extension}_{page_num + 1}.jpg")

    # Close the PDF document
    pdf_document.close()

    return Extract_text_from_jpg(pdf_file)

The code converts a PDF into JPG images, creating an output directory for them. It loops through the PDF pages, renders each at the chosen resolution, and saves it as an image. Finally, it calls Extract_text_from_jpg(pdf_file) to begin OCR.
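The fitz.Matrix(300/72, 300/72) argument is what sets the resolution: PDF coordinates are measured in points (72 per inch), so scaling by 300/72 renders the page at 300 DPI. A small sketch of the arithmetic, using a US Letter page as an example:

```python
def rendered_size(width_pts: float, height_pts: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a PDF page rendered at the given DPI."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pts * scale), round(height_pts * scale)

# A US Letter page is 612 x 792 points (8.5 x 11 inches)
print(rendered_size(612, 792))  # (2550, 3300)
```

Higher DPI gives the OCR service more pixels to work with, at the cost of larger files and slower uploads.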

 

Step 3: Optical Character Recognition (OCR)

With our PDF pages in image format, we employ Azure OCR Cognitive Services to extract text from each JPG file. This text is then compiled and organized into a single text file.

import glob
import os
import time

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials
from natsort import natsorted

def Azure_Client():
    subscription_key = ""
    endpoint = ""
    return ComputerVisionClient(
        endpoint, CognitiveServicesCredentials(subscription_key)
    )

def txt_to_file(file_path, string_to_write):
    try:
        print(file_path)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(string_to_write)
        print("OCR Completed")
    except IOError:
        print("An error occurred while writing to the file.")

def ocr_single_file(computervision_client, image_path):
    with open(image_path, "rb") as file:
        read_response = computervision_client.read_in_stream(file, raw=True)

    # Get the operation location (URL with an ID at the end) from the response
    read_operation_location = read_response.headers["Operation-Location"]
    # Grab the ID from the URL
    operation_id = read_operation_location.split("/")[-1]

    # Call the "GET" API repeatedly until the results are ready
    while True:
        read_result = computervision_client.get_read_result(operation_id)
        if read_result.status not in ["notStarted", "running"]:
            break
        time.sleep(6)

    text = ""
    # Collect the detected text, line by line
    if read_result.status == OperationStatusCodes.succeeded:
        for text_result in read_result.analyze_result.read_results:
            for line in text_result.lines:
                text += "\n" + line.text
            text += "\n\n"

    return text

def create_folder(image_path):
    # Extract the file name from the location string
    file_name = os.path.basename(image_path)
    output_dir = os.path.splitext(file_name)[0]
    # Create an output folder
    output_dir = "Extraction_text/" + output_dir
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_file = output_dir + "/" + "ocr.txt"
    return output_file

def Extract_text_from_jpg(file):
    file.save('./' + file.filename)
    folder_path = "./Extraction_images"

    # List all JPG files in the folder, in natural page order
    jpg_files = natsorted(glob.glob(os.path.join(folder_path, "*.jpg")))

    computervision_client = Azure_Client()
    for jpg_file in jpg_files:
        extracted_text = ocr_single_file(computervision_client, jpg_file)
        output_file = create_folder(jpg_file)
        txt_to_file(output_file, extracted_text)

  • Define Azure_Client function: This function sets up the Azure Cognitive Services client by providing the subscription key and endpoint.
  • txt_to_file function: This function writes text to a file provided by file_path.
  • ocr_single_file function: This function performs OCR on a single JPG image. It submits the image to Azure Cognitive Services, polls until the read operation completes, and returns the extracted text.
  • create_folder function: This function creates an output folder based on the PDF file’s name to store the OCR results in a text file.
  • Extract_text_from_jpg function: This function handles the uploaded PDF and the OCR conversion. It saves the uploaded file, processes each JPG image generated from the PDF, and extracts text from it using Azure Cognitive Services. The extracted text is saved to a corresponding output folder as a text file.

Overall, the code processes PDF files, extracts text from images within them, and stores the extracted text in individual text files in output folders.
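One detail worth noting: the images are ordered with natsorted so that page 10 sorts after page 2 rather than before it, which a plain lexicographic sort would get wrong. A pure-Python approximation of what natsorted does, for illustration:

```python
import re

def natural_key(name: str):
    """Split a string into text and integer runs so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

files = ["doc_10.jpg", "doc_2.jpg", "doc_1.jpg"]
print(sorted(files))                   # ['doc_1.jpg', 'doc_10.jpg', 'doc_2.jpg']
print(sorted(files, key=natural_key))  # ['doc_1.jpg', 'doc_2.jpg', 'doc_10.jpg']
```

Getting this order right matters because the per-page OCR results are later concatenated into one document.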

 

Step 4: Text Chunking with langchain

To make the text more manageable, we use the langchain library and its RecursiveCharacterTextSplitter feature. This allows us to divide the text into smaller, more digestible chunks. The chunk_size and separators parameters let you customize the splitting process to suit your needs.

import glob
import os

import tiktoken  # required by from_tiktoken_encoder
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_spliting():
    folder_path = "./"
    # Get the text file
    latest_text_path = glob.glob(os.path.join(folder_path, "*.txt"))
    with open(latest_text_path[0], encoding="utf-8") as t:
        prompt_text = t.read()
    prompt_text = prompt_text.replace('\r', '')

    chunk_size = 10000
    separators = ['\n\n\n']
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=0,
        separators=separators
    )
    docs = text_splitter.split_text(prompt_text)
    print(len(docs), "after chunk")

    with open('./chunk_output.txt', 'a', encoding="utf-8") as f:
        for chunk in docs:
            f.write(chunk)
            f.write("\n\n\n\n\n")
            f.write("Next chunk")

    return {"chunk_doc": "chunk_output.txt"}

The code utilizes the Langchain library for text splitting. It reads a text file, divides it into smaller chunks based on specified separators, and then saves the resulting chunks in a separate file. The code returns information about the processed chunks in a dictionary.
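Langchain's splitter does the heavy lifting here, but the core idea is simple: cut the text at the separator and greedily pack the pieces into chunks no larger than chunk_size. A simplified, pure-Python approximation (character-based, unlike from_tiktoken_encoder, which counts tokens):

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if current and len(current) + len(separator) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = current + separator + piece if current else piece
    if current:
        chunks.append(current)
    return chunks

# Six 11-character "pages" packed into 30-character chunks
pages = "\n\n\n".join(f"page {i} text" for i in range(1, 7))
print(len(split_text(pages, chunk_size=30)))  # 3
```

Token-based counting is preferable in practice because model context limits are expressed in tokens, not characters.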

 

Step 5: Summarization with ChatGPT

As ChatGPT processes each text chunk, it generates corresponding summaries. These summaries are collected and assembled into a final text file. This consolidated document provides a concise yet comprehensive overview of the original PDF content.

import os

import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

def summarize_prompt():
    res = text_spliting()
    with open('./' + res['chunk_doc'], encoding="utf-8") as f:
        chunks = f.read().split("Next chunk")

    with open('./summarize_output.txt', 'a', encoding="utf-8") as summarize_file:
        for text in chunks:
            prompt = f"""
            Your task is to generate a short summary for the "Example" domain.
            Summarize the text below, delimited by triple backticks.
            Produce results that encompass both a concise summary
            and bullet-pointed insights.
            ```{text}```
            """
            summary = get_completion(prompt=prompt, model="gpt-3.5-turbo")
            summarize_file.write(summary)
            summarize_file.write("\n\n\n")

The code uses the OpenAI API to generate summaries for text chunks. It loads the OpenAI API key from a local .env file, defines a function get_completion to retrieve text completions, and another function summarize_prompt to split text, generate summaries for each chunk, and write the summaries to an output file. The code is designed for summarizing text data related to the “example” domain.

 

Step 6: Delivering the Summarized Text

The final text file, containing all the summarized information, is ready to be delivered to the client. This step ensures that the extracted insights are readily accessible and easy to understand.
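The blog does not show code for this step; here is a minimal sketch, assuming the summaries were written to summarize_output.txt in Step 5. The helper packages the result as a small payload so the example stays framework-agnostic:

```python
from pathlib import Path

def deliver_summary(summary_path: str = "summarize_output.txt") -> dict:
    """Package the final summary file for delivery to the client."""
    path = Path(summary_path)
    if not path.exists():
        raise FileNotFoundError(f"No summary found at {summary_path!r}")
    return {"filename": path.name, "summary": path.read_text(encoding="utf-8")}
```

In the Flask app from Step 1, this would typically be exposed as a GET route that returns flask.send_file(summary_path, as_attachment=True) for a direct download.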

By following these steps, you can streamline the process of extracting valuable information from PDF documents using Python and the OpenAI API. This automated approach not only saves time but also ensures accuracy and consistency in your summarization tasks.

Are you ready to supercharge your PDF summarization process with the power of AI and Python? Try out these steps and transform the way you handle PDF documents. It’s a game-changer for researchers, professionals, and anyone dealing with large volumes of textual data.

Conclusion:

With Python, Flask, Azure OCR, ChatGPT, and thoughtful libraries like langchain, you can transform PDFs into concise, actionable insights. By automating the summarization process, you save time and enhance your document handling efficiency. Embrace the power of AI and take your PDF summarization to the next level.

Thank you for reading our blog! We hope you found it helpful. We’d love to hear your feedback. Please feel free to share your thoughts and suggestions on how we can improve or any other topics you’d like us to cover in the future. Your input is valuable to us.

 

Connect With Us!