Generating PDF Summaries Using ChatGPT with Python

In today’s data-driven world, extracting valuable insights from a multitude of PDF documents is a common challenge. Fortunately, with the power of Python and AI, you can automate the process of summarizing PDFs using ChatGPT. In this blog, we’ll walk you through the steps to achieve this task efficiently.

How can I use ChatGPT to create a summary of a PDF document?

Please make sure to install the following dependencies: Flask, azure-cognitiveservices-vision-computervision, PyMuPDF, langchain, and openai version 0.28.1.
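Assuming a fresh virtual environment, the packages can be installed from PyPI. The code in the later steps also imports natsort, python-dotenv, and tiktoken, so those are included here as well:

```shell
pip install Flask azure-cognitiveservices-vision-computervision PyMuPDF langchain "openai==0.28.1" natsort python-dotenv tiktoken
```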

 

Step 1: Uploading PDFs via Flask

We begin by setting up a Python application using Flask to create an API for PDF upload. Users can conveniently send their PDF documents through this interface, making the process user-friendly.

import os

from flask import Flask, request

app = Flask(__name__)

@app.route('/upload_pdf', methods=['POST'])
def main():
    files = request.files['pdf']
    return convert_pdf_to_jpg(files, os.path.splitext(files.filename)[0])

if __name__ == '__main__':
    app.run(debug=True)

This Flask app exposes a '/upload_pdf' route for POST requests. It receives the uploaded PDF and hands it, along with its file name stripped of the extension, to the convert_pdf_to_jpg function, which converts the pages to JPG images.
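Once the app is running (by default on http://127.0.0.1:5000), a client can exercise the endpoint with curl. The form field name pdf matches the route above; the sample file name is an assumption:

```shell
curl -X POST -F "pdf=@sample.pdf" http://127.0.0.1:5000/upload_pdf
```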

 

Step 2: Converting PDF Pages to JPG

To work with the content of PDFs, we utilize the fitz library (PyMuPDF) to convert each page of the PDF into a JPG image. This step puts every page, including scanned ones, into a format the OCR service can process in the next step.

import os

import fitz  # PyMuPDF

def convert_pdf_to_jpg(pdf_file, name_without_extension):
    # Open the uploaded PDF from its in-memory stream
    pdf_document = fitz.open(stream=pdf_file.read(), filetype="pdf")
    pdf_file.seek(0)  # rewind so the upload can still be saved later

    # Create a directory to save the images
    output_dir = "Extraction_images/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Loop through the pages and convert to images
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        # Render at 300 DPI; adjust the resolution as needed
        pixmap = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        pixmap.save(f"{output_dir}{name_without_extension}_{page_num + 1}.jpg")

    # Close the PDF document
    pdf_document.close()

    return Extract_text_from_jpg(pdf_file)

The code converts a PDF into JPG images, creating an output directory for them. It loops through the PDF pages, renders each at the chosen resolution, and saves it as an image. Finally, it calls Extract_text_from_jpg(pdf_file) to begin OCR.
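The fitz.Matrix(300/72, 300/72) argument is what sets the resolution: PDF coordinates are measured in points (72 per inch), so scaling by 300/72 renders the page at 300 DPI. A small sketch of the arithmetic, using a US Letter page as an example:

```python
def rendered_size(width_pts: float, height_pts: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a PDF page rendered at the given DPI."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pts * scale), round(height_pts * scale)

# A US Letter page is 612 x 792 points (8.5 x 11 inches)
print(rendered_size(612, 792))  # (2550, 3300)
```

Higher DPI gives the OCR service more pixels to work with, at the cost of larger files and slower uploads.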

 

Step 3: Optical Character Recognition (OCR)

With our PDF pages in image format, we employ Azure OCR Cognitive Services to extract text from each JPG file. This text is then compiled and organized into a single text file.

import glob
import os
import time

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials
from natsort import natsorted

def Azure_Client():
    subscription_key = ""
    endpoint = ""
    return ComputerVisionClient(
        endpoint, CognitiveServicesCredentials(subscription_key)
    )

def txt_to_file(file_path, string_to_write):
    try:
        print(file_path)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(string_to_write)
        print("OCR Completed")
    except IOError:
        print("An error occurred while writing to the file.")

def ocr_single_file(computervision_client, image_path):
    with open(image_path, "rb") as file:
        read_response = computervision_client.read_in_stream(file, raw=True)

    # Get the operation location (URL with an ID at the end) from the response
    read_operation_location = read_response.headers["Operation-Location"]
    # Grab the ID from the URL
    operation_id = read_operation_location.split("/")[-1]

    # Call the "GET" API repeatedly until the results are ready
    while True:
        read_result = computervision_client.get_read_result(operation_id)
        if read_result.status not in ["notStarted", "running"]:
            break
        time.sleep(6)

    text = ""
    # Collect the detected text, line by line
    if read_result.status == OperationStatusCodes.succeeded:
        for text_result in read_result.analyze_result.read_results:
            for line in text_result.lines:
                text += "\n" + line.text
            text += "\n\n"

    return text

def create_folder(image_path):
    # Extract the file name from the location string
    file_name = os.path.basename(image_path)
    output_dir = os.path.splitext(file_name)[0]
    # Create an output folder
    output_dir = "Extraction_text/" + output_dir
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_file = output_dir + "/" + "ocr.txt"
    return output_file

def Extract_text_from_jpg(file):
    file.save('./' + file.filename)
    folder_path = "./Extraction_images"

    # List all JPG files in the folder, in natural page order
    jpg_files = natsorted(glob.glob(os.path.join(folder_path, "*.jpg")))

    computervision_client = Azure_Client()
    for jpg_file in jpg_files:
        extracted_text = ocr_single_file(computervision_client, jpg_file)
        output_file = create_folder(jpg_file)
        txt_to_file(output_file, extracted_text)

  • Define Azure_Client function: This function sets up the Azure Cognitive Services client by providing the subscription key and endpoint.
  • txt_to_file function: This function writes text to a file provided by file_path.
  • ocr_single_file function: This function performs OCR on a single JPG image. It submits the image to Azure Cognitive Services, polls until the read operation completes, and returns the extracted text.
  • create_folder function: This function creates an output folder based on the PDF file’s name to store the OCR results in a text file.
  • Extract_text_from_jpg function: This function handles the uploaded PDF and the OCR conversion. It saves the uploaded file, processes each JPG image generated from the PDF, and extracts text from it using Azure Cognitive Services. The extracted text is saved to a corresponding output folder as a text file.

Overall, the code processes PDF files, extracts text from images within them, and stores the extracted text in individual text files in output folders.
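One detail worth noting: the images are ordered with natsorted so that page 10 sorts after page 2 rather than before it, which a plain lexicographic sort would get wrong. A pure-Python approximation of what natsorted does, for illustration:

```python
import re

def natural_key(name: str):
    """Split a string into text and integer runs so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

files = ["doc_10.jpg", "doc_2.jpg", "doc_1.jpg"]
print(sorted(files))                   # ['doc_1.jpg', 'doc_10.jpg', 'doc_2.jpg']
print(sorted(files, key=natural_key))  # ['doc_1.jpg', 'doc_2.jpg', 'doc_10.jpg']
```

Getting this order right matters because the per-page OCR results are later concatenated into one document.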

 

Step 4: Text Chunking with langchain

To make the text more manageable, we use the langchain library and its RecursiveCharacterTextSplitter feature. This allows us to divide the text into smaller, more digestible chunks. The chunk_size and separators parameters let you customize the splitting process to suit your needs.

import glob
import os

import tiktoken  # required by from_tiktoken_encoder
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_spliting():
    folder_path = "./"
    # Get the text file
    latest_text_path = glob.glob(os.path.join(folder_path, "*.txt"))
    with open(latest_text_path[0], encoding="utf-8") as t:
        prompt_text = t.read()
    prompt_text = prompt_text.replace('\r', '')

    chunk_size = 10000
    separators = ['\n\n\n']
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=0,
        separators=separators
    )
    docs = text_splitter.split_text(prompt_text)
    print(len(docs), "after chunk")

    with open('./chunk_output.txt', 'a', encoding="utf-8") as f:
        for chunk in docs:
            f.write(chunk)
            f.write("\n\n\n\n\n")
            f.write("Next chunk")

    return {"chunk_doc": "chunk_output.txt"}

The code utilizes the Langchain library for text splitting. It reads a text file, divides it into smaller chunks based on specified separators, and then saves the resulting chunks in a separate file. The code returns information about the processed chunks in a dictionary.
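Langchain's splitter does the heavy lifting here, but the core idea is simple: cut the text at the separator and greedily pack the pieces into chunks no larger than chunk_size. A simplified, pure-Python approximation (character-based, unlike from_tiktoken_encoder, which counts tokens):

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if current and len(current) + len(separator) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = current + separator + piece if current else piece
    if current:
        chunks.append(current)
    return chunks

# Six 11-character "pages" packed into 30-character chunks
pages = "\n\n\n".join(f"page {i} text" for i in range(1, 7))
print(len(split_text(pages, chunk_size=30)))  # 3
```

Token-based counting is preferable in practice because model context limits are expressed in tokens, not characters.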

 

Step 5: Summarization with ChatGPT

As ChatGPT processes each text chunk, it generates corresponding summaries. These summaries are collected and assembled into a final text file. This consolidated document provides a concise yet comprehensive overview of the original PDF content.

import os

import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

def summarize_prompt():
    res = text_spliting()
    with open('./' + res['chunk_doc'], encoding="utf-8") as f:
        chunks = f.read().split("Next chunk")

    with open('./summarize_output.txt', 'a', encoding="utf-8") as summarize_file:
        for text in chunks:
            prompt = f"""
            Your task is to generate a short summary for the "Example" domain.
            Summarize the text below, delimited by triple backticks.
            Produce results that encompass both a concise summary
            and bullet-pointed insights.
            ```{text}```
            """
            summary = get_completion(prompt=prompt, model="gpt-3.5-turbo")
            summarize_file.write(summary)
            summarize_file.write("\n\n\n")

The code uses the OpenAI API to generate summaries for text chunks. It loads the OpenAI API key from a local .env file, defines a function get_completion to retrieve text completions, and another function summarize_prompt to split text, generate summaries for each chunk, and write the summaries to an output file. The code is designed for summarizing text data related to the “example” domain.

 

Step 6: Delivering the Summarized Text

The final text file, containing all the summarized information, is ready to be delivered to the client. This step ensures that the extracted insights are readily accessible and easy to understand.
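The blog does not show code for this step; here is a minimal sketch, assuming the summaries were written to summarize_output.txt in Step 5. The helper packages the result as a small payload so the example stays framework-agnostic:

```python
from pathlib import Path

def deliver_summary(summary_path: str = "summarize_output.txt") -> dict:
    """Package the final summary file for delivery to the client."""
    path = Path(summary_path)
    if not path.exists():
        raise FileNotFoundError(f"No summary found at {summary_path!r}")
    return {"filename": path.name, "summary": path.read_text(encoding="utf-8")}
```

In the Flask app from Step 1, this would typically be exposed as a GET route that returns flask.send_file(summary_path, as_attachment=True) for a direct download.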

By following these steps, you can streamline the process of extracting valuable information from PDF documents using Python and the OpenAI API. This automated approach not only saves time but also ensures accuracy and consistency in your summarization tasks.

Are you ready to supercharge your PDF summarization process with the power of AI and Python? Try out these steps and transform the way you handle PDF documents. It’s a game-changer for researchers, professionals, and anyone dealing with large volumes of textual data.

Conclusion:

With Python, Flask, Azure OCR, ChatGPT, and thoughtful libraries like langchain, you can transform PDFs into concise, actionable insights. By automating the summarization process, you save time and enhance your document handling efficiency. Embrace the power of AI and take your PDF summarization to the next level.

Thank you for reading our blog! We hope you found it helpful. We’d love to hear your feedback. Please feel free to share your thoughts and suggestions on how we can improve or any other topics you’d like us to cover in the future. Your input is valuable to us.

 

Connect With Us!