How to Summarize PDF Using LangChain | OpenAI | Gradio


Introduction

Hello everyone and welcome back! Today I'm continuing the LangChain video series, which covers the documentation along with some practical use cases. The first use case is building a PDF summarizer. Here's a walkthrough of the steps involved:

Getting Started

To begin, we'll discuss the various components needed before diving into the summarization process. We'll start with a simple Python code snippet that achieves the summarization. If you're someone who prefers a more interactive UI, I will also demonstrate how to accomplish the same task using Gradio. Let's dive in!

Install Necessary Packages

The first step is to install the required packages. Below is a list of the packages you need:

pip install gradio openai pypdf tiktoken langchain

You can find more information about these packages from the provided links in the documentation.

OpenAI API Key

Because we are using OpenAI, you will need an API key, which you can obtain from OpenAI's official website. Wherever the code shows the YOUR_API_KEY placeholder, substitute your own key.
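
Rather than pasting the key into the script, one common option is to read it from an environment variable. The sketch below assumes the key is stored in OPENAI_API_KEY; the helper name load_api_key is hypothetical and not part of any library.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise a clear error if it is unset."""
    # Strip stray whitespace that often sneaks in when copying keys
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The returned value can then be passed wherever the code expects your key, which keeps the key out of shared notebooks.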

Understanding tiktoken

tiktoken is OpenAI's tokenizer library. Here's a simple function that counts the tokens in a string, to help you understand how it works:

import tiktoken

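def get_token_length(string, model_name='gpt-3.5-turbo'):
    # Look up the encoding that matches the given model name
    encoding = tiktoken.encoding_for_model(model_name)
    # Encode the text into token IDs and count them
    tokens = encoding.encode(string)
    return len(tokens)

string = "tiktoken is great!"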
print(get_token_length(string))

Importing Necessary Libraries

We need several libraries like Gradio, LangChain, OpenAI, and PyPDF. Here are the imports:

import gradio as gr
from langchain import OpenAI, PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFLoader

Load and Parse PDF

We will use a technical report as a sample PDF. Download it using:

wget <URL_TO_PDF_FILE>

Load the PDF and split it into chunks:

loader = PyPDFLoader("path/to/your/pdf.pdf")
# load_and_split() reads the PDF and returns a list of Document chunks
chunks = loader.load_and_split()
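
To give a feel for what splitting a document into chunks involves, here is a minimal plain-Python sketch that cuts text into fixed-size, overlapping pieces. It is an illustration only, not LangChain's actual splitter; the function name split_text and the chunk_size and overlap defaults are assumptions chosen for the example.

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Smaller chunks mean more calls to the model, while larger chunks risk exceeding the model's token limit, so the sizes are worth tuning for your document.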

Create a Summarization Function

Here's a function to perform the summarization:

def summarize_pdf(pdf_path):
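    # Load the PDF and split it into a list of Document chunks
    loader = PyPDFLoader(pdf_path)
    docs = loader.load_and_split()

    # The LangChain OpenAI wrapper takes the key as openai_api_key
    llm = OpenAI(model_name="text-davinci-002", openai_api_key="YOUR_API_KEY")
    # map_reduce summarizes each chunk, then combines the partial summaries
    chain = load_summarize_chain(llm, chain_type="map_reduce")

    summary = chain.run(docs)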
    return summary
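
The map_reduce chain type summarizes each chunk on its own and then summarizes the combined partial summaries. The following is a rough sketch of that idea in plain Python, not LangChain's implementation; summarize_fn is a hypothetical stand-in for the LLM call, and first_sentence is a toy substitute so the example runs without an API key.

```python
def map_reduce_summarize(chunks, summarize_fn):
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    if not chunks:
        return ""
    # Map step: one summary per chunk
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries
    combined = "\n".join(partial_summaries)
    return summarize_fn(combined)


def first_sentence(text):
    """Toy stand-in for an LLM: keep only the first sentence of the text."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    pages = ["Page one intro. More detail here.", "Page two intro. Even more detail."]
    print(map_reduce_summarize(pages, first_sentence))
```

Because each chunk is summarized separately, this approach can handle documents that would not fit into a single prompt, at the cost of one extra model call per chunk.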

Running the Summarizer

To run the summarizer:

summary = summarize_pdf("path/to/your/pdf.pdf")
print(summary)

Creating a UI with Gradio

If you prefer a UI, here’s how you can do it with Gradio:

import gradio as gr

def summarize_ui(pdf_path):
    return summarize_pdf(pdf_path)

gr_interface = gr.Interface(
    fn=summarize_ui,
    inputs="text",
    outputs="text",
    title="PDF Summarizer",
    description="Provide the PDF file path",
)

gr_interface.launch(share=True)

Executing the cell will generate a Gradio interface. Copy the path of your PDF file, paste it into the input field, and click the submit button to get the summarized text.
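
Since the interface takes a file path typed by the user, it can help to check the path before calling the summarizer. Here is a minimal sketch, assuming a hypothetical helper check_pdf_path that returns an error message to show in the UI, or None when the path looks usable:

```python
import os

def check_pdf_path(pdf_path):
    """Return an error message if pdf_path is unusable, otherwise None."""
    path = pdf_path.strip()
    if not path:
        return "Please provide a PDF file path."
    if not path.lower().endswith(".pdf"):
        return "The path does not point to a .pdf file."
    if not os.path.isfile(path):
        return f"No file found at: {path}"
    return None
```

A UI function could call this first and return the message as the output text instead of running the summarizer when the check fails.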

Conclusion

Creating a PDF summarizer is straightforward whether you prefer a command-line approach or an interactive UI. Stay tuned for more LangChain use cases!

Keywords

  1. LangChain
  2. PDF Summarizer
  3. OpenAI
  4. Gradio
  5. tiktoken
  6. PyPDFLoader
  7. API Key
  8. Chunking
  9. Summarization Chain
  10. Interface

FAQ

  1. What is LangChain?

    • LangChain is a tool for creating language model chains for performing various tasks like summarizing, generating text, and more.
  2. How do I get an OpenAI API key?

    • You can obtain an API key from OpenAI's official website.
  3. What is the purpose of tiktoken?

    • tiktoken is OpenAI's tokenizer library, used to convert text into tokens.
  4. How do I install the required packages?

    • You can install the necessary packages using the command: pip install gradio openai pypdf tiktoken langchain.
  5. Can I use the summarizer without a UI?

    • Yes, you can call the summarization function directly from Python; Gradio is only needed if you want the interactive UI.
  6. What is PyPDFLoader?

    • PyPDFLoader is LangChain's document loader for PDF files; it reads a PDF with the pypdf package and returns its contents as Document objects.
  7. Can I share the Gradio interface with others?