Chat with Multiple PDFs | LangChain App Tutorial in Python (Free LLMs and Embeddings)

Good morning, everyone! Welcome to this video tutorial, where I'll show you how to build a chatbot that lets you chat with multiple PDFs from your computer at once. Let's dive in.

How It Works

The application lets users upload multiple PDFs, process them, and then ask questions about their contents. For this example, I uploaded the Constitution and the Bill of Rights. During processing, the text of the documents is embedded and stored in a vector store, so users can ask questions such as "What are the three branches of the United States government?" and get answers based on the uploaded PDFs.
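To make the retrieval idea concrete, here is a minimal sketch of how a question can be matched against stored chunks. It is not the tutorial's implementation: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and `retrieve` is a hypothetical helper that plays the role of the vector store lookup.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "The legislative power is vested in a Congress of the United States.",
    "The executive power is vested in a President of the United States.",
    "Congress shall make no law abridging the freedom of speech.",
]
top = retrieve("Who holds the executive power?", chunks)
print(top[0])
```

In the real app, the retrieved chunks are passed to the language model as context, which is what lets it answer from the PDFs rather than from its general knowledge.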

Setting Up the Environment

Creating a Virtual Environment:

python -m venv myenv
source myenv/bin/activate

Installing Dependencies:

pip install streamlit
pip install pypdf2
pip install langchain
pip install python-dotenv
pip install faiss-cpu
pip install openai
pip install huggingface_hub

Graphical User Interface (GUI)

To build the GUI, we use Streamlit, a Python library for creating web apps.

Setting Page Configuration:

import streamlit as st
st.set_page_config(page_title="Chat with Multiple PDFs", page_icon=":books:")

Adding a Header and Sidebar:

st.header("Chat with Multiple PDFs :books:")
query = st.text_input("Ask a question about your documents here")

with st.sidebar:
    st.subheader("Your Documents")
    pdf_docs = st.file_uploader("Upload your PDFs here and click on process", accept_multiple_files=True)
    process = st.button("Process")

Backend Logic

Processing PDF Documents

from PyPDF2 import PdfReader

def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images)
            text += page.extract_text() or ""
    return text

Splitting Text into Chunks

from langchain.text_splitter import CharacterTextSplitter

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)
    chunks = text_splitter.split_text(text)
    return chunks
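To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that slices a string at fixed character offsets. This is only an illustration: `split_with_overlap` is a hypothetical helper, and LangChain's `CharacterTextSplitter` works differently in detail, since it first splits on the separator and then merges the pieces up to the chunk size.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, which helps retrieval find it.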

Creating Vector Store with OpenAI Embeddings

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def get_vector_store(chunks):
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_texts(chunks, embeddings)
    return vector_store

Creating a Conversational Chain

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

def create_conversation_chain(vector_store):
    # Keep the running chat history so follow-up questions have context
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    # gpt-3.5-turbo is a chat model, so use ChatOpenAI rather than the completion-style OpenAI class
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key="YOUR_OPENAI_API_KEY")
    conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=vector_store.as_retriever(), memory=memory)
    return conversation_chain

Running the Application

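if __name__ == "__main__":
    if process:
        with st.spinner("Processing"):
            raw_text = get_pdf_text(pdf_docs)
            chunked_text = get_text_chunks(raw_text)
            vector_store = get_vector_store(chunked_text)
            conversation = create_conversation_chain(vector_store)
            # Store the chain in session state so it survives Streamlit reruns
            st.session_state.conversation = conversation

    if query:
        user_message = query
        if "conversation" not in st.session_state:
            # No documents have been processed yet in this session
            st.warning("Please upload and process your PDFs first.")
        else:
            response = st.session_state.conversation({"question": user_message})
            st.write(response["chat_history"])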
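Streamlit reruns the whole script on every interaction, which is why the conversation chain is kept in `st.session_state` instead of a normal variable. The sketch below shows the "create once, reuse afterwards" pattern with a hypothetical `get_or_create` helper. It uses a plain dict as a stand-in, on the assumption that `st.session_state` is accessed through the same mapping-style operations (`in`, item get and set).

```python
def get_or_create(state, key, factory):
    """Return state[key], building it with factory() only on first use."""
    if key not in state:
        state[key] = factory()
    return state[key]


# A plain dict standing in for st.session_state in this illustration
fake_session_state = {}
build_calls = []


def build_conversation():
    """Pretend to build an expensive object and record that it happened."""
    build_calls.append(1)
    return "conversation-object"


first = get_or_create(fake_session_state, "conversation", build_conversation)
second = get_or_create(fake_session_state, "conversation", build_conversation)
print(len(build_calls))  # 1
```

The second call returns the stored object without rebuilding it, which is the behavior you want for the embeddings and the chain across reruns.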

Displaying Chat Messages with HTML Templates

import streamlit as st

CSS = """
<style>
/* Your CSS code */
</style>
"""

USER_TEMPLATE = """
<div class="chat_message">
    <div class="user">
        <img src="https://user_image_url.com" alt="User">
        (message)
    </div>
</div>
"""

BOT_TEMPLATE = """
<div class="chat_message">
    <div class="bot">
        <img src="https://bot_image_url.com" alt="Bot">
        (message)
    </div>
</div>
"""

st.write(CSS, unsafe_allow_html=True)

st.write(USER_TEMPLATE.replace("(message)", user_message), unsafe_allow_html=True)
st.write(BOT_TEMPLATE.replace("(message)", bot_response), unsafe_allow_html=True)
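Because the templates are rendered with `unsafe_allow_html=True`, it can be worth escaping the message text before substituting it, so that characters like `<` in a question or answer are shown literally. The following is a sketch under some assumptions: `render_history` is a hypothetical helper, the templates are shortened versions of the ones above, and the history is taken to be a list of plain strings that alternates user message, bot message.

```python
import html

USER_TEMPLATE = '<div class="chat_message"><div class="user">(message)</div></div>'
BOT_TEMPLATE = '<div class="chat_message"><div class="bot">(message)</div></div>'


def render_history(messages):
    """Fill the user/bot templates in turn, escaping each message first."""
    rendered = []
    for i, message in enumerate(messages):
        # Even positions are assumed to be user turns, odd positions bot turns
        template = USER_TEMPLATE if i % 2 == 0 else BOT_TEMPLATE
        rendered.append(template.replace("(message)", html.escape(message)))
    return rendered


rendered = render_history(["What does <b>Article I</b> cover?", "The legislative branch."])
for block in rendered:
    print(block)
```

In the app, each rendered string would be passed to `st.write(..., unsafe_allow_html=True)` in order. If the chat history holds message objects rather than strings, you would pass their text content instead.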

Conclusion

Congratulations on following along to the end! You've successfully built a sophisticated chatbot that can manage multiple PDF documents and provide intelligent responses based on their contents.

Don't forget to subscribe and leave any questions in the comments.

Keywords

PDF
Chatbot
LangChain
Streamlit
Python
OpenAI
HuggingFace
Vector Store
Embeddings
Conversational Chain

FAQ

What libraries do I need to install for this project?
- You will need streamlit, pypdf2, langchain, python-dotenv, faiss-cpu, openai, and huggingface_hub.
How do I split the text into manageable chunks?
- Use the CharacterTextSplitter class from LangChain.
Which class creates the embeddings for the vector store?
- Use OpenAIEmbeddings from LangChain to embed the chunks of text.
Can I use HuggingFace models instead of OpenAI?
- Yes, you can use Hugging Face models such as google/flan-t5-base and plug them into the chain in the same way.
How do you make variables persistent in Streamlit?
- Use st.session_state to keep variables persistent throughout the session.
Is there a way to use free models for this project?
- Yes, you can use one of Hugging Face's Instructor embedding models (for example, hkunlp/instructor-xl) to create embeddings for free.
What if my embedding process is too slow?
- Consider running the embedding model on a GPU, or use a cloud-hosted API such as OpenAI's or Hugging Face's.