From CSV To GraphRAG Systems With Neo4j And LangChain | Knowledge Graphs RAG
Introduction
In this article, we'll explore how to build a simple graph-based retrieval-augmented generation (GraphRAG) application using Neo4j and LangChain by transforming a CSV dataset into a Knowledge Graph. We will use the Northwind dataset to demonstrate the process: data pre-processing, populating the database with Cypher queries, and, in a follow-up article, building the GraphRAG application itself.
Overview
The steps to build this application include:
- Data Preparation: We will begin with a collection of CSV files from the Northwind dataset and carry out various pre-processing tasks such as combining data frames to create a unified dataset.
- Knowledge Graph Creation: Next, we will connect to a Neo4j instance and write Cypher code to insert this pre-processed data into our Knowledge Graph.
- GraphRAG Application Development: Once we have successfully populated our Knowledge Graph, we will build a simple GraphRAG application using LangChain.
The project will be delivered in two parts: this article will focus on data preparation and graph creation, while the subsequent piece will cover building the application on top of the created knowledge graph.
Data Preparation
The Northwind dataset consists of multiple CSV files covering categories, customers, orders, products, suppliers, and employees. We will use the Pandas library to read and process these files.
Step 1: Installing Required Dependencies
Before we start coding, ensure you have installed the required libraries, including Pandas and the Neo4j Python driver.
pip install pandas neo4j
Step 2: Loading CSV Files
Start by loading the CSV files into Pandas DataFrames:
import pandas as pd
category_df = pd.read_csv('data/category.csv')
product_df = pd.read_csv('data/products.csv')
supplier_df = pd.read_csv('data/suppliers.csv')
Once loaded, you can preview the data:
print(category_df.head())
Step 3: Merging DataFrames
Next, we combine the DataFrames into a single table by joining on their shared key columns: CategoryID links each product to its category, and SupplierID links each product to its supplier.
product_category_df = pd.merge(product_df, category_df, on='CategoryID')
product_supplier_df = pd.merge(product_category_df, supplier_df, on='SupplierID')
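As a minimal sketch of the key-based merge above, the toy frames below stand in for the Northwind files. The rows are made up for illustration, and the column names other than CategoryID and ProductName are assumptions about the dataset. The `validate` argument asks pandas to raise an error if a supposedly unique key is duplicated, and `how="left"` keeps products whose category is missing rather than dropping them silently.

```python
import pandas as pd

# Toy stand-ins for products.csv and category.csv (values are illustrative).
products = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "ProductName": ["Chai", "Chang", "Aniseed Syrup"],
    "CategoryID": [1, 1, 2],
})
categories = pd.DataFrame({
    "CategoryID": [1, 2],
    "CategoryName": ["Beverages", "Condiments"],
})

# Many products map to one category, so each CategoryID must be unique
# on the right-hand side; validate raises MergeError otherwise.
merged = pd.merge(
    products,
    categories,
    on="CategoryID",
    how="left",
    validate="many_to_one",
)

print(merged[["ProductName", "CategoryName"]])
```

The default inner join used in the article drops rows with no match on the key, so comparing row counts before and after the merge is a quick sanity check.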
Step 4: Data Cleaning
Data cleaning is vital to ensure quality. We'll fill missing values with the placeholder string 'unknown'.
product_supplier_df.fillna('unknown', inplace=True)
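One caveat: calling fillna('unknown') on the whole frame also writes the string into numeric columns, which can turn them into mixed-type columns. The sketch below, on made-up data, fills only the non-numeric columns and leaves numeric gaps as NaN. Whether that policy suits your data is an assumption worth checking.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up rows: one missing name and one missing price.
df = pd.DataFrame({
    "ProductName": ["Chai", None],
    "UnitPrice": [18.0, None],
})

# Fill gaps only in non-numeric columns so numeric columns keep their dtype.
cleaned = df.copy()
for column in cleaned.columns:
    if not is_numeric_dtype(cleaned[column]):
        cleaned[column] = cleaned[column].fillna("unknown")

print(cleaned)
```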
Step 5: Creating the Neo4j Connection
Once the data is prepped, we will establish a connection with our Neo4j database using the Neo4j Python driver.
from neo4j import GraphDatabase
uri = "neo4j://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("username", "password"))
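Hardcoded credentials are fine for a local demo, but one common alternative is to read them from environment variables. The helper below is a hypothetical sketch: the function name and the variable names NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD are assumed conventions, not something the driver requires.

```python
import os


def neo4j_settings(env=os.environ):
    """Return (uri, (user, password)) read from environment variables.

    The variable names are an assumed convention, not a driver requirement.
    """
    uri = env.get("NEO4J_URI", "neo4j://localhost:7687")
    user = env.get("NEO4J_USERNAME", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if password is None:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return uri, (user, password)


# Possible usage with the driver from the step above:
#   uri, auth = neo4j_settings()
#   driver = GraphDatabase.driver(uri, auth=auth)
#   driver.verify_connectivity()  # fails early if the server is unreachable
```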
Step 6: Inserting Data into Neo4j Database
Now, we can write Cypher queries to insert our data into the Neo4j database.
with driver.session() as session:
    for row in product_supplier_df.itertuples():
        # Node properties go in curly braces; $id and $name are query parameters.
        session.run("CREATE (p:Product {id: $id, name: $name})",
                    id=row.ProductID, name=row.ProductName)
    # Repeat similarly for categories and suppliers.
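Running one query per row works for a small dataset, but a common alternative is to send rows in batches with UNWIND. The following is a sketch under stated assumptions, not the article's method: the names `batches`, `load_products`, and `PRODUCT_QUERY` are hypothetical, and MERGE is swapped in for CREATE so that re-running the load does not duplicate nodes.

```python
from itertools import islice

# One query per batch: UNWIND turns the list parameter into rows, and
# MERGE avoids duplicate Product nodes if the load is run twice.
PRODUCT_QUERY = """
UNWIND $rows AS row
MERGE (p:Product {id: row.ProductID})
SET p.name = row.ProductName
"""


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def load_products(session, rows, batch_size=500):
    """Send rows (dicts with ProductID and ProductName) in batches.

    `session` is any object with a Neo4j-style run(query, **params) method.
    Returns the number of queries sent.
    """
    sent = 0
    for batch in batches(rows, batch_size):
        session.run(PRODUCT_QUERY, rows=batch)
        sent += 1
    return sent
```

With the merged DataFrame from the earlier steps, the rows could come from `product_supplier_df[["ProductID", "ProductName"]].to_dict("records")`, and the session from `driver.session()`.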
Step 7: Creating Relationships
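After inserting entities, we can create relationships between products, categories, and suppliers. The example below uses placeholder IDs and opens a new session, because the session from the previous step is closed once its with block ends.
with driver.session() as session:
    # Match the two existing nodes by id, then link them.
    session.run("""
        MATCH (p:Product), (c:Category)
        WHERE p.id = $product_id AND c.id = $category_id
        CREATE (p)-[:BELONGS_TO]->(c)
    """, product_id='example_id', category_id='example_category_id')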
Graph Application Development with LangChain
In a follow-up article, we will delve into the development of a GraphRAG application, building on the Neo4j Knowledge Graph created in this part.
Keywords
Neo4j, LangChain, GraphRAG, CSV, Knowledge Graph, Data Pre-processing, Cypher, Pandas, DataFrames, Relationships.
FAQ
Q: What is the Northwind dataset?
A: Northwind is a sample database that provides various structured data about a fictitious company's sales, customers, and products, often used for demonstration and educational purposes.
Q: What is a Knowledge Graph?
A: A Knowledge Graph is a structured representation of information that defines entities and the relationships between them, allowing for better data retrieval and interpretation.
Q: How does Neo4j help in managing Knowledge Graphs?
A: Neo4j is a graph database that excels in managing and querying structured data through relationships, making it ideal for implementing Knowledge Graphs.
Q: What is LangChain?
A: LangChain is a framework for building applications on top of language models, including retrieval-augmented generation (RAG) applications such as the GraphRAG application described in this series.
Q: Why is data pre-processing important?
A: Data pre-processing ensures data quality by cleaning, merging, and transforming datasets before they are used in knowledge graphs or machine learning models, providing reliable outputs.