
Easily Build Quality Knowledge Graphs from Text



Introduction

Building knowledge graphs from raw text using large language models (LLMs) has become increasingly accessible, but several challenges remain in ensuring the quality of the resulting graphs. Knowledge graphs often contain dangling entities (entities that are not connected to anything else) as well as duplicates, where several references to the same entity are treated as distinct nodes. Additionally, extracting relevant information often requires tedious pre-processing of the raw text.

In response to these challenges, a library called tex2kg has been developed with the aim of creating high-quality knowledge graphs. Unlike conventional frameworks, it ensures that every entity is connected and that no entity is duplicated. The pipeline consists of three key steps:

  1. Document Distillation: Instead of analyzing raw text directly, this method distills important information from the text, ensuring that only relevant data is extracted.

  2. Entity and Relationship Extraction: Using LLMs, entities and relationships are extracted from the distilled documents. Iterative prompting detects unconnected entities and establishes relationships linking them to the rest of the graph, while embedding models are used to resolve duplicate entities.

  3. Data Visualization: Following extraction and resolution processes, the final step is to visualize the knowledge graph, making it easier to glean insights.
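The deduplication step described above can be illustrated with a minimal sketch. This is not the library's actual algorithm, just the general technique: entities whose embedding vectors are close in cosine similarity are merged under one canonical name. The names, vectors, and threshold below are made up for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def resolve_entities(entities, threshold=0.9):
    """Map each entity name to a canonical name.

    `entities` maps an entity name to its embedding vector; the first
    name seen in a similarity cluster becomes the canonical label.
    """
    canonical = {}   # entity name -> canonical name
    reps = []        # (name, embedding) of cluster representatives
    for name, emb in entities.items():
        for rep_name, rep_emb in reps:
            if cosine(emb, rep_emb) >= threshold:
                canonical[name] = rep_name
                break
        else:
            reps.append((name, emb))
            canonical[name] = name
    return canonical

# Toy embeddings: the two spellings of the same person point in
# nearly the same direction, so they collapse to one node.
vecs = {
    "Emily Smith": [0.90, 0.10, 0.00],
    "E. Smith":    [0.88, 0.12, 0.01],
    "Acme Corp":   [0.00, 0.20, 0.90],
}
print(resolve_entities(vecs))
# → {'Emily Smith': 'Emily Smith', 'E. Smith': 'Emily Smith', 'Acme Corp': 'Acme Corp'}
```

In practice the vectors would come from an embedding model, and the threshold would be tuned against the model's similarity distribution.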

Once the data is injected into a graph database, query languages such as Cypher can be used for further analysis. The library itself is user-friendly and can be installed with pip.

Example Workflow

To illustrate the process, let's consider a standard resume as input. We begin by loading the PDF document, which we split into pages for processing. A schema is then defined for the knowledge distillation process—for instance, a CV data model capturing the person's name, phone number, and email, along with details of their work experience and education.
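A schema of the kind described above could be sketched with plain dataclasses. The class and field names here are illustrative, not the library's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WorkExperience:
    company: str
    title: str
    start_year: Optional[int] = None
    end_year: Optional[int] = None

@dataclass
class Education:
    institution: str
    degree: str

@dataclass
class CV:
    """Schema guiding distillation: fields absent from the text stay empty."""
    name: str
    phone: Optional[str] = None
    email: Optional[str] = None
    work_experience: List[WorkExperience] = field(default_factory=list)
    education: List[Education] = field(default_factory=list)

# A distilled record for one resume might look like this:
cv = CV(
    name="Emily",
    email="emily@example.com",
    work_experience=[WorkExperience(company="Acme Corp", title="Data Scientist")],
)
```

The schema serves as a filter: the LLM is asked to populate exactly these fields, so text that matches none of them is simply dropped.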

Two models are needed for this workflow: a language model for distillation and an embedding model for deduplication. In this scenario, we use GPT-4 for the distillation phase. The distillation prompt can include helpful instructions, such as identifying the chunk of text being analyzed or specifying that irrelevant sections should be left empty.
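A distillation prompt with those instructions might be assembled like this. The function name and prompt wording are assumptions for illustration, not the library's actual prompt:

```python
def build_distillation_prompt(chunk, schema_fields):
    """Assemble a per-chunk distillation prompt for the LLM."""
    return (
        "Extract the following fields from the text chunk below: "
        + ", ".join(schema_fields) + ".\n"
        "If a field is not mentioned in this chunk, leave it empty.\n\n"
        "Text chunk:\n" + chunk
    )

prompt = build_distillation_prompt(
    "Emily is a data scientist at Acme Corp.",
    ["name", "phone", "email", "work_experience"],
)
print(prompt)
```

Sending one such prompt per page (or chunk) keeps the LLM focused on the schema and avoids extracting irrelevant content.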

The output reveals structured data organized according to the defined schema, along with embeddings that support entity resolution. Finally, the extracted data is injected into a local Neo4j database, resulting in a knowledge graph with multiple entities and relationships centered on the person in question (Emily, in this case).

The result is a rich knowledge graph featuring 19 entities and 18 relationships, showcasing how easily and effectively a structured representation can be created from seemingly unstructured text.
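The injection step can be sketched by turning extracted (subject, relation, object) triples into Cypher `MERGE` statements, so that re-running the load never duplicates nodes. This is a simplified illustration, not the library's injection code; production code should use the official Neo4j driver with parameterized queries rather than string interpolation:

```python
def triples_to_cypher(triples):
    """Turn (subject, relation, object) triples into Cypher MERGE statements."""
    statements = []
    for subj, rel, obj in triples:
        statements.append(
            f'MERGE (a:Entity {{name: "{subj}"}}) '
            f'MERGE (b:Entity {{name: "{obj}"}}) '
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return statements

# Hypothetical triples extracted from the resume:
stmts = triples_to_cypher([
    ("Emily", "WORKS_AT", "Acme Corp"),
    ("Emily", "STUDIED_AT", "State University"),
])
for s in stmts:
    print(s)
```

Each statement can then be executed against the local Neo4j instance; `MERGE` matches an existing node or relationship if one exists and creates it otherwise.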

Keywords

  • Knowledge Graphs
  • Large Language Models (LLMs)
  • Document Distillation
  • Entity Extraction
  • Relationship Extraction
  • Entity Resolution
  • Graph Database
  • Neo4j
  • Cypher

FAQ

Q1: What are knowledge graphs?
A1: Knowledge graphs are structured representations of information in which entities are nodes and the relationships between them are edges.

Q2: What challenges are commonly faced when building knowledge graphs from text?
A2: The primary challenges include dangling entities that lack connections, entity duplication, and the need for pre-processing raw text to extract relevant information.

Q3: How does the tex2kg library address these challenges?
A3: tex2kg addresses these challenges through document distillation, entity and relationship extraction, and data visualization, ensuring interconnected entities without duplication.

Q4: What does the document distillation process entail?
A4: Document distillation involves extracting relevant information from raw text, using a predefined schema to filter unnecessary content, and ensuring that unimportant sections remain empty.

Q5: How can I visualize the knowledge graph once it is built?
A5: After injecting the data into a graph database like Neo4j, you can query the graph with languages such as Cypher and visualize the results with tools like Neo4j Browser.