
Building a Knowledge Graph from Documents with LLM Extraction

Author: Venkata Sudhakar

ShopMax India has hundreds of product manuals, warranty documents, and supplier agreements stored as text files. To enable Graph RAG, these documents must be converted into a structured knowledge graph of entities and relationships. An LLM automatically extracts entities like products, suppliers, warranty terms, and cities, and defines the relationships between them.

The extraction pipeline works in three steps: chunk the documents, prompt the LLM to extract entities and relationships from each chunk as JSON, and load the results into Neo4j. The LLM identifies entity types (Product, Supplier, City, WarrantyTerm) and relationship types (SUPPLIES, COVERS, LOCATED_IN). Duplicate entities are merged using MERGE statements in Cypher.
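The first step, chunking, can be sketched as follows. This is a minimal illustration rather than the pipeline's actual chunker: the function name `chunk_document` and the `max_chars` limit are assumptions introduced here, and the sketch simply packs consecutive paragraphs into chunks below a character limit.

```python
def chunk_document(text, max_chars=1500):
    """Pack consecutive paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split any single paragraph that is longer than the limit
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate      # piece still fits in the open chunk
            else:
                chunks.append(current)   # close the open chunk, start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 25
chunks = chunk_document(doc, max_chars=22)
print(len(chunks))
```

Each chunk would then be sent to the LLM separately. Splitting on paragraph boundaries keeps related sentences together, which tends to help the model link a warranty term to the product it covers.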

The example below shows how ShopMax India extracts a knowledge graph from product warranty documents using OpenAI in JSON mode and loads it into Neo4j.

from openai import OpenAI
from neo4j import GraphDatabase
import json

client = OpenAI(api_key="your-api-key")
neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

EXTRACT_PROMPT = """Extract entities and relationships from the text.
Return JSON with keys: entities (list of name, type) and relationships (list of from, type, to).
Entity types: Product, Supplier, City, WarrantyTerm
Relationship types: SUPPLIES, COVERS, LOCATED_IN
Text: """

def extract_graph(text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACT_PROMPT + text}]
    )
    return json.loads(response.choices[0].message.content)

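# Labels and relationship types cannot be passed as Cypher parameters, so
# check them against the allowed sets before concatenating them into a query.
ENTITY_TYPES = {"Product", "Supplier", "City", "WarrantyTerm"}
RELATIONSHIP_TYPES = {"SUPPLIES", "COVERS", "LOCATED_IN"}

def load_to_neo4j(graph_data):
    with neo4j.session() as session:
        for entity in graph_data.get("entities", []):
            if entity.get("type") not in ENTITY_TYPES:
                continue  # skip entity types the LLM invented
            # MERGE creates the node only if it does not already exist
            session.run(
                "MERGE (n:" + entity["type"] + " {name: $name})",
                name=entity["name"]
            )
        for rel in graph_data.get("relationships", []):
            if rel.get("type") not in RELATIONSHIP_TYPES:
                continue  # skip relationship types outside the schema
            session.run(
                "MATCH (a {name: $f}), (b {name: $t}) MERGE (a)-[:" + rel["type"] + "]->(b)",
                f=rel["from"], t=rel["to"]
            )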

text = "Mehta Electronics supplies Samsung QLED TVs to ShopMax India Mumbai stores. The warranty covers display defects for 2 years."
graph = extract_graph(text)
print(json.dumps(graph, indent=2))
load_to_neo4j(graph)
neo4j.close()  # release the driver's connection pool
print("Loaded to Neo4j")

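It produces output like the following. The exact entities and relationships may vary between runs because LLM output is not deterministic.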

{
  "entities": [
    {"name": "Mehta Electronics", "type": "Supplier"},
    {"name": "Samsung QLED TV", "type": "Product"},
    {"name": "Mumbai", "type": "City"},
    {"name": "2-year display warranty", "type": "WarrantyTerm"}
  ],
  "relationships": [
    {"from": "Mehta Electronics", "type": "SUPPLIES", "to": "Samsung QLED TV"},
    {"from": "Samsung QLED TV", "type": "LOCATED_IN", "to": "Mumbai"},
    {"from": "2-year display warranty", "type": "COVERS", "to": "Samsung QLED TV"}
  ]
}
Loaded to Neo4j

LLM extraction is imperfect, so run entity deduplication after loading: normalize names to title case and merge near-duplicates. For ShopMax India, process documents in overnight batches and use gpt-4o-mini to keep costs down, since extraction is not latency-sensitive. Store the source document ID on each node so that every entity can be traced back to the document that produced it. After each batch, validate the graph by checking that no relationship references a missing entity.
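Two of these checks can be run on the extracted JSON before anything is written to Neo4j. The sketch below is illustrative only: the helper names `dedup_key`, `find_near_duplicates`, and `find_dangling_relationships` are hypothetical, and it compares names case-insensitively instead of rewriting them to title case, an assumption made here so that acronyms such as QLED keep their spelling.

```python
from collections import defaultdict


def dedup_key(name):
    """Case-insensitive, whitespace-normalized key for spotting near-duplicates."""
    return " ".join(name.split()).casefold()


def find_near_duplicates(entities):
    """Group names that share an entity type and dedup key but differ in spelling."""
    groups = defaultdict(set)
    for e in entities:
        groups[(e["type"], dedup_key(e["name"]))].add(e["name"])
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_dangling_relationships(graph_data):
    """Return relationships whose 'from' or 'to' name is missing from the entities."""
    names = {e["name"] for e in graph_data.get("entities", [])}
    return [
        r for r in graph_data.get("relationships", [])
        if r["from"] not in names or r["to"] not in names
    ]


# Hand-written sample data in the same shape as the extraction output
sample = {
    "entities": [
        {"name": "Mehta Electronics", "type": "Supplier"},
        {"name": "mehta  electronics", "type": "Supplier"},
        {"name": "Mumbai", "type": "City"},
    ],
    "relationships": [
        {"from": "Mehta Electronics", "type": "LOCATED_IN", "to": "Mumbai"},
        {"from": "Mehta Electronics", "type": "SUPPLIES", "to": "Samsung QLED TV"},
    ],
}

duplicates = find_near_duplicates(sample["entities"])
dangling = find_dangling_relationships(sample)
print(duplicates)
print(dangling)
```

In this sample the two spellings of the supplier are reported as one duplicate group, and the SUPPLIES relationship is flagged because "Samsung QLED TV" never appears in the entity list. A batch job could log such findings for review instead of loading them.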
