tl  tr
  Home | Tutorials | Articles | Videos | Products | Tools | Search
Interviews | Open Source | Tag Cloud | Follow Us | Bookmark | Contact   
 Generative AI > Graph RAG > Building a Knowledge Graph from Documents with LLM Extraction

Building a Knowledge Graph from Documents with LLM Extraction

Author: Venkata Sudhakar

ShopMax India has hundreds of product manuals, warranty documents, and supplier agreements stored as text files. To enable Graph RAG, these documents must be converted into a structured knowledge graph of entities and relationships. An LLM automatically extracts entities like products, suppliers, warranty terms, and cities, and defines the relationships between them.

The extraction pipeline works in three steps: chunk the documents, prompt the LLM to extract entities and relationships from each chunk as JSON, and load the results into Neo4j. The LLM identifies entity types (Product, Supplier, City, WarrantyTerm) and relationship types (SUPPLIES, COVERS, LOCATED_IN). Duplicate entities are merged using MERGE statements in Cypher.

The below example shows how ShopMax India extracts a knowledge graph from product warranty documents using OpenAI with JSON mode and loads it into Neo4j.


It gives the following output,

{
  "entities": [
    {"name": "Mehta Electronics", "type": "Supplier"},
    {"name": "Samsung QLED TV", "type": "Product"},
    {"name": "Mumbai", "type": "City"},
    {"name": "2-year display warranty", "type": "WarrantyTerm"}
  ],
  "relationships": [
    {"from": "Mehta Electronics", "type": "SUPPLIES", "to": "Samsung QLED TV"},
    {"from": "Samsung QLED TV", "type": "LOCATED_IN", "to": "Mumbai"},
    {"from": "2-year display warranty", "type": "COVERS", "to": "Samsung QLED TV"}
  ]
}
Loaded to Neo4j

LLM extraction is imperfect - run entity deduplication after loading by normalizing names to title case and merging near-duplicates. For ShopMax India, process documents in batches overnight and use gpt-4o-mini for cost efficiency since extraction is not latency-sensitive. Store the source document ID on each node so you can trace which document produced each entity. Validate the graph after each batch by checking that no relationship references a missing entity.


 
  


  
bl  br