Towards Knowledge Graph Construction From Unstructured Text with LLMs, Triple Identification and Alignment to Wikidata

Authors

Salman, Muhammad

Abstract

The exponential growth of digital text has underscored the pivotal role of Knowledge Graphs (KGs) in structuring, managing, and deriving value from unstructured data. However, a vast portion of textual content remains unstructured, posing critical challenges for the automatic construction and enrichment of KGs, particularly in the accurate extraction and linking of knowledge triples. This thesis addresses these challenges by presenting a comprehensive framework for extracting high-quality subject-predicate-object triples from unstructured text and linking them to a structured KG. To tackle the complexity of natural language input, a novel preprocessing technique, Controlled Syntactic Simplification (CSynSim), is introduced. CSynSim analyses the syntactic structure of input sentences and applies a controlled "split and rewrite" strategy to simplify complex constructions while preserving semantic fidelity. To support systematic evaluation of the triple extraction task, a crowdsourced benchmark dataset, TinyButMighty, is developed. It comprises richly annotated compound and complex sentences, validated by expert ontologists. This dataset serves as a standard for assessing triple extraction systems, supported by tailored triple-similarity-based precision, recall, and F-measure metrics that evaluate the alignment between system outputs and human-annotated ground truth. Building on this foundation, the thesis proposes a baseline approach, Doc-KG, a rule-based pipeline that employs traditional NLP tools, such as semantic dependency parsing and SPARQL querying, to extract and align triples with Wikidata entities. Doc-KG includes a predicate mapping mechanism and is evaluated both quantitatively and qualitatively against existing approaches. The advent of LLMs such as GPT marked a significant shift in the methodology: their integration enhances the capacity to address the limitations of the earlier Doc-KG pipeline.
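The triple-similarity-based evaluation described above can be sketched as follows. The thesis's exact similarity function is not specified here, so this minimal sketch assumes a simple token-level Jaccard similarity averaged over the subject, predicate, and object, with a match threshold; all function names and the threshold value are illustrative.

```python
# Illustrative sketch of triple-similarity-based precision/recall/F-measure.
# The similarity function and threshold are assumptions, not the thesis's
# exact metric definitions.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def triple_similarity(t1, t2) -> float:
    """Average similarity over the subject, predicate, and object slots."""
    return sum(jaccard(x, y) for x, y in zip(t1, t2)) / 3

def prf(system, gold, threshold=0.8):
    """Precision, recall, F1: a triple counts as matched when it is
    sufficiently similar to some triple on the other side."""
    matched_sys = sum(
        1 for s in system
        if any(triple_similarity(s, g) >= threshold for g in gold)
    )
    matched_gold = sum(
        1 for g in gold
        if any(triple_similarity(s, g) >= threshold for s in system)
    )
    precision = matched_sys / len(system) if system else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this scheme, a system triple that paraphrases a gold triple closely enough still counts as correct, which is the point of using a similarity rather than exact string equality.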
The thesis introduces SALMON (Syntactically Analysed and LLM-Optimised Natural language), a hybrid framework that leverages LLMs for both triple extraction and entity linking. The integration of CSynSim as a preprocessing step significantly improves extraction accuracy, especially for structurally complex sentences. To address hallucination and ambiguity in LLM outputs, the thesis further proposes an LLM-SPARQL hybrid framework for Named Entity Linking (NEL) and Named Entity Disambiguation (NED). This approach harnesses SPARQL queries to retrieve candidate entities and employs LLMs to refine disambiguation decisions, thereby enhancing precision and reliability in linking textual mentions to Wikidata identifiers. In summary, this thesis presents an end-to-end pipeline that integrates syntactic simplification, rule-based processing, and large language models to advance the state of knowledge extraction from unstructured text. The resulting SALMON framework (https://w3id.org/salmon/) offers a modular and extensible approach to KG construction, with strong empirical performance and broader implications for natural language understanding, semantic technologies, and information integration.
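The LLM-SPARQL hybrid linking step can be sketched in two stages: a SPARQL query retrieves candidate Wikidata entities for a mention, and an LLM then selects the best-fitting candidate given the sentence context. This is a minimal sketch only; the query pattern (via the Wikidata Query Service's EntitySearch API) and the prompt wording are assumptions, not the thesis's exact implementation, and `llm` stands in for any prompt-to-answer callable.

```python
# Illustrative sketch of LLM-SPARQL hybrid entity linking against Wikidata.
# Query shape and prompt are assumptions; `llm` is any callable
# prompt -> answer string (e.g. a GPT API wrapper).

def candidate_query(mention: str, limit: int = 10) -> str:
    """Build a SPARQL query retrieving Wikidata candidates whose labels
    match the textual mention, using the EntitySearch MWAPI service."""
    return f'''
    SELECT ?item ?itemLabel ?itemDescription WHERE {{
      SERVICE wikibase:mwapi {{
        bd:serviceParam wikibase:endpoint "www.wikidata.org";
                        wikibase:api "EntitySearch";
                        mwapi:search "{mention}";
                        mwapi:language "en".
        ?item wikibase:apiOutputItem mwapi:item.
      }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}'''

def disambiguate(mention: str, context: str, candidates, llm) -> str:
    """Ask the LLM to pick the candidate QID that best fits the mention
    in its sentence context; candidates are (qid, label, description)."""
    listing = "\n".join(f"{qid}: {label} - {desc}"
                        for qid, label, desc in candidates)
    prompt = (
        f'Sentence: "{context}"\n'
        f'Mention: "{mention}"\n'
        f"Candidate Wikidata entities:\n{listing}\n"
        "Answer with the single QID that best matches the mention in context."
    )
    return llm(prompt).strip()
```

Constraining the LLM to choose among SPARQL-retrieved candidates, rather than generating an identifier freely, is what limits hallucinated QIDs in this design.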
