AI summary
The authors address the challenge of finding and connecting important information buried in many polymer research papers. They build two systems to help search and understand this information better: one uses vectors to find similar text (VectorRAG), and the other uses a structured knowledge graph to link concepts (GraphRAG). Their tests show GraphRAG is more precise and easier to interpret, while VectorRAG covers a wider range of information. Expert review confirmed that these tools give reliable, well-supported answers, helping researchers compare studies and spot patterns more easily. Overall, the authors provide a practical way to create trustworthy assistants for materials science literature using custom datasets and retrieval strategies.
Keywords
polyhydroxyalkanoate, retrieval-augmented generation, large language models, semantic embeddings, knowledge graph, entity disambiguation, multi-hop reasoning, information retrieval, materials science, natural language processing
Authors
Sonakshi Gupta, Akhlak Mahmood, Wei Xiong, Rampi Ramprasad
Abstract
Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer broader scientific questions. Retrieval-augmented generation (RAG) offers a promising way to overcome this limitation by combining large language models (LLMs) with external retrieval, but its effectiveness depends strongly on how domain knowledge is represented. In this work, we develop two retrieval pipelines: a dense semantic vector-based approach (VectorRAG) and a graph-based approach (GraphRAG). Using over 1,000 polyhydroxyalkanoate (PHA) papers, we construct context-preserving paragraph embeddings and a canonicalized structured knowledge graph supporting entity disambiguation and multi-hop reasoning. We evaluate these pipelines through standard retrieval metrics, comparisons with general state-of-the-art systems such as GPT and Gemini, and qualitative validation by a domain chemist. The results show that GraphRAG achieves higher precision and interpretability, while VectorRAG provides broader recall, highlighting complementary trade-offs. Expert validation further confirms that the tailored pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance. By grounding every statement in evidence, these systems enable researchers to navigate the literature, compare findings across studies, and uncover patterns that are difficult to extract manually. More broadly, this work establishes a practical framework for building materials science assistants using curated corpora and retrieval design, reducing reliance on proprietary models while enabling trustworthy literature analysis at scale.
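The two retrieval styles the abstract contrasts can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the corpus, graph, and entity names below are invented, and a bag-of-words cosine similarity stands in for real dense paragraph embeddings. It only shows the shape of the trade-off, with VectorRAG-style ranking retrieving broadly similar text while GraphRAG-style traversal follows explicit, citable edges for multi-hop questions.

```python
# Illustrative sketch only (toy data, not the paper's code).
from collections import Counter
from math import sqrt

def embed(text):
    """Toy 'embedding': lowercase bag-of-words counts (stand-in for a dense model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# VectorRAG-style retrieval: rank paragraphs by similarity to the query.
corpus = [
    "PHB exhibits a melting temperature near 175 C",
    "PHBV copolymers show improved ductility over PHB",
    "solvent casting was used to prepare polymer films",
]

def vector_retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# GraphRAG-style retrieval: follow canonicalized subject -> (relation, object)
# edges, so a multi-hop question yields an explicit chain of grounded facts.
graph = {
    "PHB": [("has_property", "melting temperature 175 C"),
            ("copolymerized_as", "PHBV")],
    "PHBV": [("has_property", "improved ductility")],
}

def graph_retrieve(entity, hops=2):
    facts, frontier = [], [entity]
    for _ in range(hops):
        nxt = []
        for node in frontier:
            for rel, obj in graph.get(node, []):
                facts.append((node, rel, obj))
                nxt.append(obj)
        frontier = nxt
    return facts

print(vector_retrieve("melting temperature of PHB"))
print(graph_retrieve("PHB"))
```

In this toy setup the vector path surfaces loosely related paragraphs (broader recall), while the graph path returns precise triples whose provenance is easy to trace, mirroring the precision/interpretability versus coverage trade-off the abstract reports.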