Authors:
(1) Yanpeng Ye, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia, GreenDynamics Pty. Ltd, Kensington, NSW, Australia, and these authors contributed equally to this work;
(2) Jie Ren, GreenDynamics Pty. Ltd, Kensington, NSW, Australia, Department of Materials Science and Engineering, City University of Hong Kong, Hong Kong, China, and these authors contributed equally to this work;
(3) Shaozhou Wang, GreenDynamics Pty. Ltd, Kensington, NSW, Australia ([email protected]);
(4) Yuwei Wan, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China;
(5) Imran Razzak, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia;
(6) Tong Xie, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]);
(7) Wenjie Zhang, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]).
In this study, we introduce a new NLP pipeline for KG construction, which aims to efficiently extract triples from unstructured scientific texts. The key feature of the method is that only a small amount of annotated data is needed to fine-tune the LLM, which is then used to extract structured information from a large corpus of unstructured text. The entire process does not rely on any prediction, which maximizes the authenticity and traceability of the structured information. Using this method, we construct a Functional Material Knowledge Graph (FMKG) containing materials and their related knowledge, drawn from the abstracts of 150,000 peer-reviewed papers. Our analysis demonstrates the effectiveness and credibility of the FMKG.
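The extraction step of such a pipeline can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the pipe-delimited output format, the relation names, and the `parse_triples` helper are all assumptions about how a fine-tuned LLM's structured output might be post-processed into KG triples.

```python
# Minimal sketch of turning a fine-tuned LLM's structured output into
# (head, relation, tail) triples for KG construction. The model call is
# omitted; we assume it emits one "head | relation | tail" line per fact.

def parse_triples(llm_output: str):
    """Parse pipe-delimited lines into (head, relation, tail) triples,
    discarding malformed or incomplete lines."""
    triples = []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):  # keep only well-formed lines
            triples.append(tuple(parts))
    return triples

# Hypothetical output for a solar-cell abstract:
sample = """\
perovskite | has_application | solar cell
CH3NH3PbI3 | has_property | band gap
band gap | has_value | 1.55 eV
"""

for head, rel, tail in parse_triples(sample):
    print((head, rel, tail))
```

Because every triple is parsed directly from the model's output over the source abstract, rather than inferred by a separate link-prediction step, each edge in the graph can be traced back to the text it came from.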
In addition, our method and the FMKG hold promise along several dimensions. Firstly, extending structured information extraction from abstracts to entire research papers promises a richer, more detailed knowledge graph; this involves not only expanding the scope of data analyzed but also refining the process to capture nuances within complex scientific texts. Secondly, refining the entity labels within our system allows for more precise categorization of data, including detailed attributes such as synthesis conditions or property parameters, which would significantly improve the granularity and utility of the knowledge graph. Thirdly, the versatility of our NLP pipeline suggests its applicability across different scientific domains, offering a template for constructing domain-specific knowledge graphs beyond materials science. Lastly, integrating the FMKG with existing knowledge graphs such as MatKG opens avenues for creating a more interconnected and comprehensive dataset, facilitating advanced research and application development in materials science and beyond.
Venugopal, V. & Olivetti, E. Matkg: An autonomously generated knowledge graph in material science. Sci. Data 11, 217 (2024).
Jain, A. et al. The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD). JOM 65, 1501–1509 (2013).
Draxl, C. & Scheffler, M. The nomad laboratory: from data sharing to artificial intelligence. J. Physics: Mater. 2, 036001 (2019).
Mrdjenovich, D. et al. Propnet: a knowledge graph for materials science. Matter 2, 464–480 (2020).
Ji, S., Pan, S., Cambria, E., Marttinen, P. & Yu, P. S. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks Learn. Syst. 33, 494–514 (2022).
Zhang, J., Chen, B., Zhang, L., Ke, X. & Ding, H. Neural, symbolic and neural-symbolic reasoning on knowledge graphs. AI Open 2, 14–35 (2021).
Mitchell, T. et al. Never-ending learning. Commun. ACM 61, 103–115 (2018).
Zhong, L., Wu, J., Li, Q., Peng, H. & Wu, X. A comprehensive survey on automatic knowledge graph construction. ACM Comput. Surv. 56 (2023).
Pan, S. et al. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowl. Data Eng. 1–20 (2024).
Weston, L. et al. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59, 3692–3702 (2019).
Zhang, X., Liu, X., Li, X. & Pan, D. Mmkg: An approach to generate metallic materials knowledge graph based on dbpedia and wikipedia. Comput. Phys. Commun. 211, 98–112 (2017).
Nie, Z. et al. Automating materials exploration with a semantic knowledge graph for li-ion battery cathodes. Adv. Funct. Mater. 32, 2201437 (2022).
An, Y. et al. Knowledge graph question answering for materials science (kgqa4mat): Developing natural language interface for metal-organic frameworks knowledge graph (mof-kg). arXiv preprint arXiv:2309.11361 (2023).
Venugopal, V. & Olivetti, E. MatKG-2: Unveiling precise material science ontology through autonomous committees of LLMs. AI for Accel. Mater. Des. – NeurIPS 2023 Work. (2023).
Su, P., Li, G., Wu, C. & Vijay-Shanker, K. Using distant supervision to augment manually annotated data for relation extraction. PLOS ONE 14 (2019).
Sousa, R. T., Silva, S. & Pesquita, C. Explainable representations for relation prediction in knowledge graphs. arXiv preprint arXiv:2306.12687 (2023).
Brown, T. et al. Language models are few-shot learners. Adv. neural information processing systems 33, 1877–1901 (2020).
Touvron, H. et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Xie, T. et al. Creation of a structured solar cell material dataset and performance prediction using large language models. Patterns (2024).
Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
Swain, M. C. & Cole, J. M. Chemdataextractor: A toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
Xie, T. et al. Darwin series: Domain specific large language models for natural science. arXiv preprint arXiv:2308.13565 (2023).