Authors:
(1) Yanpeng Ye, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia and GreenDynamics Pty. Ltd, Kensington, NSW, Australia (these authors contributed equally to this work);
(2) Jie Ren, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and Department of Materials Science and Engineering, City University of Hong Kong, Hong Kong, China (these authors contributed equally to this work);
(3) Shaozhou Wang, GreenDynamics Pty. Ltd, Kensington, NSW, Australia ([email protected]);
(4) Yuwei Wan, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China;
(5) Imran Razzak, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia;
(6) Tong Xie, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]);
(7) Wenjie Zhang, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]).
Editor’s note: This article is part of a broader study. You’re reading Part 3 of 9.
Data preparation and schema design
Materials experts annotated nine distinct categories in 75 research-paper abstracts, which serve as the training dataset for the LLM. The core labels, “Name,” “Formula,” and “Acronym,” each identify a specific material and do so consistently across papers, albeit in varied forms. The supplementary labels are “Descriptor,” “Structure/Phase,” “Application,” “Property,” “Synthesis,” and “Characterization.” The “Property” label marks quantitative properties such as ‘length,’ ‘specific surface area,’ and ‘mass,’ whereas the “Descriptor” label captures qualitative attributes such as ‘stable,’ ‘vertically,’ and ‘safe.’ Because developers working on a given application care most about the properties relevant to it, we also annotated as “Property” the key parameters directly linked to a specific “Application.” For instance, ‘specific capacity’ for lithium-ion batteries, ‘energy conversion efficiency’ for solar cells, and ‘H2 production rate’ for the hydrogen evolution reaction were systematically annotated.
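To make the labelling scheme concrete, the following is a minimal sketch of how one annotated abstract could be represented. The `Annotation` class, its methods, and the example DOI and entity values are hypothetical and chosen for illustration; only the nine label names come from the text.

```python
from dataclasses import dataclass, field

# The nine labels described in the text; core labels identify the material itself.
CORE_LABELS = ("Name", "Formula", "Acronym")
SUPPLEMENTARY_LABELS = (
    "Descriptor", "Structure/Phase", "Application",
    "Property", "Synthesis", "Characterization",
)
ALL_LABELS = CORE_LABELS + SUPPLEMENTARY_LABELS


@dataclass
class Annotation:
    """Hypothetical container for the labelled entities of one abstract."""
    doi: str
    entities: dict = field(default_factory=dict)  # label -> list of entity strings

    def add(self, label: str, text: str) -> None:
        # Reject anything outside the nine-label schema.
        if label not in ALL_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        self.entities.setdefault(label, []).append(text)

    def has_core_label(self) -> bool:
        # True if the abstract names its material by Name, Formula, or Acronym.
        return any(self.entities.get(label) for label in CORE_LABELS)


# Illustrative values only; the DOI and entities are made up for this sketch.
example = Annotation(doi="10.0000/example")
example.add("Formula", "LiFePO4")
example.add("Application", "lithium-ion battery")
example.add("Property", "specific capacity")
example.add("Descriptor", "stable")
```

Keeping the core labels in a separate tuple makes it easy to check that every annotated abstract identifies its material in at least one of the three forms.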
As Figure 2 shows, the central “Material” node is pivotal and connects to nodes that describe its nomenclature, composition, and attributes. Specifically, “Material” consists of “Formula,” “Name,” and “Acronym.” It connects to “Structure/Phase,” which describes its physical form; “Application,” which denotes its practical uses; and “Property,” which outlines its inherent characteristics. The “Material” node is also associated with a “Descriptor” node that provides additional qualitative information. Moreover, the “Application” node branches out to “Property,” “Descriptor,” and “Domain” nodes, which further specify the application and give its context. To record the source of nodes and relations, each node is connected to a “Digital Object Identifier (DOI)” node. To determine the source article from which a relation is derived, we query the intersection of the “DOI” neighbors of the two nodes at either end of the relation.
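The DOI-intersection lookup described above can be sketched with a plain adjacency map. This is an illustrative sketch, not the authors' implementation: the function names, the tuple encoding of nodes, and the DOIs are all assumptions made for the example.

```python
from collections import defaultdict

# Undirected adjacency map: node -> set of neighbouring nodes.
# Nodes are (type, value) tuples, e.g. ("Material", "LiFePO4").
graph = defaultdict(set)


def add_edge(a, b):
    graph[a].add(b)
    graph[b].add(a)


def add_relation(head, tail, doi):
    """Store a relation and link both endpoints to the DOI node of its source."""
    doi_node = ("DOI", doi)
    add_edge(head, tail)
    add_edge(head, doi_node)
    add_edge(tail, doi_node)


def relation_sources(head, tail):
    """Return the DOIs that neighbour both endpoints of a relation."""
    shared = graph.get(head, set()) & graph.get(tail, set())
    return {value for kind, value in shared if kind == "DOI"}


# Made-up DOIs, for illustration only.
material = ("Material", "LiFePO4")
application = ("Application", "lithium-ion battery")
prop = ("Property", "specific capacity")

add_relation(material, application, "10.0000/example-a")
add_relation(material, application, "10.0000/example-b")
add_relation(material, prop, "10.0000/example-c")

sources = relation_sources(material, application)
```

Here `sources` holds the two DOIs attached to both the material and the application. Note that in this sketch the intersection returns every DOI linked to both endpoints, so it identifies articles in which both entities appear rather than storing provenance on the edge itself.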
Given the inherent complexity of sentences and the variability in terminology across abstracts, a normalization process was applied after the initial extraction. This normalization ensured a uniform representation of entities with similar meanings. For example, terms such as ‘Lithium-Ion Battery’ and ‘Li-ion batteries’ were standardized to ‘lithium-ion battery,’ while phrases like ‘solution casting method,’ ‘solvent post-treatment method,’ and ‘solution-based deposition’ were simplified to ‘solution-processed.’ All annotated entities underwent this normalization, ensuring consistency and facilitating effective training of LLMs.
The field of functional materials encompasses a wide range of areas. At this stage, our priority is energy materials. We downloaded 150,000 abstracts of peer-reviewed research articles in energy materials science, covering batteries, solar cells, and catalysts, from the Web of Science. Each abstract was stored in JSON format, structured as “DOI – text,” to facilitate processing and analysis.
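As a rough illustration of the normalization step and the “DOI – text” storage, the sketch below uses a small lookup table built from the examples in the text. The table, the `normalize` function, the fallback behaviour for unseen terms, and the exact JSON layout (a single object mapping a DOI to its abstract text) are assumptions; the source does not specify how normalization was implemented or how the JSON is keyed.

```python
import json

# Hypothetical lookup table built from the examples given in the text.
CANONICAL = {
    "lithium-ion battery": "lithium-ion battery",
    "li-ion batteries": "lithium-ion battery",
    "solution casting method": "solution-processed",
    "solvent post-treatment method": "solution-processed",
    "solution-based deposition": "solution-processed",
}


def normalize(entity: str) -> str:
    """Map an extracted entity to its canonical form.

    Assumption: unseen terms fall back to a lower-cased,
    whitespace-collapsed version of themselves.
    """
    key = " ".join(entity.lower().split())
    return CANONICAL.get(key, key)


# Assumed "DOI - text" layout: one JSON object keyed by DOI.
# The DOI and abstract text are placeholders.
record = {"10.0000/example": "Placeholder abstract text."}
serialized = json.dumps(record)
```

For example, `normalize("Lithium-Ion Battery")` and `normalize("Li-ion batteries")` both return `"lithium-ion battery"`. A fixed table like this only covers variants seen in advance, so a real pipeline would need some way to handle new surface forms.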