Title Cheminių junginių kristalografinės ir kristalocheminės informacijos išgavimas iš mokslinių straipsnių
Translation of Title Extracting crystallographic information about chemical compounds from research papers.
Authors Kovtun, Tomas
Pages 49
Abstract [eng] This work focuses on automating the extraction of crystallographic information from scientific articles. It reviews applicable natural language processing methods and existing solutions for similar tasks, and presents a new compound-parameter association method based on extractive question answering models. It also describes a method that automatically annotates training and validation datasets using data from the Crystallography Open Database. Three implementations of BERT-architecture language models were trained on the resulting datasets. The pipeline based on the BioBERT language model achieved the highest precision (91.9 precision and 63.7 recall). Although the resulting solution extracts compound-related parameters with sufficient precision, several aspects of crystallographic texts are still not properly addressed (e.g. crystallographic parameters reported under different pressure and temperature conditions). Some of these aspects can be handled by further expanding and improving the question-answering model.
Dissertation Institution Vilniaus universitetas.
Type Master's thesis
Language Lithuanian
Publication date 2023