Title Hiperparametrų optimizavimo metodai virpesinės spektrometrijos duomenų analizėje
Translation of Title Hyperparameter optimization methods for vibrational spectroscopy data analysis.
Authors Stakauskas, Brendonas
Pages 66
Abstract [eng] Analyzing vibrational spectroscopy data requires a solid understanding of both spectroscopy and data analysis. Spectroscopic data may contain unwanted artifacts (e.g. noise, scattering), which make a dataset harder to analyze. Removing these artifacts calls for preprocessing methods, many of which take additional parameters. Spectroscopic data also contains many collinear variables, so both the informative features and the machine learning model must be chosen carefully. Data preprocessing, variable selection, and the machine learning model together make up the data analysis pipeline. The pipeline's degrees of freedom (which methods are combined, their order in the pipeline, and each method's parameters) cause a combinatorial explosion, which makes it hard to find an adequate pipeline for a given task. The aim of this master's thesis is to find a method suitable for automatic hyperparameter search of analysis models for vibrational spectroscopy data. The methods examined are based on genetic optimization (TPOT) and random search over neural network architectures (AutoKeras). The work first focused on methods commonly used in vibrational spectroscopy data analysis: optimization tasks were built from various combinations of these methods, and the parameters of the genetic search itself were varied as well. Subsequent experiments used more generic machine learning models (e.g. decision trees, k-NN) as the search space for the pipeline search. This search was run not on the full dataset but only on the features retained after applying a variable selection algorithm. The final set of experiments concerned neural networks: a simple CNN model was trained and compared with the model found by random search.
The datasets used in this work were taken from published articles, which allows a meaningful comparison of results. They comprise MIR (FT-IR) spectra of fruit purees (classification, 0.9350 accuracy), Raman spectra of tablets (regression, 0.56 RMSE), and NIR spectra of frozen and thawed chicken (classification, 0.8760 accuracy). Combining the uninformative variable elimination algorithm with TPOT (using a search space of basic machine learning methods) was found to yield better models (purees: 0.9573 accuracy; tablets: 0.2769 RMSE). An optional step was identified that makes it possible to build a good pipeline for the chicken dataset (0.9333 accuracy). Compared with TPOT, the results obtained with AutoKeras were poor or negligible. Although the computing time of the search was not evaluated in this work, more complex models were not considered because of their longer training times. The feasibility of parallelizing the search should be explored: successful parallelization could make it practical to include more complex machine learning methods in the search space.
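The combinatorial explosion mentioned in the abstract can be illustrated with a small counting sketch. The method names and parameter grids below are hypothetical placeholders, not the search space used in the thesis; the sketch only assumes that a pipeline is any ordered subset of preprocessing steps followed by exactly one model.

```python
from itertools import permutations

# Hypothetical preprocessing methods and parameter grids (illustrative only,
# not the search space used in the thesis).
PREPROCESSING = {
    "savgol_smoothing": {"window": [5, 11, 21], "polyorder": [2, 3]},
    "snv": {},
    "baseline_correction": {"degree": [1, 2, 3]},
}

# Hypothetical models and parameter grids (illustrative only).
MODELS = {
    "knn": {"n_neighbors": [1, 3, 5, 7]},
    "decision_tree": {"max_depth": [3, 5, 10]},
}


def n_settings(grid):
    """Number of parameter combinations for one method (1 if it has none)."""
    count = 1
    for values in grid.values():
        count *= len(values)
    return count


def count_pipelines(preprocessing, models):
    """Count pipelines: any ordered subset of preprocessing steps, then one model."""
    model_total = sum(n_settings(grid) for grid in models.values())
    prep_total = 0
    for k in range(len(preprocessing) + 1):
        # Order matters, so iterate over permutations rather than combinations.
        for order in permutations(preprocessing, k):
            configs = 1
            for name in order:
                configs *= n_settings(preprocessing[name])
            prep_total += configs
    return prep_total * model_total


total = count_pipelines(PREPROCESSING, MODELS)
print(total)
```

Even this toy space of three preprocessing methods and two models contains 1211 distinct candidate pipelines. Each added method or parameter value multiplies the count, which is why exhaustive search quickly becomes impractical and heuristic searches such as genetic optimization are considered instead.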
Dissertation Institution Vilniaus universitetas.
Type Master thesis
Language Lithuanian
Publication date 2021