Title Naujos kartos sekoskaitos duomenų automatinio analizės algoritmo bei skirtingų praturtinimo sistemų įvertinimas, panaudojant Sanger sekoskaitą
Translation of Title Evaluation of an automated next generation sequencing data analysis pipeline and different enrichment systems using Sanger sequencing
Authors Žukauskaitė, Gabrielė
Pages 52
Abstract [eng] DNA sequencing is one of the main methods for identifying genome variants. Next generation sequencing (NGS) is increasingly used in various life science fields because of its high efficiency and its ability to sequence an exome or even a whole genome. However, Sanger sequencing is still used to identify genome variants: it is considered more accurate, and its data are easier to analyze. For these reasons, Sanger sequencing is used to validate NGS data. There is an ongoing debate in the scientific community about whether NGS findings should be verified by Sanger sequencing. Researchers also disagree about the efficiency of different NGS enrichment systems; publications present differing views, which shows that the question remains unresolved. The aim of this work was to evaluate the automated data analysis pipeline of the NGS SOLiD platform by determining its accuracy, sensitivity and specificity, and to evaluate the target enrichment systems used (TargetSeq and SureSelect) by means of Sanger sequencing and statistical and bioinformatic tools. To evaluate the reliability of the SOLiD data analysis pipeline, sixty genome variants detected by NGS were investigated using Sanger sequencing and other analysis tools; the total test group consisted of 96 subjects. Six variants (in the SKA3, CRACR2A, ANKRD62, BMS1, BAGE2 and CFP genes) were identified incorrectly or could not be identified at all because of limitations or errors of Sanger sequencing. Meanwhile, two variants (in the TMPRSS15 and FAM105A genes) were erroneously identified by the automated data analysis pipeline of the NGS SOLiD platform. These results indicate that the Sanger method has more flaws than NGS when identifying variants and should not be considered the "gold standard". NGS data were obtained using two different target enrichment systems, SureSelect and TargetSeq.
They were compared by one of the main NGS parameters, coverage. The mean coverage obtained with the SureSelect and TargetSeq enrichment systems was 32.77 and 31.58, respectively. Based on these values, the coverage of the SureSelect enrichment system is 3.65 % higher. However, the nonparametric Wilcoxon test showed that the difference between the means is not statistically significant. For two unrelated persons, NGS was performed using both enrichment systems. More genome variants were identified using the SureSelect enrichment system; however, the TargetSeq system also identified unique variants. Consequently, the most effective way to identify genome variants accurately is to use both enrichment systems. The mean coverage was 35.82 % higher with the SureSelect system than with TargetSeq, although the sample size is too small to confirm this by statistical analysis. Finally, the accuracy, sensitivity and specificity of the automated NGS SOLiD data analysis pipeline were estimated to be 99.66 %, 99.22 % and 99.7 %, respectively. Based on these results, it can be concluded that there is no need to verify NGS data using Sanger sequencing when the performance parameters of the automated analysis algorithm are high.
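The abstract reports accuracy, sensitivity and specificity but does not say how they were computed. The sketch below uses the standard confusion-matrix definitions of these three metrics, which may or may not match the thesis exactly. The class name `ConfusionCounts`, the function names, and the counts in the usage lines are hypothetical illustrations and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Variant-call counts against a reference (hypothetical container)."""
    tp: int  # true positives: variants called and confirmed
    tn: int  # true negatives: positions correctly called as non-variant
    fp: int  # false positives: variants called but not confirmed
    fn: int  # false negatives: real variants that were missed


def _ratio(numerator: int, denominator: int) -> float:
    # Guard against an empty category rather than dividing by zero.
    if denominator == 0:
        raise ValueError("metric is undefined: denominator is zero")
    return numerator / denominator


def accuracy(c: ConfusionCounts) -> float:
    # Share of all calls (variant or not) that were correct.
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    # Share of real variants that were detected.
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Share of non-variant positions that were correctly left uncalled.
    return _ratio(c.tn, c.tn + c.fp)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    counts = ConfusionCounts(tp=90, tn=880, fp=20, fn=10)
    print(f"accuracy:    {accuracy(counts):.2%}")
    print(f"sensitivity: {sensitivity(counts):.2%}")
    print(f"specificity: {specificity(counts):.2%}")
```

Reporting the three values together is useful because accuracy alone can look high when non-variant positions far outnumber variants, while sensitivity and specificity show the two kinds of error separately.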
Dissertation Institution Vilniaus universitetas.
Type Master thesis
Language Lithuanian
Publication date 2017