Usage of non-probability sample and scraped data to estimate proportions

Vilma Nekrašaitė-Liegė; Andrius Čiginas; Danutė Krapavickaitė

Title	Usage of non-probability sample and scraped data to estimate proportions
Authors	Nekrašaitė-Liegė, Vilma ; Čiginas, Andrius ; Krapavickaitė, Danutė
ISBN	9789985746301
Is Part of	Baltic-Nordic-Ukrainian workshop on survey statistics 2022, August 23-26, 2022, Tartu, Estonia. Tartu : Statistics Estonia, 2022. p. 51-52. ISBN 9789985746301
Keywords [eng]	big data ; coverage bias ; post-stratification ; calibration weighting ; accuracy estimation
Abstract [eng]	An increasing number of data sources raises the task of integrating them with the traditional data sources used in official statistics. One of the problems under study at Statistics Lithuania is to revise some indicators and to find out whether their accuracy can be improved using data from additional sources. The proportion of enterprises that have a website is one such indicator. Traditionally, it is estimated using data from the Information and Communication Technology (ICT) sample survey. Information about enterprise website ownership is also provided by a private company. However, this data source is updated on a voluntary basis and has some drawbacks: it does not cover the whole population, so an estimator based on this data source alone is expected to be biased (Tam and Kim, 2018). Another way to create a list of enterprises owning websites is web scraping (ESSnet Big Data I, ESSnet Big Data II). Following a common methodology, ten candidate URLs are found for each enterprise in the population using a search engine. A logistic regression model is used to estimate the probability that a selected URL is the website of the particular enterprise. If this probability reaches a fixed threshold, the enterprise is concluded to own the website; otherwise, the opposite conclusion is drawn. However, other research reports that the accuracy of such an enterprise classification is around 59-89 percent and depends on the search engine, the training sample, etc. Therefore, it may seem that the collection of website data through the ICT survey cannot be abandoned. Nevertheless, combining different sources may lead to more efficient estimators; see Beaumont (2020), Kim and Tam (2021) and Rao (2021), among others. In this research, a number of methods for integrating auxiliary data obtained from alternative sources with the survey data for bias adjustment are examined. The integration leads to more efficient estimators in comparison with estimators based only on the survey data. The accuracy measures of the estimators considered are evaluated.
Published	Tartu : Statistics Estonia, 2022
Type	Conference paper
Language	English
Publication date	2022
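
The two ideas described in the abstract, threshold classification of scraped URLs and combining an incomplete source with survey data, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the toy numbers are hypothetical, the URL probabilities are assumed to come from an already fitted logistic regression model, and the combined estimator is a simple post-stratified form in the spirit of the data integration literature cited above, assuming the external source records website ownership without error.

```python
def owns_website(url_probs, threshold=0.5):
    """Classify an enterprise as a website owner if the highest estimated
    probability among its candidate URLs reaches the threshold.
    (The threshold value is a hypothetical choice.)"""
    return max(url_probs, default=0.0) >= threshold


def survey_only_proportion(sample):
    """Weighted ratio estimate of the proportion from the survey alone.
    `sample` is a list of (design_weight, y, in_source) tuples, where y is 1
    if the enterprise has a website and in_source marks units that are also
    covered by the external source."""
    total_weight = sum(w for w, _, _ in sample)
    return sum(w * y for w, y, _ in sample) / total_weight


def integrated_proportion(sample, n_population, n_source, source_total_y):
    """Post-stratified combination (an assumed, simplified form):
    the total for the covered part is taken from the source as known,
    and the uncovered part is estimated from survey units outside the source."""
    weight_out = sum(w for w, _, d in sample if not d)
    if weight_out == 0:
        raise ValueError("no sampled units outside the external source")
    mean_out = sum(w * y for w, y, d in sample if not d) / weight_out
    uncovered_total = (n_population - n_source) * mean_out
    return (source_total_y + uncovered_total) / n_population


# Toy data: four sampled enterprises with equal design weights.
toy_sample = [(10, 1, True), (10, 0, False), (10, 1, False), (10, 0, False)]
p_survey = survey_only_proportion(toy_sample)
p_combined = integrated_proportion(
    toy_sample, n_population=40, n_source=16, source_total_y=16
)
print(p_survey, p_combined)
```

In this toy setting the external source covers 16 of 40 enterprises, all of them website owners, so the combined estimate uses that known total directly and relies on the survey only for the 24 uncovered enterprises.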
