Abstract
Estimating prevalence under rare-event conditions is a persistent challenge in large-scale content moderation and monitoring systems, where observations are often collected through nonrandom, model-assisted sampling pipelines. This thesis investigates how design choices across such pipelines (sampling strategies, classifier performance, and decision thresholds) jointly affect prevalence estimation accuracy. Using a simulation-based framework, the estimation process is decomposed into modular components and their contributions to bias and variance are analyzed systematically. Multiple sampling schemes, including simple random sampling and classifier-assisted approaches with human moderation, are evaluated across controlled parameter grids that reflect realistic operational constraints. Rather than optimizing a single configuration, the study characterizes trade-offs and interaction effects between components. Performance is assessed using bias, variance, and error-decomposition metrics, enabling comparison of how uncertainty propagates through the pipeline. Results indicate that targeted sampling can substantially reduce variance but may introduce non-negligible bias when classifier error rates or decision thresholds are misaligned with true prevalence levels, whereas more conservative designs produce stable but less efficient estimates. Overall, the findings demonstrate that no single parameter choice is universally optimal; reliable prevalence estimation instead depends on coherent coordination between sampling and modeling decisions. The proposed framework offers a structured approach for analyzing such systems and provides practical guidance for designing robust prevalence estimation pipelines under rare-event conditions.
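The trade-off described above, that classifier-assisted (targeted) sampling lowers variance relative to simple random sampling while remaining sensitive to classifier error rates, can be illustrated with a minimal Monte Carlo sketch. This is not the thesis's actual framework; the population size, prevalence, sensitivity, specificity, and allocation rule below are illustrative assumptions, and the targeted estimator is a standard stratified (Horvitz-Thompson style) reweighting over classifier-flagged and unflagged strata.

```python
import random

def simulate(pop_size=20000, prevalence=0.01, n_sample=500,
             sensitivity=0.9, specificity=0.95, reps=200, seed=0):
    """Compare SRS vs classifier-assisted stratified sampling for
    rare-event prevalence estimation (illustrative sketch only).

    Returns ((srs_bias, srs_var), (strat_bias, strat_var))."""
    rng = random.Random(seed)
    srs_est, strat_est = [], []
    for _ in range(reps):
        # Ground-truth labels and imperfect classifier flags
        # (assumed sensitivity/specificity are hypothetical).
        labels = [rng.random() < prevalence for _ in range(pop_size)]
        flags = [(rng.random() < sensitivity) if y
                 else (rng.random() > specificity) for y in labels]
        # Simple random sampling: unbiased but high variance when
        # positives are rare.
        idx = rng.sample(range(pop_size), n_sample)
        srs_est.append(sum(labels[i] for i in idx) / n_sample)
        # Classifier-assisted: oversample the flagged stratum, then
        # reweight each stratum by its population share.
        pos = [i for i, f in enumerate(flags) if f]
        neg = [i for i, f in enumerate(flags) if not f]
        n_pos = min(len(pos), n_sample // 2)
        s_pos = rng.sample(pos, n_pos)
        s_neg = rng.sample(neg, min(n_sample - n_pos, len(neg)))
        est = (sum(labels[i] for i in s_pos) / max(n_pos, 1) * len(pos) +
               sum(labels[i] for i in s_neg) / max(len(s_neg), 1) * len(neg)
               ) / pop_size
        strat_est.append(est)
    def bias_var(xs):
        m = sum(xs) / len(xs)
        return m - prevalence, sum((x - m) ** 2 for x in xs) / len(xs)
    return bias_var(srs_est), bias_var(strat_est)
```

Under these assumptions, both estimators are approximately unbiased, but the stratified design concentrates labeling effort where positives are likely and so yields a markedly smaller variance; degrading `sensitivity` or `specificity` shifts the stratum composition and erodes that advantage, mirroring the misalignment effect discussed in the abstract.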