Abstract
Estimating prevalence under rare-event conditions is a persistent challenge in large-scale content moderation and monitoring systems, where observations are often collected through nonrandom, model-assisted sampling pipelines. This thesis investigates how design choices across such pipelines (sampling strategies, classifier performance, and decision thresholds) jointly affect prevalence estimation accuracy. Using a simulation-based framework, the estimation process is decomposed into modular components and their contributions to bias and variance are analyzed systematically. Multiple sampling schemes, including simple random sampling and classifier-assisted approaches with human moderation, are evaluated across controlled parameter grids that reflect realistic operational constraints. Rather than optimizing a single configuration, the study characterizes trade-offs and interaction effects between components. Performance is assessed using bias, variance, and error-decomposition metrics, enabling comparison of how uncertainty propagates through the pipeline. Results indicate that targeted sampling can substantially reduce variance but may introduce non-negligible bias when classifier error rates or decision thresholds are misaligned with true prevalence levels, whereas more conservative designs produce stable but less efficient estimates. Overall, the findings demonstrate that no single parameter choice is universally optimal; reliable prevalence estimation instead depends on coherent coordination between sampling and modeling decisions. The proposed framework offers a structured approach for analyzing such systems and provides practical guidance for designing robust prevalence estimation pipelines under rare-event conditions.
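The trade-off described above, that classifier-assisted (targeted) sampling lowers variance relative to simple random sampling while remaining sensitive to classifier error rates, can be illustrated with a minimal Monte Carlo sketch. This is not the thesis's actual framework; the population size, prevalence, sensitivity, specificity, and allocation rule below are illustrative assumptions, and the targeted estimator is a standard stratified (Horvitz-Thompson style) reweighting over classifier-flagged and unflagged strata.

```python
import random

def simulate(pop_size=20000, prevalence=0.01, n_sample=500,
             sensitivity=0.9, specificity=0.95, reps=200, seed=0):
    """Compare SRS vs classifier-assisted stratified sampling for
    rare-event prevalence estimation (illustrative sketch only).

    Returns ((srs_bias, srs_var), (strat_bias, strat_var))."""
    rng = random.Random(seed)
    srs_est, strat_est = [], []
    for _ in range(reps):
        # Ground-truth labels and imperfect classifier flags
        # (assumed sensitivity/specificity are hypothetical).
        labels = [rng.random() < prevalence for _ in range(pop_size)]
        flags = [(rng.random() < sensitivity) if y
                 else (rng.random() > specificity) for y in labels]
        # Simple random sampling: unbiased but high variance when
        # positives are rare.
        idx = rng.sample(range(pop_size), n_sample)
        srs_est.append(sum(labels[i] for i in idx) / n_sample)
        # Classifier-assisted: oversample the flagged stratum, then
        # reweight each stratum by its population share.
        pos = [i for i, f in enumerate(flags) if f]
        neg = [i for i, f in enumerate(flags) if not f]
        n_pos = min(len(pos), n_sample // 2)
        s_pos = rng.sample(pos, n_pos)
        s_neg = rng.sample(neg, min(n_sample - n_pos, len(neg)))
        est = (sum(labels[i] for i in s_pos) / max(n_pos, 1) * len(pos) +
               sum(labels[i] for i in s_neg) / max(len(s_neg), 1) * len(neg)
               ) / pop_size
        strat_est.append(est)
    def bias_var(xs):
        m = sum(xs) / len(xs)
        return m - prevalence, sum((x - m) ** 2 for x in xs) / len(xs)
    return bias_var(srs_est), bias_var(strat_est)
```

Under these assumptions, both estimators are approximately unbiased, but the stratified design concentrates labeling effort where positives are likely and so yields a markedly smaller variance; degrading `sensitivity` or `specificity` shifts the stratum composition and erodes that advantage, mirroring the misalignment effect discussed in the abstract.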