Identifying and correcting bias in big crowd-sourced online genealogies
Michael Chong (1), Diego Alburez-Gutierrez (2), Emanuele Del Fava (2), Monica Alexander (1), Emilio Zagheni (2)
(1) University of Toronto, (2) Max Planck Institute for Demographic Research
Human societies have long valued genealogies as repositories of important historical information. However, historical research has shown that genealogies are flawed and should not be taken at face value. In recent years, online communities have produced large genealogies that connect all continents and span multiple centuries. We present the first attempt to characterize and account for systematic biases affecting the life events recorded in online genealogies. Using a Bayesian model and data from the Human Mortality Database (HMD), we document systematic under-reporting of mortality in four European countries, especially at young and old ages. Our out-of-sample estimates of age-specific mortality rates in Finland (1835-1900) are consistent with HMD estimates but slightly lower. For Finland in 1895, life expectancy at birth from our model (46.4 years for men and 49.4 years for women) is likewise consistent with, though slightly above, the HMD values (44.6 and 48.3 years). We are working on leveraging the network properties of genealogies to improve our models.
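To make the link between age-specific mortality rates and life expectancy at birth concrete, the following is a minimal sketch of a standard period life table calculation. It is not the Bayesian model described above. The function name `life_expectancy_at_birth`, the single-year age groups, the assumption that deaths occur on average halfway through each interval (`ax = 0.5`), and the toy constant mortality schedule are all illustrative assumptions rather than details taken from this work.

```python
def life_expectancy_at_birth(mx, ax=0.5):
    """Life expectancy at birth from single-year age-specific death rates.

    mx: death rates for ages 0, 1, ..., with the last entry treated as an
        open-ended age interval (hypothetical input layout).
    ax: assumed average fraction of the interval lived by those who die in it.
    """
    if not mx or mx[-1] <= 0:
        raise ValueError("the open-ended interval needs a positive death rate")

    survivors = 1.0      # l_x, starting from a radix of 1
    person_years = 0.0   # running sum of L_x

    for m in mx[:-1]:
        # Convert the death rate into a probability of dying in the interval.
        q = min(m / (1.0 + (1.0 - ax) * m), 1.0)
        deaths = survivors * q
        # Person-years lived: survivors of the interval plus the share
        # contributed by those who die during it.
        person_years += survivors - (1.0 - ax) * deaths
        survivors -= deaths

    # Open-ended interval: everyone still alive dies at rate mx[-1].
    person_years += survivors / mx[-1]
    return person_years


# Toy schedule: a constant death rate of 0.02 at every age (illustrative only).
true_rates = [0.02] * 101
# Hypothetical under-reporting: only 90% of deaths are recorded at each age.
recorded_rates = [0.9 * m for m in true_rates]

e0_true = life_expectancy_at_birth(true_rates)        # close to 50.0 (= 1 / 0.02)
e0_recorded = life_expectancy_at_birth(recorded_rates)
print(e0_true, e0_recorded)
```

In this toy setting, under-reported death rates translate directly into a higher computed life expectancy, which is the direction of the gap between the model-based and HMD figures quoted above.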