Identifying and correcting bias in big crowd-sourced online genealogies
Michael Chong¹, Diego Alburez-Gutierrez², Emanuele Del Fava², Monica Alexander¹, Emilio Zagheni²
¹University of Toronto, ²Max Planck Institute for Demographic Research
Human societies have long valued genealogies as repositories of important historical information. However, historical research has shown that genealogies are flawed and should not be taken at face value. In recent years, online communities have produced big genealogies that span all continents and multiple centuries. We present the first attempt to characterize and account for the systematic biases affecting the life events recorded in online genealogies. Using a Bayesian model and data from the Human Mortality Database (HMD), we document systematic under-reporting of mortality in four European countries, especially at young and old ages. Our out-of-sample estimates of age-specific mortality rates in Finland (1835-1900) are consistent with HMD estimates but slightly lower. Correspondingly, life expectancy at birth in Finland in 1895 from our model (46.4 and 49.4 years for men and women, respectively) is consistent with, though slightly above, the HMD values (44.6 and 48.3 years). We are working on leveraging the network properties of genealogies to improve our models.
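The comparison of life expectancy at birth rests on converting age-specific mortality rates into a life table. The sketch below illustrates that conversion only and is not the Bayesian model described above. It assumes single-year age groups, deaths occurring on average halfway through each interval, and an open-ended final age group. The function name `life_expectancy_at_birth` and the input rates are hypothetical and chosen for illustration; they are not HMD or genealogy data.

```python
def life_expectancy_at_birth(mx, ax=0.5):
    """Life expectancy at birth from single-year age-specific mortality rates.

    mx: mortality rates for ages 0, 1, 2, ...; the last entry is treated
        as an open-ended age group.
    ax: assumed average fraction of the year lived by those who die in
        an interval (0.5 is a common simplification, not the source's choice).
    """
    if not mx or mx[-1] <= 0:
        raise ValueError("mx must be non-empty with a positive final rate")

    lx = 1.0            # proportion surviving to the start of the interval
    person_years = 0.0  # cumulative person-years lived

    for m in mx[:-1]:
        qx = m / (1.0 + (1.0 - ax) * m)  # probability of dying in the interval
        dx = lx * qx                     # deaths in the interval
        person_years += lx - (1.0 - ax) * dx  # person-years lived in the interval
        lx -= dx

    # Open-ended last age group: everyone remaining dies at rate mx[-1].
    person_years += lx / mx[-1]
    return person_years


# Illustrative check with a constant (hypothetical) rate of 0.02 per year:
# under a constant rate, life expectancy equals 1 / rate, i.e. about 50 years.
e0 = life_expectancy_at_birth([0.02] * 100)
print(f"{e0:.2f}")
```

Because life expectancy falls as mortality rates rise, under-reported mortality pushes the resulting life expectancy upward, which matches the direction of the difference between the model and HMD values reported above.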