This week’s talk will be given by Sayash Kapoor.
The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls in applied ML research. We examine the use of ML methods in the subfield of civil war prediction in political science. Several recent studies published in top political science journals that claim superior performance of ML models over logistic regression models fail to reproduce because of data leakage. Results identifying pitfalls in studies that use ML methods have appeared in at least 17 quantitative science fields. We argue that there is a reproducibility crisis brewing in research fields that use ML methods, and that the main vector of irreproducibility is data leakage. We provide a taxonomy of data leakage and discuss how it has led to irreproducible results in ML research. We conclude with open questions about how leakage is defined and what makes ML results legitimate.