Video available here.
Predicting life outcomes is a challenging task even for advanced machine learning (ML) algorithms. At the same time, accurately predicting these outcomes has important implications for providing targeted assistance and improving policy making. Recent studies based on the Fragile Families and Child Wellbeing Study dataset have shown that complex ML pipelines, even in the presence of thousands of variables, produce low-quality predictions. This research raises several questions about the predictability of life outcomes: 1) What factors influence the predictability of an outcome (e.g., quality of data, pre-processing steps, model hyperparameters)? 2) How does the predictability of outcomes vary by domain (e.g., are health outcomes easier to predict than education outcomes)? To answer these questions, we are building a cloud-based system to train and test hundreds of ML pipelines on thousands of life outcomes. We use the results of this large-scale exploration in a data-driven way to understand the predictability of life outcomes.
In the first part of the talk, we discuss the study design and describe the system we built to run such a large-scale exploration. The system is general and provides easy-to-use interfaces for running a wide range of studies. In the second part, we present a meta-learning-inspired method to derive key insights into the problem of predictability by A) comparing the relative predictive power of different classes of models and B) using the descriptive statistics that best predict the predictability of ML pipelines. Predictability of life outcomes is a multi-faceted problem. We conclude the talk by briefly discussing some of our other studies that are currently in the pipeline.
Pranay Anchuri is a data scientist supported by the DataX fund at CITP. His research interests include graph mining, large-scale data analytics, and blockchain technologies. Pranay graduated with a Ph.D. in computer science from Rensselaer Polytechnic Institute in 2015. During his graduate studies, he worked at various labs including IBM, Yahoo, and QCRI. His thesis focused on developing algorithms for efficiently extracting frequent patterns from noisy networks.
After graduation, Pranay started as a research scientist at NEC Labs, Princeton, working on log modeling and analytics. Most recently, he worked as a research scientist at Axoni, NY, where his research focused on problems related to the implementation of high-performance permissioned blockchains.
This seminar is co-sponsored by CITP and the Center for Statistics and Machine Learning.