An unlikely scandal engulfed the British government last month. After COVID-19 forced the government to cancel the “A-level” exams that help determine university admission, the British education regulator used an algorithm to predict what score each student would have received on their exam. The algorithm relied in part on how the school’s students had historically fared on the exam. Schools with richer children tended to have better track records, so the algorithm gave affluent students — even those on track for the same grades as poor students — much higher predicted scores. High-achieving, low-income pupils whose schools had not previously performed well were hit particularly hard. After threats of legal action and widespread demonstrations, the government backed down and scrapped the algorithmic grading process entirely. This wasn’t an isolated incident: In the United States, similar issues plagued the International Baccalaureate exam, which used an opaque artificial intelligence system to set students' scores, prompting protests from thousands of students and parents.
These episodes highlight some of the pitfalls of algorithmic decision-making. As technology advances, companies, governments, and other organizations are increasingly relying on algorithms to predict important social outcomes, using them to allocate jobs, forecast crime, and even try to prevent child abuse. These technologies promise to increase efficiency, enable more targeted policy interventions, and eliminate human imperfections from decision-making processes. But critics worry that opaque machine learning systems will in fact reflect and further perpetuate shortcomings in how organizations typically function — including by entrenching the racial, class, and gender biases of the societies that develop these systems. When courts and parole boards have used algorithms to forecast criminal behavior, for example, they have inaccurately identified Black defendants as future criminals more often than their white counterparts. Predictive policing systems, meanwhile, have led the police to unfairly target neighborhoods with a high proportion of non-white people, regardless of the true crime rate in those areas. Companies that have used recruitment algorithms have found that they amplify bias against women.
But there is an even more basic concern about algorithmic decision-making. Even in the absence of systematic class or racial bias, what if algorithms struggle to make even remotely accurate predictions about the trajectories of individuals' lives? That concern gains new support from a recent paper published in the Proceedings of the National Academy of Sciences. The paper describes a challenge, organized by a group of sociologists at Princeton University, involving 160 research teams from universities across the country and hundreds of researchers in total, including one of the authors of this article. These teams were tasked with analyzing data from the Fragile Families and Child Wellbeing Study, an ongoing study that has tracked thousands of families with children born in large American cities around 2000, measuring a wide range of life outcomes over time. It is one of the richest data sets available to researchers and has been used in more than 750 scientific papers.
The task for the teams was simple. They were given access to almost all of this data and asked to predict several important life outcomes for a sample of families. Those outcomes included the child’s grade point average, their “grit” (a commonly used measure of passion and perseverance), whether the household would be evicted, the material hardship of the household, and whether the parent would lose their job.
The teams could draw on almost 13,000 predictor variables for each family, covering areas such as education, employment, income, family relationships, environmental factors, and child health and development. The researchers were also given access to the outcomes for half of the sample, and they could use this data to hone advanced machine-learning algorithms to predict each of the outcomes for the other half of the sample, which the organizers withheld. At the end of the challenge, the organizers scored the 160 submissions based on how well the algorithms predicted what actually happened in these people’s lives.
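The Fragile Families data are restricted, so the sketch below uses synthetic stand-in data. It illustrates the kind of evaluation the challenge used: fit a model on the half of the sample with visible outcomes, predict the held-out half, and score the predictions by how much they improve on the naive baseline of always guessing the training-set mean (a holdout R² near zero means the model is barely better than that baseline). The specific model and data here are hypothetical, not the challenge's own.

```python
import numpy as np

# Hypothetical stand-in data; the real Fragile Families predictors are restricted.
rng = np.random.default_rng(0)
n, p = 1000, 20
X = rng.normal(size=(n, p))
# Outcome only weakly related to the predictors, mimicking hard-to-predict life outcomes.
y = 0.1 * X[:, 0] + rng.normal(size=n)

# Split into a training half (outcomes visible) and a held-out half (outcomes withheld).
X_train, X_hold = X[: n // 2], X[n // 2 :]
y_train, y_hold = y[: n // 2], y[n // 2 :]

# Fit ordinary least squares on the training half (with an intercept column).
A = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Predict the held-out half.
pred = np.column_stack([np.ones(len(X_hold)), X_hold]) @ beta

# Holdout R^2: improvement over the naive baseline of predicting the training mean.
baseline = np.full(len(y_hold), y_train.mean())
r2_holdout = 1 - np.sum((y_hold - pred) ** 2) / np.sum((y_hold - baseline) ** 2)
print(round(r2_holdout, 3))
```

Because the simulated outcome is mostly noise, the holdout R² lands close to zero, which is roughly the situation the challenge participants faced even with far richer data and far more sophisticated models.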
The results were disappointing. Even the best-performing prediction models were only marginally better than random guesses. The models rarely came close to a student’s actual GPA, for example, and they were even worse at predicting whether a family would be evicted, experience unemployment, or face material hardship. And the models gave almost no insight into how much grit a child would display.
In other words, even having access to incredibly detailed data and modern machine learning methods designed for prediction did not enable the researchers to make accurate forecasts. “The results of the Fragile Families Challenge,” the authors conclude, with notable understatement, “raise questions about the absolute level of predictive performance that is possible for some life outcomes, even with a rich data set.”
Of course, machine learning systems may be much more accurate in other domains; this paper studied the predictability of life outcomes in only one setting. But the failure to make accurate predictions cannot be blamed on the failings of any particular analyst or method. Hundreds of researchers attempted the challenge, using a wide range of statistical techniques, and they all failed.
These findings give us reason to doubt that “big data” can ever perfectly predict human behavior, and they suggest that policymakers working in criminal justice and child-protective services should be especially cautious. Even with detailed data and sophisticated prediction techniques, there may be fundamental limits on researchers' ability to make accurate predictions. Human behavior is inherently unpredictable, social systems are complex, and the actions of individuals often defy expectations.
And yet disappointing as this may be for technocrats and data scientists, it also suggests something reassuring about human potential. If life outcomes are not firmly pre-determined — if an algorithm, given a set of past data points, cannot predict a person’s trajectory — then the algorithm’s limitations ultimately reflect the richness of humanity’s possibilities.
Bryan Schonfeld and Sam Winter-Levy are PhD candidates in politics at Princeton University.