New study may help uncover the true impact of childhood lead exposure

Data scientist Joe Feldman developed sophisticated statistical models to better understand the link between childhood lead exposure and standardized test scores.

Lead exposure in childhood may be even more dangerous for cognitive development and school performance than previously thought, according to a new analysis led by Joe Feldman, an assistant professor of statistics and data science.

Lead exposure in children most often comes from deteriorating lead-based paint, contaminated soil, or old water pipes — hazards that remain in many U.S. communities.


High levels of lead in a child’s bloodstream have long been known to impair intellectual ability. But like many other real-world datasets, the data establishing the link between lead exposure and cognitive development are messy and incomplete. “It’s clear that lead is dangerous,” Feldman said. “But the magnitude of that association has been hard to estimate because many children are never tested for exposure, which means many data points are missing.”

To better understand the risk, Feldman and colleagues — Jerome Reiter of Duke University and WashU alum Daniel Kowal (AB '12), now at Cornell University — analyzed data from 170,000 fourth-grade students in North Carolina with the goal of linking lead exposure to end-of-grade standardized test scores. “Although standardized test scores are a flawed metric, they are important proxies for child development and are strongly correlated with academic milestones in high school and beyond,” Feldman said.

Complicating the analysis, data on lead exposure were missing for about 35% of these children because the state of North Carolina only mandates testing if a child is thought to be at risk, perhaps because their house or neighborhood has lead pipes. “The missing values for lead exposure aren’t random,” Feldman said. “In statistics, we call this type of missing data ‘nonignorable.’ We have to address these gaps to see the full picture.”
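The consequence of this testing policy can be seen in a small simulation. The sketch below is purely illustrative — the population numbers and the testing rule are invented, not drawn from the study — but it shows why a sample of tested children is unrepresentative when testing targets children thought to be at risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration (all numbers invented): simulate blood lead
# levels, in ug/dL, for 100,000 children.
lead = rng.normal(3.0, 1.5, 100_000)

# Nonignorable missingness: testing is mandated only for children thought
# to be at risk, so higher-lead children are more likely to be tested.
p_tested = 1 / (1 + np.exp(-(lead - 3.0)))
tested = rng.random(lead.size) < p_tested

print(f"population mean lead: {lead.mean():.2f}")
print(f"mean among tested:    {lead[tested].mean():.2f}")  # biased upward
```

Because the probability of being tested rises with the lead level itself, the tested subsample systematically overstates exposure — exactly the kind of gap that cannot simply be ignored or averaged away.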

As reported in the journal Bayesian Analysis, the team used sophisticated statistical tools to reach an unsettling conclusion: If all kids had been checked for lead levels, the association between lead exposure and academic test scores would be even stronger than previously suspected. “We used our model to predict missing lead values to form complete datasets. When we analyzed these completed datasets, we found a significantly stronger relationship between lead exposure and test scores,” Feldman said. “We seem to have been underestimating the adverse impact of lead exposure on childhood educational achievement.”

To estimate lead levels in students who hadn’t been screened, the researchers consulted published statistics on population-level lead exposure in children from the Centers for Disease Control and Prevention. They then used Bayesian statistical modeling — a type of analysis often used to draw conclusions from incomplete datasets — to fill in the missing lead measurements. “Our model essentially balances the information in the observed data with the published CDC statistics, which helps inform plausible predictions for the missing values,” Feldman said. 
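The idea of letting published population statistics inform the missing values can be sketched in a few lines. This is a simplified stand-in for the team's model, not the model itself — the "CDC" prior parameters and the simulated data are invented for illustration — but it shows the basic multiple-imputation mechanics of drawing plausible values from a prior and averaging over the completed datasets:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: simulate lead levels; children thought to be at risk
# (here, those with higher lead) are more likely to have been tested.
n = 50_000
lead = rng.normal(3.0, 1.5, n)
tested = rng.random(n) < 1 / (1 + np.exp(-(lead - 3.0)))
lead_obs = np.where(tested, lead, np.nan)

# Published population-level statistics (standing in for CDC figures;
# these values are invented) serve as a prior for the untested children.
prior_mean, prior_sd = 3.0, 1.5

# Multiple imputation: repeatedly draw plausible values for the missing
# entries from the prior, then average estimates over completed datasets.
estimates = []
for _ in range(20):
    completed = lead_obs.copy()
    completed[~tested] = rng.normal(prior_mean, prior_sd, (~tested).sum())
    estimates.append(completed.mean())

print(f"tested-only mean:    {np.nanmean(lead_obs):.2f}")  # overstates exposure
print(f"completed-data mean: {np.mean(estimates):.2f}")    # closer to the truth
```

A full Bayesian treatment would also propagate the uncertainty in each imputed value into the final estimates, rather than reporting a single averaged number; the loop above only hints at that machinery.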

The study highlights the need for broader lead testing and measures to reduce exposure. It also shows the value of revisiting incomplete data. “Bayesian analysis is powerful because it allows us to account for the uncertainty caused by missing data. However, models can only learn from observed data,” Feldman said. “Building a statistical model that can simultaneously leverage unobserved information while also accounting for the other complexities in the data was a serious challenge.”

Feldman is applying similar tools to evaluate the effectiveness of medical treatments for depression. “Electronic health records provide a trove of information, but the data are very messy and incomplete,” he said. If a patient responds well to medication, their doctor may stop measuring or recording their symptoms, leaving gaps. Simultaneously, there is abundant external information — from clinical trials and other analyses — on the efficacy of different medicines. “We’re trying to develop models that can integrate this external information to better understand the missing data,” he said.

The same general approach could help clarify many other questions that are complicated by missing data. “Statistical models should not be constrained by the lack of information in a particular dataset,” Feldman said. “Our work allows users to easily integrate external information to improve decision-making and public health strategies.”

Header image credit: Rubidium Beach/Pexels