Machine learning predicts highest-risk groundwater sites to improve water quality monitoring

An interdisciplinary team of researchers has developed a machine learning framework that uses limited water quality samples to predict which inorganic pollutants are likely to be present in a groundwater supply. The new tool allows regulators and public health authorities to prioritize specific aquifers for water quality testing.

This proof-of-concept work focused on Arizona and North Carolina but could be applied to fill critical gaps in groundwater quality data in any region.

Groundwater is a source of drinking water for millions and often contains pollutants that pose health risks. However, many regions lack complete groundwater quality datasets.

“Monitoring water quality is time-consuming and expensive, and the more pollutants you test for, the more time-consuming and expensive it is,” says Yaroslava Yingling, co-corresponding author of a paper describing the work and Kobe Steel Distinguished Professor of Materials Science and Engineering at North Carolina State University.

“As a result, there is interest in identifying which groundwater supplies should be prioritized for testing, maximizing limited monitoring resources,” Yingling says. “We know that naturally occurring pollutants, such as arsenic or lead, tend to occur in conjunction with other specific elements due to geological and environmental factors. This posed an important data question: with limited water quality data for a groundwater supply, could we predict the presence and concentrations of other pollutants?”

“Along with identifying elements that pose a risk to human health, we also wanted to see if we could predict the presence of other elements – such as phosphorus – which can be beneficial in agricultural contexts but may pose environmental risks in other settings,” says Alexey Gulyuk, a co-first author of the paper and a teaching professor of materials science and engineering at NC State.

To address this challenge, the researchers drew on a huge data set, encompassing more than 140 years of water quality monitoring data for groundwater in the states of North Carolina and Arizona. Altogether, the data set included more than 20 million data points, covering more than 50 water quality parameters.

“We used this data set to ‘train’ a machine learning model to predict which elements would be present based on the available water quality data,” says Akhlak Ul Mahmood, co-first author of this work and a former Ph.D. student at NC State. “In other words, if we only have data on a handful of parameters, the program could still predict which inorganic pollutants were likely to be in the water, as well as how abundant those pollutants are likely to be.”
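
To make the idea concrete, here is a minimal sketch of predicting an unmeasured parameter from one that was measured. It is a deliberately simplified stand-in: it uses a single least-squares line, not the AMELIA or MICE multiple-imputation methods used in the study, and the parameter names, values and helper functions (`fit_line`, `impute_column`) are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def impute_column(rows, target, predictor):
    """Return copies of rows with missing `target` values (None) filled
    from `predictor`, using a line fitted on the complete rows."""
    complete = [r for r in rows
                if r[target] is not None and r[predictor] is not None]
    a, b = fit_line([r[predictor] for r in complete],
                    [r[target] for r in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the original samples stay untouched
        if r[target] is None and r[predictor] is not None:
            r[target] = a + b * r[predictor]
        filled.append(r)
    return filled


# Made-up samples: the last one was never tested for arsenic.
samples = [
    {"fluoride": 1.0, "arsenic": 2.0},
    {"fluoride": 2.0, "arsenic": 4.0},
    {"fluoride": 3.0, "arsenic": 6.0},
    {"fluoride": 4.0, "arsenic": None},
]
filled = impute_column(samples, "arsenic", "fluoride")
print(filled[-1]["arsenic"])
```

Multiple-imputation methods go further than this sketch: they draw on many parameters at once and generate several plausible completed data sets, so the uncertainty in each filled-in value can be assessed.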

One key finding of the study is that the model suggests pollutants are exceeding drinking water standards in more groundwater sources than previously documented. While field data indicated that 75% to 80% of sampled locations were within safe limits, the machine learning framework predicts that only 15% to 55% of the sites may truly be risk-free.
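
The gap between those two ranges follows from how missing measurements are counted. A value that was never measured cannot register as exceeding a limit, so a sparsely tested site can look safe by default. The short sketch below illustrates that arithmetic. The thresholds, sample values and the `share_within_limits` helper are placeholders invented for illustration, not figures or code from the study.

```python
# Placeholder thresholds for illustration only.
LIMITS = {"arsenic": 10.0, "fluoride": 4.0}


def share_within_limits(samples, limits):
    """Fraction of samples with no known value above its limit.
    Missing values (None) are skipped, so they never count as exceedances."""
    ok = 0
    for s in samples:
        if all(s.get(name) is None or s[name] <= limit
               for name, limit in limits.items()):
            ok += 1
    return ok / len(samples)


# Made-up field data with gaps (None = not tested).
field = [
    {"arsenic": 3.0, "fluoride": None},
    {"arsenic": None, "fluoride": 1.0},
    {"arsenic": 12.0, "fluoride": 2.0},
    {"arsenic": 4.0, "fluoride": 1.5},
]

# The same samples after the gaps are filled with made-up predicted values.
imputed = [
    {"arsenic": 3.0, "fluoride": 5.2},
    {"arsenic": 11.0, "fluoride": 1.0},
    {"arsenic": 12.0, "fluoride": 2.0},
    {"arsenic": 4.0, "fluoride": 1.5},
]

print(share_within_limits(field, LIMITS))    # 0.75
print(share_within_limits(imputed, LIMITS))  # 0.25
```

In this toy case, three of four samples look safe in the field data, but only one does once the untested parameters are filled in. The filled-in values are predictions, which is why such sites are flagged for follow-up testing rather than treated as confirmed exceedances.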

“As a result, we’ve identified quite a few groundwater sites that should be prioritized for additional testing,” says Minhazul Islam, co-first author of the paper and a Ph.D. student at Arizona State University. “By identifying potential ‘hot spots,’ state agencies and municipalities can strategically allocate resources to high-risk areas, ensuring more targeted sampling and effective water treatment solutions.”

“It’s extremely promising and we think it works well,” Gulyuk says. “However, the real test will be when we begin using the model in the real world and seeing if the prediction accuracy holds up.”

Moving forward, researchers plan to enhance the model by expanding its training data across diverse U.S. regions; integrating new data sources, such as environmental data layers, to address emerging contaminants; and conducting real-world testing to ensure robust, targeted groundwater safety measures worldwide.

“We see tremendous potential in this approach,” says Paul Westerhoff, co-corresponding author and Regents’ Professor in the School of Sustainable Engineering and the Built Environment at ASU. “By continuously improving its accuracy and expanding its reach, we’re laying the groundwork for proactive water safety measures across the globe.”

“This model also offers a promising tool for tracking phosphorus levels in groundwater, helping us identify and address potential contamination risks more efficiently,” says Jacob Jones, director of the National Science Foundation-funded Science and Technologies for Phosphorus Sustainability (STEPS) Center at NC State, which helped fund this work. “Looking ahead, extending this model to support broader phosphorus sustainability could have a significant impact, enabling us to manage this critical nutrient across various ecosystems and agricultural systems, ultimately fostering more sustainable practices.”

The paper, “Multiple Data Imputation Methods Advance Risk Analysis and Treatability of Co-occurring Inorganic Chemicals in Groundwater,” is published open access in the journal Environmental Science & Technology. The paper was co-authored by Emily Briese and Mohit Malu, both Ph.D. students at Arizona State; Carmen Velasco, a former postdoctoral researcher at Arizona State; Naushita Sharma, a postdoctoral researcher at Oak Ridge National Laboratory; and Andreas Spanias, a professor of digital signal processing at Arizona State.

This work was supported by the NSF STEPS Center and by the Metals and Metal Mixtures: Cognitive Aging, Remediation and Exposure Sources (MEMCARE) Superfund Research Center based at Harvard University, which is supported by the National Institute of Environmental Health Sciences under grant P42ES030990.

-shipman-

Note to Editors: The study abstract follows.

“Multiple Data Imputation Methods Advance Risk Analysis and Treatability of Co-occurring Inorganic Chemicals in Groundwater”

Authors: Akhlak U. Mahmood, Alexey V. Gulyuk and Yaroslava G. Yingling, North Carolina State University; Minhazul Islam, Emily Briese, Carmen A. Velasco, Mohit Malu, Naushita Sharma, Andreas Spanias and Paul Westerhoff, Arizona State University

Published: Nov. 7, Environmental Science & Technology

DOI: 10.1021/acs.est.4c05203

Abstract: Accurately assessing and managing risks associated with inorganic pollutants in groundwater is imperative. Historic water quality databases are often sparse due to rationale or financial budgets for sample collection and analysis, posing challenges in evaluating exposure or water treatment effectiveness. We utilized and compared two advanced multiple data imputation techniques, AMELIA and MICE algorithms, to fill gaps in sparse groundwater quality data sets. AMELIA outperformed MICE in handling missing values, as MICE tended to overestimate certain values, resulting in more outliers. Field data sets revealed that 75% to 80% of samples exhibited no co-occurring regulated pollutants surpassing MCL values, whereas imputed values showed only 15% to 55% of the samples posed no health risks. Imputed data unveiled a significant increase, ranging from 2 to 5 times, in the number of sampling locations predicted to potentially exceed health-based limits and identified samples where 2 to 6 co-occurring chemicals may occur and surpass health-based levels. Linking imputed data to sampling locations can pinpoint potential hotspots of elevated chemical levels and guide optimal resource allocation for additional field sampling and chemical analysis. With this approach, further analysis of complete data sets allows state agencies authorized to conduct groundwater monitoring, often with limited financial resources, to prioritize sampling locations and chemicals to be tested. Given existing data and time constraints, it is crucial to identify the most strategic use of the available resources to address data gaps effectively. This work establishes a framework to enhance the beneficial impact of funding groundwater data collection by reducing uncertainty in prioritizing future sampling locations and chemical analyses.

This post was originally published in NC State News.