
FAU's CA-AI Makes AI Smarter by Cleaning Up Bad Data Before It Learns



By Gisele Galoustian | 6/12/2025

In the world of machine learning and artificial intelligence, clean data is everything. Even a small number of mislabeled examples, known as label noise, can derail the performance of a model, especially models like Support Vector Machines (SVMs) that rely on a few key data points to make decisions.

SVMs are a widely used type of machine learning algorithm, applied in everything from image and speech recognition to medical diagnostics and text classification. These models operate by finding a boundary that best separates different categories of data. They rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundaries can be flawed, leading to poor performance on real-world data.
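To make that reliance on support vectors concrete, here is a minimal sketch using scikit-learn (not code from the study): it fits a linear SVM to two synthetic clusters, counts how few points become support vectors, then flips the label of one of them and retrains. The data, random seed and linear kernel are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of how few points an SVM's
# boundary depends on, and how one flipped label among them matters.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two 2-D Gaussian clusters, 50 points per class.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
print(f"{len(clf.support_)} of {len(X)} training points are support vectors")

# Flip the label of a single support vector and retrain: the boundary,
# and accuracy measured against the clean labels, can shift.
y_flipped = y.copy()
i = clf.support_[0]                       # index of one support vector
y_flipped[i] = 1 - y_flipped[i]
clf_noisy = SVC(kernel="linear").fit(X, y_flipped)
print("accuracy vs. clean labels after one flip:", clf_noisy.score(X, y))
```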

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University (FAU) and collaborators have developed an innovative method to automatically detect and remove faulty labels before a model is ever trained, making AI smarter, faster and more reliable. Before the AI even starts learning, the researchers clean the data using a math technique that looks for odd or unusual examples that don't quite fit. These "outliers" are removed or flagged, making sure the AI gets high-quality information right from the start.

"SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering," said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow. "What makes them especially effective, but also uniquely vulnerable, is that they rely on just a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled (for example, if a malignant tumor is incorrectly marked as benign), it can distort the model's entire understanding of the problem. The consequences of that could be serious, whether it's a missed cancer diagnosis or a security system that fails to flag a threat. Our work is about protecting models, any machine learning or AI model including SVMs, from these hidden dangers by identifying and removing those mislabeled cases before they can do harm."

The data-driven method that "cleans" the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class purely based on how well they fit with the rest of the group.

"Data points that appear to deviate significantly from the rest, often due to label errors, are flagged and removed," said Pados. "Unlike many existing techniques, this process requires no manual tuning or user intervention and can be applied to any AI model, making it both scalable and practical."

The process is robust, efficient and entirely touch-free, even handling the notoriously tricky task of rank selection (which determines how many dimensions to keep during analysis) without user input.
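The paper's algorithm itself is not reproduced in this article, so the sketch below only illustrates the general idea under stated assumptions: each class is centered, a single L1-norm principal component is computed with a Kwak-style fixed-point iteration (maximizing the L1 norm of the projections), and a fixed fraction of the worst-fitting points in each class is flagged. The rank-1 subspace, the trimming fraction and the residual-based flagging rule are all assumptions; as noted above, the actual method chooses the rank automatically and needs no such tuning.

```python
# A rough per-class cleaning sketch in the spirit of the description above.
# Assumptions (not from the paper): one L1-norm principal component per
# class, found by a Kwak-style fixed-point iteration, and a fixed trimming
# fraction of worst-fitting points.
import numpy as np

def l1_pc(X, n_iter=200):
    """One L1-norm principal component of row-data X: a local maximizer
    of sum_i |x_i . w| over unit vectors w (fixed-point iteration)."""
    w = X[np.argmax(np.linalg.norm(X, axis=1))]
    w = w / np.linalg.norm(w)
    for _ in range(n_iter):
        s = np.sign(X @ w)
        s[s == 0] = 1.0
        w_new = X.T @ s
        w_new = w_new / np.linalg.norm(w_new)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w

def clean_labels(X, y, trim_frac=0.05):
    """Within each class, flag the points that fit its L1 subspace worst."""
    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        Xc = X[idx] - X[idx].mean(axis=0)          # center the class
        w = l1_pc(Xc)
        # residual distance of each point from the class's rank-1 subspace
        resid = np.linalg.norm(Xc - np.outer(Xc @ w, w), axis=1)
        n_drop = max(1, int(trim_frac * len(idx)))
        keep[idx[np.argsort(resid)[-n_drop:]]] = False   # suspected mislabels
    return keep
```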

The researchers extensively tested their technique on real and synthetic datasets with various levels of label contamination. Across the board, it produced consistent and notable improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.

"What makes our approach particularly compelling is its flexibility," said Pados. "It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it's not just theoretical: extensive testing on both noisy and clean datasets, including well-known benchmarks like the Wisconsin Breast Cancer dataset, showed consistent improvements in classification accuracy. Even in cases where the original training data appeared flawless, our new method still enhanced performance, suggesting that subtle, hidden label noise may be more common than previously thought."
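As a purely illustrative usage example, the clean_labels sketch above can be placed in front of an SVM on scikit-learn's copy of the Wisconsin Breast Cancer dataset with artificially flipped labels. The 10% noise rate, seed and linear kernel are arbitrary choices, and the numbers this prints are not the paper's results.

```python
# Hypothetical end-to-end use of the clean_labels sketch above; mirrors
# the shape of the evaluation described here, not the paper's experiments.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.1 * len(y_tr)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]                  # inject 10% label noise

baseline = SVC(kernel="linear").fit(X_tr, y_noisy)
keep = clean_labels(X_tr, y_noisy, trim_frac=0.1)  # sketch from above
cleaned = SVC(kernel="linear").fit(X_tr[keep], y_noisy[keep])
print("SVM on noisy labels:", baseline.score(X_te, y_te))
print("SVM after cleaning :", cleaned.score(X_te, y_te))
```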

Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science, such as reducing data bias and improving the completeness of datasets.

"As machine learning becomes deeply integrated into high-stakes domains like health care, finance and the justice system, the integrity of the data driving these models has never been more important," said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science. "We're asking algorithms to make decisions that impact real lives: diagnosing diseases, evaluating loan applications, even informing legal judgments. If the training data is flawed, the consequences can be devastating. That's why innovations like this are so critical. By improving data quality at the source, before the model is even trained, we're not just making AI more accurate; we're making it more responsible. This work represents a meaningful step toward building AI systems we can trust to perform fairly, reliably and ethically in the real world."

This work will appear in the Institute of Electrical and Electronics Engineers' (IEEE) Transactions on Neural Networks and Learning Systems. Co-authors, who are all IEEE members, are Shruti Shukla, a Ph.D. student in the CA-AI and the FAU Department of Electrical Engineering and Computer Science; George Sklivanitis, Ph.D., Charles E. Schmidt Research Associate Professor in the CA-AI and the Department of Electrical Engineering and Computer Science and an I-SENSE faculty fellow; Elizabeth Serena Bentley, Ph.D.; and Michael J. Medley, Ph.D., United States Air Force Research Laboratory.


Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow.

-FAU-