
FAU's CA-AI Makes AI Smarter by Cleaning Up Bad Data Before It Learns



By Gisele Galoustian | 6/12/2025

In the world of machine learning and artificial intelligence, clean data is everything. Even a small number of mislabeled examples, known as label noise, can derail the performance of a model, especially models like Support Vector Machines (SVMs) that rely on a few key data points to make decisions.

SVMs are a widely used type of machine learning algorithm, applied in everything from image and speech recognition to medical diagnostics and text classification. These models operate by finding a boundary that best separates different categories of data. They rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundaries can be flawed, leading to poor performance on real-world data.
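To make that reliance on support vectors concrete, here is a minimal sketch using scikit-learn (not code from the study): it fits a linear SVM to two synthetic clusters, counts how few points become support vectors, then flips the label of one of them and retrains. The data, random seed and linear kernel are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of how few points an SVM's
# boundary depends on, and how one flipped label among them matters.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two 2-D Gaussian clusters, 50 points per class.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
print(f"{len(clf.support_)} of {len(X)} training points are support vectors")

# Flip the label of a single support vector and retrain: the boundary,
# and accuracy measured against the clean labels, can shift.
y_flipped = y.copy()
i = clf.support_[0]                       # index of one support vector
y_flipped[i] = 1 - y_flipped[i]
clf_noisy = SVC(kernel="linear").fit(X, y_flipped)
print("accuracy vs. clean labels after one flip:", clf_noisy.score(X, y))
```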

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University (FAU) and collaborators have developed an innovative method to automatically detect and remove faulty labels before a model is ever trained, making AI smarter, faster and more reliable. Before the AI even starts learning, the researchers clean the data using a math technique that looks for odd or unusual examples that don't quite fit. These "outliers" are removed or flagged, making sure the AI gets high-quality information right from the start.

"SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering," said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow. "What makes them especially effective, but also uniquely vulnerable, is that they rely on just a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled (for example, if a malignant tumor is incorrectly marked as benign), it can distort the model's entire understanding of the problem. The consequences of that could be serious, whether it's a missed cancer diagnosis or a security system that fails to flag a threat. Our work is about protecting models, any machine learning or AI model including SVMs, from these hidden dangers by identifying and removing those mislabeled cases before they can do harm."

The data-driven method that "cleans" the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class purely based on how well they fit with the rest of the group.

"Data points that appear to deviate significantly from the rest, often due to label errors, are flagged and removed," said Pados. "Unlike many existing techniques, this process requires no manual tuning or user intervention and can be applied to any AI model, making it both scalable and practical."

The process is robust, efficient and entirely touch-free, even handling the notoriously tricky task of rank selection (which determines how many dimensions to keep during analysis) without user input.
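The paper's algorithm itself is not reproduced in this article, so the sketch below only illustrates the general idea under stated assumptions: each class is centered, a single L1-norm principal component is computed with a Kwak-style fixed-point iteration (maximizing the L1 norm of the projections), and a fixed fraction of the worst-fitting points in each class is flagged. The rank-1 subspace, the trimming fraction and the residual-based flagging rule are all assumptions; as noted above, the actual method chooses the rank automatically and needs no such tuning.

```python
# A rough per-class cleaning sketch in the spirit of the description above.
# Assumptions (not from the paper): one L1-norm principal component per
# class, found by a Kwak-style fixed-point iteration, and a fixed trimming
# fraction of worst-fitting points.
import numpy as np

def l1_pc(X, n_iter=200):
    """One L1-norm principal component of row-data X: a local maximizer
    of sum_i |x_i . w| over unit vectors w (fixed-point iteration)."""
    w = X[np.argmax(np.linalg.norm(X, axis=1))]
    w = w / np.linalg.norm(w)
    for _ in range(n_iter):
        s = np.sign(X @ w)
        s[s == 0] = 1.0
        w_new = X.T @ s
        w_new = w_new / np.linalg.norm(w_new)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w

def clean_labels(X, y, trim_frac=0.05):
    """Within each class, flag the points that fit its L1 subspace worst."""
    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        Xc = X[idx] - X[idx].mean(axis=0)          # center the class
        w = l1_pc(Xc)
        # residual distance of each point from the class's rank-1 subspace
        resid = np.linalg.norm(Xc - np.outer(Xc @ w, w), axis=1)
        n_drop = max(1, int(trim_frac * len(idx)))
        keep[idx[np.argsort(resid)[-n_drop:]]] = False   # suspected mislabels
    return keep
```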

The researchers extensively tested their technique on real and synthetic datasets with various levels of label contamination. Across the board, it produced consistent and notable improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems.

"What makes our approach particularly compelling is its flexibility," said Pados. "It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it's not just theoretical: extensive testing on both noisy and clean datasets, including well-known benchmarks like the Wisconsin Breast Cancer dataset, showed consistent improvements in classification accuracy. Even in cases where the original training data appeared flawless, our new method still enhanced performance, suggesting that subtle, hidden label noise may be more common than previously thought."
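As a purely illustrative usage example, the clean_labels sketch above can be placed in front of an SVM on scikit-learn's copy of the Wisconsin Breast Cancer dataset with artificially flipped labels. The 10% noise rate, seed and linear kernel are arbitrary choices, and the numbers this prints are not the paper's results.

```python
# Hypothetical end-to-end use of the clean_labels sketch above; mirrors
# the shape of the evaluation described here, not the paper's experiments.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.1 * len(y_tr)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]                  # inject 10% label noise

baseline = SVC(kernel="linear").fit(X_tr, y_noisy)
keep = clean_labels(X_tr, y_noisy, trim_frac=0.1)  # sketch from above
cleaned = SVC(kernel="linear").fit(X_tr[keep], y_noisy[keep])
print("SVM on noisy labels:", baseline.score(X_te, y_te))
print("SVM after cleaning :", cleaned.score(X_te, y_te))
```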

Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science, such as reducing data bias and improving the completeness of datasets.

"As machine learning becomes deeply integrated into high-stakes domains like health care, finance and the justice system, the integrity of the data driving these models has never been more important," said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science. "We're asking algorithms to make decisions that impact real lives: diagnosing diseases, evaluating loan applications, even informing legal judgments. If the training data is flawed, the consequences can be devastating. That's why innovations like this are so critical. By improving data quality at the source, before the model is even trained, we're not just making AI more accurate; we're making it more responsible. This work represents a meaningful step toward building AI systems we can trust to perform fairly, reliably and ethically in the real world."

This work will appear in the Institute of Electrical and Electronics Engineers' (IEEE) Transactions on Neural Networks and Learning Systems. Co-authors, who are all IEEE members, are Shruti Shukla, a Ph.D. student in the CA-AI and the FAU Department of Electrical Engineering and Computer Science; George Sklivanitis, Ph.D., Charles E. Schmidt Research Associate Professor in the CA-AI and the Department of Electrical Engineering and Computer Science and an I-SENSE faculty fellow; Elizabeth Serena Bentley, Ph.D.; and Michael J. Medley, Ph.D., United States Air Force Research Laboratory.


Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow.

-FAU-