Multi-Model and Assertion-Based Confidence Scoring to Improve Data Quality in Artificial Intelligence Training
Abstract
Artificial intelligence model performance fundamentally depends on training data quality, yet most real-world datasets contain noise, label errors, and distribution shifts that compromise learning effectiveness. We present a general methodology that combines multi-model predictions with prompt-based confidence scoring to systematically extract high-quality training data from noisy datasets. Our framework operates through a multi-phase pipeline in which outputs from multiple pre-trained models are evaluated using contrastive language-image prompts to generate normalized confidence scores for each sample. High-confidence samples are retained for training, while low-confidence samples are filtered out or queued for review. To demonstrate this approach, we apply the methodology to age estimation from facial images as a representative computer vision task. In our case study, outputs from two different architectures (Buffalo-L and a custom lightweight model) are processed through CLIP-based semantic evaluation with positive and negative age-related prompts. Samples scoring above empirically determined thresholds are used to train efficient custom models, achieving improved accuracy and computational efficiency. The threshold optimization process employs ROC curve analysis on validation data, while the pipeline integrates automatic data balancing and label correction mechanisms. This framework addresses the critical challenge of data quality in weakly supervised scenarios and provides a scalable approach adaptable to diverse domains, including text analysis, speech recognition, and medical imaging. Our age estimation case study demonstrates a 29% accuracy improvement and a 40% training time reduction using only 32% of the original dataset, validating the effectiveness of quality-focused data selection over quantity-based approaches.
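The core scoring step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes precomputed image and prompt embeddings (in practice produced by a CLIP encoder), computes a softmax-normalized confidence from cosine similarities to positive versus negative prompts, and splits samples at a threshold. The `temperature` value and the max-over-prompts aggregation are illustrative assumptions.

```python
import numpy as np

def confidence_score(image_emb, pos_prompt_embs, neg_prompt_embs, temperature=100.0):
    """Contrastive prompt-based confidence: softmax over the best cosine
    similarity to positive vs. negative prompt embeddings (assumed precomputed)."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = max(cos(image_emb, p) for p in pos_prompt_embs)
    neg = max(cos(image_emb, n) for n in neg_prompt_embs)
    logits = temperature * np.array([pos, neg])
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[0]  # normalized confidence that the positive prompts match

def filter_samples(samples, threshold):
    """Retain high-confidence samples for training; queue the rest for review."""
    keep, review = [], []
    for s in samples:
        (keep if s["confidence"] >= threshold else review).append(s)
    return keep, review
```

In the full pipeline the threshold passed to `filter_samples` would come from ROC curve analysis on validation data rather than being fixed by hand.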
Copyright (c) 2026 Boletim da Sociedade Paranaense de Matemática

This work is licensed under a Creative Commons Attribution 4.0 International License.
When a manuscript is accepted for publication, the authors automatically agree to transfer its copyright to the SPM.
The journal uses the Creative Commons Attribution (CC-BY 4.0) license.



