Optical character recognition system with natural language processing for data recovery on scanned old academic card reports

Authors

DOI:

https://doi.org/10.4025/actascitechnol.v47i1.69814

Keywords:

accuracy; image processing; chat-GPT; digital records; preservation.

Abstract

In the digital age, preserving and effectively retrieving historical academic records is a significant challenge, especially when these documents exist only in deteriorated physical formats. We propose an approach to recover data from scanned grade records that uses image processing and Natural Language Processing (NLP) to enhance the accuracy of Optical Character Recognition (OCR), which is essential for the preservation of digital records. Our methodology has three steps: first, the quality of the scanned image is improved; then, text is extracted using OCR and NLP techniques to retrieve data from old physical grade cards; and finally, the extracted data is corrected using Chat-GPT and prepared for upload. The results are encouraging: a Character Error Rate (CER) of 2.15% and a Word Error Rate (WER) of 7.05% indicate that the OCR system extracts text from the scanned documents with high accuracy. These low error rates, achieved through the pre-processing and post-processing techniques together with an advanced OCR tool, underscore the potential of this approach to extract information from such documents effectively.
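The abstract reports CER and WER but does not say how they were computed. As a minimal sketch, assuming the conventional definition (Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth in characters or words), the two metrics could be calculated as follows. The function names and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length in characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / reference length in words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: the OCR engine reads the digit "1" as the letter "l".
truth = "Final grade 15"
ocr_output = "Final grade l5"
print(f"CER = {cer(truth, ocr_output):.2%}")
print(f"WER = {wer(truth, ocr_output):.2%}")
```

A single misread character damages one of 14 characters but one of only 3 words, which illustrates why WER is typically higher than CER on the same text, as in the figures reported above.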


Published

2024-12-09

How to Cite

Casas-Huamanta, E. R., Pinedo, L., Barbachán-Ruales, E. A., Cardenas-García, A., Rossel-Bernedo, L. A., & Seijas-Díaz, J. G. (2024). Optical character recognition system with natural language processing for data recovery on scanned old academic card reports. Acta Scientiarum. Technology, 47(1), e69814. https://doi.org/10.4025/actascitechnol.v47i1.69814

Issue

Section

Computer Science

 

CiteScore 2019: 0.8 (36th percentile, Scopus)

 

 
