Optical character recognition system with natural language processing for data recovery on scanned old academic card reports
DOI:
https://doi.org/10.4025/actascitechnol.v47i1.69814
Keywords:
accuracy; image processing; chat-GPT; digital records; preservation.
Abstract
In the digital age, preserving and effectively retrieving historical academic records is a significant challenge, especially when these documents exist only in deteriorated physical formats. We propose an approach to recover data from scanned grade records that uses image processing and Natural Language Processing (NLP) to improve the accuracy of Optical Character Recognition (OCR) on these documents, which is essential for preserving them as digital records. Our methodology has three steps: first, the quality of the scanned image is improved; then, text is extracted using OCR and NLP techniques to retrieve data from old physical grade cards; finally, the extracted data is corrected using Chat-GPT and prepared for upload. The results are encouraging: a Character Error Rate (CER) of 2.15% and a Word Error Rate (WER) of 7.05%, indicating that the OCR system extracts text from scanned documents with high accuracy. These low error rates, achieved through pre-processing and post-processing techniques together with an advanced OCR tool, underscore the potential of this approach to extract information from documents effectively.
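The abstract reports CER and WER but does not say how they were computed. As a minimal sketch, the following Python code uses the conventional definitions: the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth, counted in characters for CER and in words for WER. The function names, the whitespace tokenization, and the absence of any text normalization are assumptions made for illustration and are not taken from the article.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length in characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / reference length in words.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical strings, not data from the article.
    truth = "final grade 8.5"
    ocr_output = "final grude 8.5"
    print(f"CER = {cer(truth, ocr_output):.2%}")
    print(f"WER = {wer(truth, ocr_output):.2%}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER on the same text, which is consistent with the gap between the two figures reported above.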
License
DECLARATION OF ORIGINALITY AND COPYRIGHT
I declare that this article is original and has not been submitted for publication, in whole or in part, to any other national or international journal.
Copyright belongs exclusively to the authors. The journal uses the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits sharing (copying and distributing the material in any medium or format) and adaptation (remixing, transforming, and building upon the licensed content for any purpose, including commercial use).
We recommend reading this link for further information on the topic: how to give credit and references correctly, among other details crucial to the proper use of the licensed material.
