Efficient Data Preprocessing for Extractive Question Answering Models
Abstract
This study presents a systematic approach to building a domain-specific question-answering (QA) dataset from Indian Lok Sabha parliamentary proceedings, with a primary focus on meticulous data preprocessing. Parliamentary transcripts are often lengthy, noisy, and unstructured, posing significant challenges for downstream natural language processing (NLP) tasks. To address this, we designed a comprehensive preprocessing pipeline involving cleaning, segmentation, annotation, normalization, and tokenization to convert raw transcripts into structured, high-quality QA-ready data. Each step was tailored to the linguistic and structural characteristics of parliamentary text. Experimental evaluation through an ablation study demonstrated that our preprocessing pipeline led to a significant performance improvement of 9.4% in Exact Match (EM) and 8.5% in F1 score when used to train a BERT-based QA model. Additionally, we conducted bias analysis and compared our dataset's performance with standard benchmarks to validate its quality and relevance. This work underscores that robust preprocessing is foundational to creating reliable, domain-adapted QA datasets for extractive models.
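
To make the pipeline stages concrete, here is a minimal sketch of the cleaning, normalization, and tokenization steps for extractive QA. The abstract does not specify the authors' actual rules or tooling, so the regex noise patterns (timestamps, stage directions) are illustrative assumptions, and the tokenizer shown is Hugging Face's standard BERT tokenizer with overlapping-window encoding for long contexts.

import re
from transformers import AutoTokenizer

# Hypothetical transcript-noise patterns; the paper's actual cleaning rules are not given here.
TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")                      # e.g. [11:05]
STAGE_DIRECTION = re.compile(r"\((?:interruptions|applause|laughter)\)", re.IGNORECASE)

def clean(text: str) -> str:
    """Cleaning + normalization: strip transcript noise and collapse whitespace."""
    text = TIMESTAMP.sub(" ", text)
    text = STAGE_DIRECTION.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_example(question: str, context: str):
    """Tokenization: encode a (question, context) pair for extractive QA,
    splitting long parliamentary contexts into overlapping 384-token windows."""
    return tokenizer(
        question,
        clean(context),
        truncation="only_second",        # truncate the context, never the question
        max_length=384,
        stride=128,                      # overlap between successive windows
        return_overflowing_tokens=True,  # one encoded window per context chunk
        return_offsets_mapping=True,     # needed to map answer spans back to characters
    )

The windowed encoding matters for this domain: a single Lok Sabha sitting can run far past BERT's 512-token limit, so each (question, context) pair is expanded into several overlapping windows and the answer span is located within whichever window contains it.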
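
The EM and F1 gains reported above follow the standard extractive-QA scoring conventions. The paper's exact scoring script is not shown on this page, but the widely used SQuAD-style definitions it presumably follows look like this:

import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation and articles."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)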
Copyright (c) 2025 Boletim da Sociedade Paranaense de Matemática

This work is licensed under the Creative Commons Attribution 4.0 International license.
When the manuscript is accepted for publication, the authors automatically agree to transfer the copyright to the SPM.
The journal uses the Creative Commons Attribution (CC-BY 4.0) license.



