Towards Robust Arabic Authorship Attribution: A Transformer-Based Model for Multi-Author Imbalanced Corpora

  • Salah KHENNOUF, University of M'sila, Algeria
  • Hassina HADJADJ, University of Science and Technology Houari Boumediene (USTHB)
  • Mounir BOURAS, University of M'sila (28000), Algeria
  • Abdelhafid BENYOUNES, Department of Electrical Engineering, Faculty of Technology, Electrical Engineering Laboratory (LGE), University of M'sila (28000), Algeria

Abstract

Authorship Attribution (AA) seeks to identify the author of a text by analyzing distinctive linguistic and stylistic features. While many studies have focused on English and other Latin-script languages, Arabic AA, particularly with imbalanced datasets, remains comparatively underexplored. This work addresses this gap by investigating the effectiveness of Arabic pretrained transformer models for multi-class AA on a newly constructed dataset, SAB-2, composed of seven Arabic books with heterogeneous segment lengths. Given the strong impact of data imbalance on classification performance, we apply the Synthetic Minority Oversampling Technique (SMOTE) to enhance minority-class representation and examine its influence on model accuracy. Our experiments evaluate several transformer-based models (AraBERT, AraELECTRA, ARBERT, and MARBERT) alongside deep learning architectures (LSTM, CNN, and a hybrid LSTM-CNN model). Results show that SMOTE substantially improves performance across all models, with the LSTM-CNN architecture combined with AraBERT achieving the highest accuracy of 89%, outperforming baseline experiments without balancing. These results demonstrate the robustness of Arabic pretrained transformers in capturing stylistic features from limited and imbalanced textual data, highlighting their potential for advancing Arabic authorship attribution in resource-constrained domains.
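To illustrate the balancing step described above, the following is a minimal, self-contained sketch of the core SMOTE idea: synthetic minority-class samples are created by interpolating between an existing minority sample and one of its k nearest neighbours. This is a didactic toy (the function name, toy data, and pure-Python implementation are illustrative assumptions), not the authors' pipeline, which in practice would use a library implementation such as `imblearn.over_sampling.SMOTE` on feature vectors derived from the text segments.

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch (illustrative, not the paper's implementation):
    for each synthetic point, pick a random minority sample, find its k
    nearest neighbours, and interpolate at a random position between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Toy 2-D minority class of 4 points, oversampled with 6 synthetic points
# so that its size matches a hypothetical majority class of 10 samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_oversample(minority, n_new=6)
print(len(minority) + len(new_points))  # balanced class size: 10
```

Because each synthetic point is a convex combination of two real minority samples, the new points stay inside the region occupied by the minority class rather than duplicating existing samples, which is what lets the classifier see more varied minority examples.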

Keywords— Arabic Authorship Attribution, Transformer-based Models, SMOTE, Deep Learning for NLP, Imbalanced Datasets, Stylometry.

Published
2026-02-16
Section
Special Issue: Advances in Mathematical Sciences