Towards Robust Arabic Authorship Attribution: A Transformer-Based Model for Multi-Author Imbalanced Corpora
Abstract
Authorship Attribution (AA) seeks to identify the author of a text by analyzing distinctive linguistic and stylistic features. While many studies have focused on English and other Latin-script languages, Arabic AA, particularly on imbalanced datasets, remains comparatively underexplored. This work addresses that gap by investigating the effectiveness of pretrained Arabic transformer models for multi-class AA on a newly constructed dataset, SAB-2, composed of seven Arabic books with heterogeneous segment lengths. Given the strong impact of class imbalance on classification performance, we apply the Synthetic Minority Oversampling Technique (SMOTE) to improve minority-class representation and examine its effect on model accuracy. Our experiments evaluate several transformer-based models (AraBERT, AraELECTRA, ARBERT, and MARBERT) alongside deep learning architectures (LSTM, CNN, and a hybrid LSTM-CNN model). Results show that SMOTE substantially improves performance across all models, with the LSTM-CNN architecture combined with AraBERT achieving the highest accuracy of 89%, outperforming baseline experiments without balancing. These results demonstrate the robustness of pretrained Arabic transformers in capturing stylistic features from limited and imbalanced textual data and highlight their potential for advancing Arabic authorship attribution in resource-constrained settings.
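
To illustrate the balancing step described above, the following Python sketch applies SMOTE to toy segment-level feature vectors. The 768-dimensional embeddings, the per-author class counts, and the k_neighbors value are illustrative assumptions, not the paper's reported configuration.

# Illustrative sketch only: balancing segment-level features with SMOTE
# before author classification. The embeddings, class counts, and
# k_neighbors value are assumptions, not the paper's actual pipeline.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

# Toy stand-in for segment embeddings of seven authors with imbalanced
# class counts (e.g., pooled outputs of a frozen Arabic encoder).
counts = [300, 120, 80, 60, 40, 30, 20]
X = np.vstack([rng.normal(loc=i, size=(n, 768)) for i, n in enumerate(counts)])
y = np.concatenate([np.full(n, i) for i, n in enumerate(counts)])

print("before:", Counter(y))

# SMOTE interpolates between each minority sample and its k nearest
# same-class neighbours to synthesize new minority examples.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("after:", Counter(y_res))  # every class resampled to the majority count

Note that oversampling is applied only to training data; evaluating on synthetic samples would inflate the reported accuracy.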
Keywords: Arabic Authorship Attribution, Transformer-Based Models, SMOTE, Deep Learning for NLP, Imbalanced Datasets, Stylometry.
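
The abstract does not specify the exact hybrid architecture, so the sketch below shows one plausible way an LSTM-CNN head could sit on top of frozen AraBERT token embeddings. The layer sizes, the pooling choice, and the aubmindlab/bert-base-arabertv2 checkpoint name are assumptions for illustration only.

# Minimal PyTorch sketch of a hybrid LSTM-CNN head over AraBERT token
# embeddings. Layer sizes and the checkpoint name are illustrative
# assumptions; they are not taken from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LstmCnnHead(nn.Module):
    def __init__(self, hidden=768, lstm_units=128, n_filters=64, n_authors=7):
        super().__init__()
        self.lstm = nn.LSTM(hidden, lstm_units, batch_first=True,
                            bidirectional=True)
        # 1D convolution over the LSTM output sequence (channels = features)
        self.conv = nn.Conv1d(2 * lstm_units, n_filters, kernel_size=3,
                              padding=1)
        self.fc = nn.Linear(n_filters, n_authors)

    def forward(self, token_embeddings):          # (B, T, 768)
        seq, _ = self.lstm(token_embeddings)      # (B, T, 2 * lstm_units)
        seq = torch.relu(self.conv(seq.transpose(1, 2)))  # (B, n_filters, T)
        pooled = seq.max(dim=2).values            # global max pooling over T
        return self.fc(pooled)                    # author logits

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
encoder = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
head = LstmCnnHead()

# A short Arabic sentence ("a brief Arabic text for illustration").
batch = tokenizer(["نص عربي قصير للتوضيح"], return_tensors="pt",
                  padding=True, truncation=True)
with torch.no_grad():                             # encoder kept frozen here
    embeddings = encoder(**batch).last_hidden_state
logits = head(embeddings)                         # shape: (1, 7)

Keeping the encoder frozen and training only the head is one common choice for small corpora; fine-tuning the full transformer is the obvious alternative when enough data is available.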