IAES International Journal of Artificial Intelligence (IJ-AI)
Vol 15, No 2: April 2026

TunDC: a public benchmark dataset for sentiment analysis and language modeling in the Tunisian dialect

Khalil Boulahia, Ahmed
Mars, Mourad



Article Info

Publish Date
01 Apr 2026

Abstract

The development of natural language processing (NLP) applications has increasingly focused on dialectal variations of languages. The Tunisian dialect (TD), a widely spoken variant of Arabic, poses unique linguistic challenges due to its lack of standardized writing conventions and influences from multiple languages, including French, Italian, Turkish, and Berber. In this work, we introduce TunDC, a dataset of 20,044 labeled comments designed to advance NLP research on the TD. The dataset covers diverse linguistic forms (Arabic, Latin, and mixed scripts), and each comment was manually annotated for positive or negative sentiment by native speakers, achieving high inter-annotator agreement. To evaluate its effectiveness, we fine-tuned various models on TunDC. The bert-base-arabic-TunDC-mixed model achieved an accuracy of 0.84 and a macro-averaged F1-score of 0.83, demonstrating strong generalization across sentiment categories and writing systems. A stratified data-splitting strategy considering both sentiment and script type further improved accuracy by approximately 8% compared to standard splits. As a publicly available resource, TunDC contributes to the computational linguistics community, fostering advancements in language modeling and applications tailored to the TD.
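The abstract does not give implementation details, so the following is only a minimal sketch of what a split stratified on both sentiment and script type, and a macro-averaged F1-score, might look like. The field names `sentiment` and `script`, the 80/20 ratio, the seed, and the toy rows are all assumptions for illustration and do not come from the TunDC release.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows so every (sentiment, script) group contributes the same
    share of its members to the test set."""
    groups = defaultdict(list)
    for row in rows:
        # Assumed field names; the real dataset may use different columns.
        groups[(row["sentiment"], row["script"])].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 2 sentiments x 3 scripts x 10 comments each.
rows = [
    {"id": f"{s}-{w}-{i}", "sentiment": s, "script": w}
    for s in ("positive", "negative")
    for w in ("arabic", "latin", "mixed")
    for i in range(10)
]
train_rows, test_rows = stratified_split(rows, test_fraction=0.2, seed=0)

# Count how many test comments each (sentiment, script) group received.
test_counts = defaultdict(int)
for row in test_rows:
    test_counts[(row["sentiment"], row["script"])] += 1

example_f1 = macro_f1(
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "neg", "neg"],
)
```

Stratifying on the combined key, rather than on sentiment alone, keeps rarer combinations (for example, negative comments in mixed script) represented in both partitions, which is one plausible reason such a split could yield more stable evaluation scores.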

Copyright © 2026






Journal Info

Abbrev

IJAI

Subject

Computer Science & IT Engineering

Description

IAES International Journal of Artificial Intelligence (IJ-AI) publishes articles in the field of artificial intelligence (AI). The scope covers all areas of artificial intelligence and their applications, including the following topics: neural networks; fuzzy logic; simulated biological evolution algorithms (like ...