El Rhaffouli, Yassine
Unknown Affiliation

Published : 1 Document
Articles

Journal : Linguistics Initiative

Evaluating Google’s new TranslateGemma for English–Arabic machine translation: A corpus-based assessment using United Nations documents El Rhaffouli, Yassine; Boughaba, Hicham
Linguistics Initiative Vol. 6 No. 1 (2026)
Publisher : Pusat Studi Bahasa dan Publikasi Ilmiah

DOI: 10.53696/27753719.61411

Abstract

This study presents a systematic evaluation of Google’s newly released TranslateGemma-4B model for English–Arabic machine translation using official United Nations parallel documents. Despite the recent proliferation of large language models claiming multilingual competence, empirical assessments of translation quality for morphologically rich languages such as Arabic (languages that exhibit extensive inflection and many grammatical cases) remain limited, particularly when the output is evaluated against professionally approved institutional reference translations. We assessed TranslateGemma in a zero-shot configuration (i.e., the prompt contained only the task instruction, with no examples or demonstrations) on 10,000 sentence pairs drawn from the UN English–Arabic Parallel Corpus. The model was run on Google Colab because its computational requirements exceeded local hardware capacity. Translation outputs were evaluated using five complementary automatic metrics: BLEU (6.95), chrF++ (33.28), METEOR (20.90), BERTScore (74.21), and COMET (71.60). In addition, we implemented diagnostic heuristics to detect omissions, hallucinations (i.e., invented content absent from the source), digit mismatches, and terminology inconsistencies. Results indicate that while TranslateGemma achieves moderate semantic similarity scores on neural metrics, it exhibits substantial deficiencies in surface-form accuracy and institutional terminology consistency. Sentence length ratio analysis revealed systematic under-translation, with 16.86% of outputs flagged for potential omission. Digit mismatch errors affected 24.48% of the corpus, a serious concern for high-stakes translation contexts. Terminology consistency analysis using an expanded UN glossary showed that 7.95% of sentences containing frozen institutional terms failed to preserve the standard Arabic equivalents.
These findings demonstrate that TranslateGemma, while showing promise in capturing broad semantic adequacy, requires significant refinement before deployment in institutional translation workflows where precision and terminological fidelity are paramount.
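The diagnostic heuristics mentioned in the abstract (length-ratio omission flagging, digit-mismatch detection, and glossary-based terminology checks) can be illustrated with a minimal sketch. The thresholds, the function name `diagnostic_flags`, and the sample glossary entry below are illustrative assumptions, not the study's actual parameters or implementation.

```python
import re

def diagnostic_flags(source, translation,
                     glossary=None,
                     min_len_ratio=0.6, max_len_ratio=1.8):
    """Flag a sentence pair using simple diagnostic heuristics.

    Thresholds and glossary entries are illustrative, not the
    values used in the study.
    """
    flags = []

    # Length-ratio check: a very short output suggests omission,
    # a very long one suggests hallucinated content.
    ratio = len(translation.split()) / max(len(source.split()), 1)
    if ratio < min_len_ratio:
        flags.append("possible_omission")
    elif ratio > max_len_ratio:
        flags.append("possible_hallucination")

    # Digit-mismatch check: the multiset of digit sequences in the
    # source should reappear unchanged in the translation.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", translation)):
        flags.append("digit_mismatch")

    # Terminology check: frozen institutional terms in the source
    # must map to their standard Arabic equivalents.
    for en_term, ar_term in (glossary or {}).items():
        if en_term.lower() in source.lower() and ar_term not in translation:
            flags.append("terminology_inconsistency")

    return flags
```

For example, a glossary entry such as `{"Security Council": "مجلس الأمن"}` would flag any translation of a Security Council sentence that drops the standard Arabic equivalent, and a source sentence containing "Resolution 1325" whose translation omits the number would be flagged as a digit mismatch.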