Linguistics Initiative
Vol. 6 No. 1 (2026)

Evaluating Google’s new TranslateGemma for English–Arabic machine translation: A corpus-based assessment using United Nations documents

El Rhaffouli, Yassine (Unknown)
Boughaba, Hicham (Unknown)



Article Info

Publish Date
24 Feb 2026

Abstract

This study presents a systematic evaluation of Google’s newly released TranslateGemma-4B model for English–Arabic machine translation using official United Nations parallel documents. Despite the recent proliferation of large language models claiming multilingual competence, empirical assessments of translation quality for morphologically rich languages (i.e., languages with extensive grammatical case marking and inflected word forms) such as Arabic remain limited, particularly when measured against professionally approved institutional reference translations. We assessed TranslateGemma in a zero-shot configuration (i.e., the prompt contained only the task instruction, with no examples or demonstrations) on 10,000 sentence pairs drawn from the UN English–Arabic Parallel Corpus. The model was executed on Google Colab because its computational requirements exceeded local hardware capacity. Translation outputs were evaluated using five complementary automatic metrics: BLEU (6.95), chrF++ (33.28), METEOR (20.90), BERTScore (74.21), and COMET (71.60). Additionally, we implemented diagnostic heuristics to detect omissions, hallucinations (i.e., content invented by the model and absent from the source), digit mismatches, and terminology inconsistencies. Results indicate that while TranslateGemma achieves moderate semantic similarity scores on neural metrics, it exhibits substantial deficiencies in surface-form accuracy and institutional terminology consistency. Sentence length ratio analysis revealed systematic under-translation patterns, with 16.86% of outputs flagged for potential omission. Digit mismatch errors affected 24.48% of the corpus, raising concerns for high-stakes translation contexts. Terminology consistency analysis using an expanded UN glossary indicated that 7.95% of sentences containing frozen institutional terms failed to preserve standard Arabic equivalents.
These findings demonstrate that TranslateGemma, while showing promise in capturing broad semantic adequacy, requires significant refinement before deployment in institutional translation workflows where precision and terminological fidelity are paramount.
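The digit-mismatch and omission heuristics described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the 0.5 length-ratio threshold and the function names are assumptions for demonstration, and Eastern Arabic numerals are normalized to ASCII digits before comparison so that, e.g., ١٣٢٥ and 1325 count as the same number.

```python
import re

# Normalize Eastern Arabic digits (٠-٩) to ASCII so numbers compare consistently.
EASTERN_TO_ASCII = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def extract_digits(text: str) -> list:
    """Return the sorted digit sequences in a text, after numeral normalization."""
    return sorted(re.findall(r"\d+", text.translate(EASTERN_TO_ASCII)))

def digit_mismatch(source: str, translation: str) -> bool:
    """Flag sentence pairs whose source and translation carry different numbers."""
    return extract_digits(source) != extract_digits(translation)

def omission_suspect(source: str, translation: str, min_ratio: float = 0.5) -> bool:
    """Flag translations much shorter than the source as potential omissions.
    The 0.5 word-length-ratio threshold is illustrative, not the paper's value."""
    src_len = max(len(source.split()), 1)
    return len(translation.split()) / src_len < min_ratio
```

A pair like ("Resolution 1325 of 2000", "القرار ١٣٢٥ لعام ٢٠٠١") would be flagged by `digit_mismatch`, since the year differs after normalization; such surface-level checks complement neural metrics like BERTScore and COMET, which can score semantically plausible but numerically wrong outputs highly.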

Copyrights © 2026






Journal Info

Abbrev

live

Publisher

Subject

Humanities, Language, Linguistics, Communication & Media

Description

Linguistics Initiative is an academic journal that presents issues in linguistics and applied linguistics from multi-disciplinary approaches. This journal publishes articles that discuss research on language as a system of communication or a cognitive, social, and historical phenomenon as well as ...