This study investigates irrelevance in multilingual tourism reviews, focusing on how off-topic or ambiguous user-generated content can undermine reliable insights for travelers. A consolidated dataset is constructed by combining a publicly available resource from Kaggle with additional posts acquired from X (formerly Twitter). Each review is manually labeled as relevant or ambiguous to capture instances where the content fails to clearly address travel- or hotel-related topics. We employ a multilingual BERT embedding model to encode the diverse language inputs, enriched with a sentiment vector derived via knowledge distillation from twitter-xlm-roberta-base to DistilBERT. A gating mechanism then fuses the semantic and emotional signals, highlighting the parts of each review most influenced by user attitudes. The final classification stage fine-tunes a BERT-based network to distinguish between unambiguous and ambiguous content. Experimental comparisons with a monolingual BERT approach and a baseline (multilingual embedding without sentiment) reveal that incorporating sentiment features yields consistent improvements in accuracy, precision, recall, and F1-score. This outcome underscores the importance of capturing emotional cues to mitigate errors arising from partial dissatisfaction, unclear references, or cultural nuances. From a practical standpoint, the results point to potential applications in automated moderation, improved recommendation systems, and policy guidelines for tourism platforms. Overall, this work demonstrates that sentiment-aware, multilingual models can enhance the detection of irrelevance and ambiguity, fostering more trustworthy and context-rich online review ecosystems in the travel domain.
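The gating fusion described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the 768-dimensional embeddings, the sigmoid gate over the concatenated vectors, and the randomly initialized gate parameters are all assumptions, standing in for the learned multilingual BERT embedding, the distilled sentiment vector, and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensionality: 768, matching a typical BERT hidden size.
d = 768
semantic = rng.standard_normal(d)   # stand-in for the multilingual BERT embedding
sentiment = rng.standard_normal(d)  # stand-in for the distilled sentiment vector

# Illustrative gate parameters; in the actual model these would be learned.
W_g = rng.standard_normal((d, 2 * d)) * 0.01
b_g = np.zeros(d)

# The gate g in (0,1)^d decides, per dimension, how much of the sentiment
# signal to let through versus the semantic signal.
g = sigmoid(W_g @ np.concatenate([semantic, sentiment]) + b_g)
fused = g * sentiment + (1.0 - g) * semantic

assert fused.shape == (d,)
```

The fused vector would then feed the downstream BERT-based classifier; the per-dimension gate is what lets the model emphasize sentiment-laden portions of a review.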
Copyright © 2024