Trisha, J. S.
Unknown Affiliation

AI-Driven Multi-Modal Fake Content Detection System Using Audio-Text Fusion and Transformer Network
Jeeva, S.; Trisha, J. S.; Keerthana, S.
Journal of Technology Informatics and Engineering (JTIE), Vol. 5 No. 1 (2026): April
Publisher : University of Science and Computer Technology

DOI: 10.51903/jtie.v5i1.475

Abstract

The rapid proliferation of AI-generated synthetic media poses substantial threats to digital trust, particularly through audio deepfakes and manipulated text. Existing unimodal detection systems, which analyze either audio or text in isolation, remain insufficient against advanced generative attacks that exploit both modalities simultaneously. This paper proposes an AI-driven multimodal fake content detection framework that jointly leverages acoustic and linguistic signals for robust deepfake identification. Mel-Frequency Cepstral Coefficients (MFCCs) and Mel-spectrograms are extracted from raw audio to capture spectral and temporal vocal patterns, while BERT-based transformer embeddings encode semantic and contextual information from transcripts generated via Automatic Speech Recognition (ASR). An attention-based fusion layer dynamically weights and integrates the two feature streams, and a Random Forest–XGBoost ensemble classifier performs the final authenticity prediction. Experiments on the ASVspoof 2019 benchmark show a classification accuracy of 95%, with precision of 93%, recall of 94%, and F1-score of 95%, outperforming standalone audio-only and text-only baselines by approximately 4–7%. These findings indicate that cross-modal feature fusion substantially reduces false-detection rates and improves generalization over single-modality approaches. The proposed system is applicable to cybersecurity, voice biometrics, and digital forensics.
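
The abstract does not specify how the attention-based fusion layer is built, so the following is only a minimal sketch of one common way such a layer could weight two feature streams. It assumes the audio and text features have already been projected to embeddings of equal length, and it scores each modality against a query vector that would be learned in a real system. The function name `attention_fuse`, the `query` vector, and all numeric values are hypothetical illustrations, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_vec, text_vec, query):
    """Fuse two equal-length embeddings with attention weights.

    Each modality gets a scaled dot-product score against `query`
    (hypothetical; a trained model would learn it). The scores are
    normalized with softmax and used to form a weighted sum.
    """
    if not (len(audio_vec) == len(text_vec) == len(query)):
        raise ValueError("all vectors must have the same length")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, vec)) / scale
        for vec in (audio_vec, text_vec)
    ]
    weights = softmax(scores)
    fused = [
        weights[0] * a + weights[1] * t
        for a, t in zip(audio_vec, text_vec)
    ]
    return fused, weights


# Toy inputs (made up): stand-ins for a projected MFCC/Mel embedding
# and a projected BERT transcript embedding.
audio_vec = [0.2, -0.1, 0.4, 0.0]
text_vec = [0.5, 0.3, -0.2, 0.1]
query = [1.0, 0.0, 1.0, 0.0]

fused, weights = attention_fuse(audio_vec, text_vec, query)
print("modality weights (audio, text):", weights)
print("fused feature vector:", fused)
```

In a pipeline like the one described, the fused vector would then be passed to the downstream classifier. Because the weights are recomputed for every input, a sample with an uninformative transcript can lean more on its acoustic features, and the reverse.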