The rapid proliferation of AI-generated synthetic media poses a substantial threat to digital trust, particularly through audio deepfakes and manipulated text. Unimodal detection systems that analyze audio or text in isolation are insufficient against advanced generative attacks that exploit both modalities simultaneously. This paper proposes an AI-driven multimodal fake content detection framework that jointly leverages acoustic and linguistic signals for robust deepfake identification. Mel-Frequency Cepstral Coefficients (MFCCs) and Mel-spectrograms are extracted from raw audio to capture spectral and temporal vocal patterns, while BERT-based transformer embeddings encode semantic and contextual information from transcripts generated via Automatic Speech Recognition (ASR). An attention-based fusion layer dynamically weights and integrates the two feature streams, and a Random Forest–XGBoost ensemble classifier performs the final authenticity prediction. Experiments on the ASVspoof 2019 benchmark yield a classification accuracy of 95%, with precision of 93%, recall of 94%, and F1-score of 95%, outperforming standalone audio-only and text-only baselines by approximately 4–7%. These findings indicate that cross-modal feature fusion substantially reduces false-detection rates and improves generalization over single-modality approaches. The proposed system is applicable to cybersecurity, voice biometrics, and digital forensics.
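To make the attention-based fusion step concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that the audio features and the text embedding have already been projected to a common dimension, and that the layer scores each modality against a single query vector. The function name `attention_fuse` and the `query` parameter are hypothetical; in a trained model the query would be learned, whereas here it is fixed by hand for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(
    audio_vec: Sequence[float],
    text_vec: Sequence[float],
    query: Sequence[float],
) -> Tuple[List[float], Tuple[float, float]]:
    """Fuse two modality vectors using softmax attention weights.

    Assumption: both vectors share the same dimension as `query`
    (a hypothetical stand-in for a learned parameter).
    """
    if not (len(audio_vec) == len(text_vec) == len(query)):
        raise ValueError("audio_vec, text_vec and query must have equal length")

    # Scaled dot-product score of each modality against the query.
    scale = math.sqrt(len(query))
    scores = [dot(query, audio_vec) / scale, dot(query, text_vec) / scale]

    # Softmax turns the two scores into weights that sum to 1.
    w_audio, w_text = softmax(scores)

    # The fused representation is the weighted sum of the two streams.
    fused = [w_audio * a + w_text * t for a, t in zip(audio_vec, text_vec)]
    return fused, (w_audio, w_text)


# Illustrative values only; these are not features from ASVspoof 2019.
audio_features = [0.2, -0.1, 0.7, 0.4]
text_embedding = [0.5, 0.3, -0.2, 0.1]
query_vector = [1.0, 0.0, 1.0, 0.0]

fused_vec, weights = attention_fuse(audio_features, text_embedding, query_vector)
print("modality weights (audio, text):", weights)
print("fused vector:", fused_vec)
```

In a pipeline of the kind described above, the fused vector would then be passed to the downstream ensemble classifier. The weights indicate how much each modality contributes for a given sample.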