Automatic Speech Recognition (ASR) for the Indonesian language still suffers from a high Word Error Rate (WER), especially when pre-trained models are used without fine-tuning. This study develops an optimized ASR system on a hybrid cloud architecture that integrates the Faster-Whisper large-v3 engine with advanced audio preprocessing techniques. The system is distributed, with Google Colab (Tesla T4, 15 GB VRAM) as the GPU server and an Ubuntu 22.04 LTS machine (8 cores, 32 GB RAM) as the client. Evaluation was conducted on five Indonesian audio samples covering formal news, informal conversations, and long-duration recordings. The system processed 80% of the samples successfully, with WER ranging from 27.69% (formal news) to 645.16% (informal conversations). Resource utilization remained modest, at 21.3% GPU usage and 35.4% RAM usage. Processing time was stable for normal-sized files, but large files (>50 MB) timed out. The results indicate that a hybrid cloud architecture is feasible for distributed ASR processing in Indonesian, although accuracy on informal speech and the handling of large files still require optimization before production deployment.
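A WER above 100%, such as the 645.16% reported for informal conversations, is possible because WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. When the recognizer inserts many words that are not in the reference, the error count can exceed the reference length. The sketch below is a minimal illustration of that definition, not the evaluation code used in this study. The function name `word_error_rate` and the example sentences are hypothetical, and normalization is assumed to be limited to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a two-word reference and a hypothesis with five extra words.
wer = word_error_rate(
    "selamat pagi",
    "selamat pagi semua terima kasih sudah menonton",
)
print(f"WER = {wer:.2%}")
```

In this hypothetical case the two reference words are recognized correctly, yet the five inserted words alone yield a WER of 250%. This is why short informal utterances combined with heavily over-generated output can produce values far above 100%.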