Abstract

In today's digital era, speech recognition technology has become increasingly important, powering applications such as virtual assistants, voice navigation systems, automated transcription services, and smart devices based on the Internet of Things (IoT). This study aims to improve the performance of a speech-to-text system by combining Mel-Frequency Cepstral Coefficients (MFCC) with Dynamic Time Warping (DTW). MFCC extracts distinctive features from the speech signal, while DTW compensates for differences in speaking speed or time scale between speech sequences of varying length. The K-Nearest Neighbors (K-NN) algorithm is then applied to classify each utterance into text based on the extracted features. Experimental results show that the combination of MFCC, DTW, and K-NN reaches accuracy, precision, recall, and F1-score values of up to 84%. The approach is effective on embedded platforms with limited computational resources, such as the Raspberry Pi, where it still delivers reliable speech recognition performance.

Keywords: Dynamic Time Warping, Mel-Frequency Cepstral Coefficients, K-Nearest Neighbors, Speech-to-Text, Speech Recognition.
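As a rough illustration of how DTW and K-NN can work together, the following minimal sketch classifies an utterance by finding the labeled templates with the smallest DTW distance to it. It is not the implementation evaluated in this study. It assumes each utterance has already been converted into a sequence of feature frames (MFCC vectors in practice; toy one-dimensional values here), and the function names, the Euclidean frame distance, the labels, and the value of k are all hypothetical choices made for the example.

```python
from collections import Counter
from math import inf, sqrt


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed local cost)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature frames."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] holds the minimal cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_predict(query, templates, k=1):
    """Majority vote among the k templates nearest to the query under DTW.

    templates is a list of (sequence, label) pairs.
    """
    ranked = sorted(templates, key=lambda t: dtw_distance(query, t[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: one-dimensional "frames" standing in for MFCC vectors.
templates = [
    ([[0.0], [1.0], [2.0], [1.0], [0.0]], "yes"),
    ([[2.0], [1.0], [0.0], [1.0], [2.0]], "no"),
]
# A slower rendition of the first pattern: same shape, more frames.
query = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]

print(knn_predict(query, templates, k=1))  # prints "yes"
```

The query has eight frames while the matching template has five, yet DTW aligns them at zero cost because repeated frames can map onto a single template frame. This is the property that lets the classifier tolerate differences in speaking speed. The quadratic cost of the alignment grows with sequence length, which is worth keeping in mind on a resource-constrained device.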