This study addresses overfitting, a frequent problem in multimodal emotion recognition models. It optimizes the models with several hyperparameter approaches, namely dropout layers, L2 kernel regularization, batch normalization, and a learning rate schedule, and identifies which approach contributes most to reducing overfitting. For the emotion data, this research uses the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, drawing on the motion capture and speech audio modalities. The models used in the experiments are a convolutional neural network (CNN) for the motion capture data and a CNN-bidirectional long short-term memory (CNN-BiLSTM) network for the audio data. The study also applies a smaller batch size to accommodate limited computing resources. The experiments show that optimization through hyperparameter tuning raises the validation accuracy to 73.67% on the audio data and the F1-score to 73% on the motion capture data, improving on this study's baseline models and remaining competitive with results reported in other research. It is hoped that the optimization results of this study will be useful for future emotion recognition research, especially for work that encounters overfitting problems.
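As a rough illustration of how the four regularization approaches named above can be combined in a CNN-BiLSTM audio model, the following minimal sketch assumes a Keras/TensorFlow implementation; the input shape, layer sizes, dropout rate, L2 weight, and schedule parameters are placeholder assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch (illustrative, not the paper's exact configuration) combining
# dropout, L2 kernel regularization, batch normalization, and a learning rate
# schedule in a CNN-BiLSTM for sequential audio features (e.g., MFCC frames).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_cnn_bilstm(input_shape=(300, 40), num_classes=4,
                     l2_weight=1e-4, dropout_rate=0.5):
    """CNN-BiLSTM with the regularization techniques discussed in the study.

    All shapes, unit counts, and rates here are assumed values for illustration.
    """
    return models.Sequential([
        layers.Input(shape=input_shape),
        # Convolutional front-end with L2 kernel regularization and batch norm.
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(l2_weight)),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(dropout_rate),
        # Bidirectional LSTM to capture temporal context in both directions.
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(dropout_rate),
        layers.Dense(num_classes, activation="softmax",
                     kernel_regularizer=regularizers.l2(l2_weight)),
    ])

# Learning rate schedule: decay the rate exponentially as training progresses.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)

model = build_cnn_bilstm()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# A smaller batch size, as in the study, helps fit limited computing resources:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=16, epochs=50)
```

Dropout and L2 regularization penalize over-reliance on individual features and weights, batch normalization stabilizes training of the convolutional front-end, and the decaying learning rate reduces oscillation in later epochs; the study compares the impact of each of these on validation performance.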