This study explores the use of deep clustering methods to automatically group handwritten essay answer sheets based on their visual patterns. Feature extraction was performed using three backbone models: ResNet-50, Vision Transformer (ViT-Base), and TrOCR. These features were then clustered using two unsupervised algorithms: K-means (with k = 5) and HDBSCAN (with minimum cluster size = 10). To enhance clustering performance, a deep clustering approach was implemented by applying K-means iteratively to refine the feature representations. Evaluation was conducted both quantitatively, using the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Score, and qualitatively, through t-SNE visualizations and inspection of cluster contents. The ViT and TrOCR backbones outperformed the CNN-based ResNet-50, achieving higher cluster cohesion and separation. Notably, the final clustering result using ViT with HDBSCAN reached a Silhouette Score of 0.772, a Davies-Bouldin Index of 0.369, and a Calinski-Harabasz Score of 408.006. The findings indicate that Vision Transformer-based models are more effective for unsupervised grouping of handwritten visual data. This approach can help educators grade more quickly and objectively, and may serve as a foundation for future automated essay evaluation systems that integrate OCR and NLP techniques.
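The following is a minimal sketch of the pipeline summarized above: ViT feature extraction, clustering with K-means (k = 5) and HDBSCAN (minimum cluster size = 10), and evaluation with the three internal metrics. It assumes the Hugging Face transformers ViT checkpoint, the hdbscan package, and scikit-learn; the helper functions and model name are illustrative, not the authors' implementation.

```python
# Illustrative sketch, not the authors' code: ViT features -> clustering -> metrics.
import numpy as np
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
import hdbscan
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Assumed ViT-Base checkpoint; the paper does not specify the exact weights.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

def extract_features(image_paths):
    """Encode each scanned answer sheet with the ViT [CLS] embedding."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            image = Image.open(path).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            outputs = backbone(**inputs)
            feats.append(outputs.last_hidden_state[:, 0].squeeze(0).numpy())
    return np.stack(feats)

def cluster_and_score(features):
    """Cluster features with K-means (k=5) and HDBSCAN, then compute the three metrics."""
    kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
    hdb_labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(features)
    results = {}
    for name, labels in [("kmeans", kmeans_labels), ("hdbscan", hdb_labels)]:
        mask = labels >= 0  # drop HDBSCAN noise points (label -1)
        if len(set(labels[mask])) > 1:
            results[name] = {
                "silhouette": silhouette_score(features[mask], labels[mask]),
                "davies_bouldin": davies_bouldin_score(features[mask], labels[mask]),
                "calinski_harabasz": calinski_harabasz_score(features[mask], labels[mask]),
            }
    return results
```

In this sketch the iterative deep clustering refinement is omitted; a typical variant would alternate K-means assignment with fine-tuning of the backbone on the pseudo-labels, but the abstract does not specify the exact procedure used.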