This research presents an instruction-efficient, parallelized implementation of the AES-256 encryption algorithm on NVIDIA GPUs, using CUDA with inline PTX to reduce instruction count and improve execution performance. Conventional AES implementations on CUDA often suffer from redundant instructions and high computational overhead, which limit encryption throughput. To address this, three implementation variants were developed: a baseline version, a parallelized version, and an inline PTX-optimized version. The proposed PTX-enhanced AES reduces instruction redundancy through vectorized memory access, byte permutation, and consolidation of logical operations using PTX instructions. Instruction analysis with NVIDIA Nsight Compute and cuobjdump showed a 66% reduction in executed instructions relative to the baseline, primarily due to the optimized AES operations. Performance evaluation on a 32 MB plaintext demonstrates a roughly 16-fold improvement in throughput, from 2.73 Gbps in the baseline to 43.9 Gbps in the PTX-optimized version, with cycles per byte decreasing from 2.63 to 0.164. These results indicate that integrating fine-grained PTX control within CUDA enables substantial gains in encryption efficiency while maintaining correctness and security. The proposed approach provides a scalable foundation for high-performance cryptographic operations in GPU-based computing environments.
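The three techniques named above (vectorized memory access, byte permutation, and logical operation consolidation) can be illustrated with a minimal CUDA sketch. This is not the authors' kernel and not a full AES round: the names `rotl8_prmt`, `xor3_lop3`, and `demo_kernel` are hypothetical, the two "key" words are arbitrary test values, and the code assumes a GPU of compute capability 5.0 or newer, where the PTX `lop3` instruction is available.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Byte permutation: rotate a 32-bit word left by 8 bits with one prmt.
// Selector 0x2103 picks source bytes 3,0,1,2 for result bytes 0,1,2,3.
__device__ __forceinline__ uint32_t rotl8_prmt(uint32_t x) {
    uint32_t d;
    asm("prmt.b32 %0, %1, %1, 0x2103;" : "=r"(d) : "r"(x));
    return d;
}

// Logical consolidation: a ^ b ^ c in one lop3 (truth table 0x96)
// instead of two separate XOR instructions.
__device__ __forceinline__ uint32_t xor3_lop3(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t d;
    asm("lop3.b32 %0, %1, %2, %3, 0x96;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}

// Hypothetical demo kernel: each thread handles one 16-byte block.
// Reading and writing through uint4 lets the compiler issue 128-bit accesses.
__global__ void demo_kernel(const uint4* in, uint4 k0, uint4 k1, uint4* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint4 s = in[i];  // vectorized load of one 128-bit block
    uint4 r;
    r.x = rotl8_prmt(xor3_lop3(s.x, k0.x, k1.x));
    r.y = rotl8_prmt(xor3_lop3(s.y, k0.y, k1.y));
    r.z = rotl8_prmt(xor3_lop3(s.z, k0.z, k1.z));
    r.w = rotl8_prmt(xor3_lop3(s.w, k0.w, k1.w));
    out[i] = r;       // vectorized store
}

// Host reference for the same per-word operation.
static uint32_t ref_word(uint32_t s, uint32_t a, uint32_t b) {
    uint32_t v = s ^ a ^ b;
    return (v << 8) | (v >> 24);
}

int main() {
    constexpr size_t n = 4;  // four 16-byte blocks of arbitrary test data
    uint4 h_in[n], h_out[n];
    for (size_t i = 0; i < n; ++i)
        h_in[i] = make_uint4(0x11223344u + i, 0x55667788u + i, 0x99aabbccu + i, 0xddeeff00u + i);
    uint4 k0 = make_uint4(0x01020304u, 0x05060708u, 0x090a0b0cu, 0x0d0e0f10u);
    uint4 k1 = make_uint4(0xa1a2a3a4u, 0xb1b2b3b4u, 0xc1c2c3c4u, 0xd1d2d3d4u);

    uint4 *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    demo_kernel<<<1, 32>>>(d_in, k0, k1, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Compare the device result with the plain C++ reference.
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        ok = ok && h_out[i].x == ref_word(h_in[i].x, k0.x, k1.x)
                && h_out[i].y == ref_word(h_in[i].y, k0.y, k1.y)
                && h_out[i].z == ref_word(h_in[i].z, k0.z, k1.z)
                && h_out[i].w == ref_word(h_in[i].w, k0.w, k1.w);
    }
    std::printf("%s\n", ok ? "match" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

The sketch can be built with something like `nvcc -arch=sm_70 demo.cu -o demo`, where the file name and target architecture are placeholders. Whether the compiler would already emit `prmt` and `lop3` for the equivalent C++ expressions depends on the toolchain, so the generated SASS should be inspected (for example with cuobjdump) before attributing any instruction savings to the inline PTX.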
Copyrights © 2026