This paper proposes a joint-fusion multi-modal Artificial Intelligence of Things (AIoT) framework for precision horticulture of chili and tomato. We fuse time-series IoT signals (air/leaf temperature, humidity, soil moisture, pH, EC, PAR) with RGB/multispectral images of leaves, fruits, and canopy via an attention-based shared representation. In a 500 m² field trial in Majalengka with more than 5,000 labeled images and IoT streams synchronized at 10-minute intervals, our model outperforms single-modal baselines. For chili leaf disease detection, joint fusion reaches 90.0% accuracy (IoT-only 72.0%, vision-only 81.0%). For tomato maturity classification, it achieves 92.0% accuracy (IoT-only 68.0%, vision-only 84.0%). For yield estimation, the multi-modal regressor attains R² = 0.89. We detail data synchronization, train/validation/test splits, baseline configurations (IoT-LSTM, CNN/ViT, early/late fusion), and deployment on an edge-cloud pipeline. The results indicate that modeling cross-modal interactions improves robustness and decision support for irrigation, fertilization, and harvest scheduling. We conclude with ablation analyses and practical implications for Indonesian precision agriculture.
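
As a rough illustration of the joint-fusion design summarized above (an LSTM over the sensor streams, a vision encoder, and cross-modal attention into a shared representation), the PyTorch sketch below shows one way such a model could be wired. All class names, layer sizes, sequence lengths, and hyperparameters here are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of an attention-based joint-fusion model, assuming PyTorch.
# Encoder choices and all dimensions below are hypothetical.
import torch
import torch.nn as nn


class JointFusionModel(nn.Module):
    """Fuses an IoT time-series encoder and an image encoder via attention."""

    def __init__(self, n_sensors=6, d_model=128, n_classes=4):
        super().__init__()
        # IoT branch: LSTM over the six synchronized sensor channels
        # (air/leaf temperature, humidity, soil moisture, pH, EC, PAR).
        self.iot_encoder = nn.LSTM(n_sensors, d_model, batch_first=True)
        # Vision branch: a small CNN standing in for the CNN/ViT backbone.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Cross-modal attention: the image feature queries the IoT sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, iot_seq, image):
        # iot_seq: (B, T, n_sensors); image: (B, 3, H, W)
        iot_feats, _ = self.iot_encoder(iot_seq)          # (B, T, d_model)
        img_feat = self.img_encoder(image).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.cross_attn(img_feat, iot_feats, iot_feats)
        # Shared representation: attended IoT context + image feature.
        joint = torch.cat([fused.squeeze(1), img_feat.squeeze(1)], dim=-1)
        return self.head(joint)


# Smoke test with dummy data: 144 steps = one day of 10-minute readings,
# paired with one RGB image per sample.
model = JointFusionModel()
logits = model(torch.randn(2, 144, 6), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 4])
```

The 144-step dummy sequence corresponds to one day of readings at the 10-minute sampling interval; in practice the vision branch would be one of the pretrained CNN/ViT backbones listed among the baselines, and the same fused representation could feed the yield-estimation regression head.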