The increasing demand for drone-based surveillance systems has driven significant interest in advancing person and activity recognition based on joint motion features within visual monitoring frameworks. This study contributes deep learning models that improve such surveillance systems using RGB video data recorded by drone cameras. A framework for person and activity recognition is proposed, built on 120 datasets derived from drone-recorded videos of 10 subjects, each performing six movements: walking, running, jogging, boxing, waving, and clapping. Joint motion features, including joint positions and joint angles, were extracted and processed as one-dimensional time-series data. The 1D-CNN, LeNet, AlexNet, and AlexNet-LSTM architectures were developed and evaluated for the classification tasks. Evaluation results show that AlexNet-LSTM outperformed the other models in person recognition, achieving a classification accuracy of 0.8544, a precision of 0.9161, a recall of 0.8575, and an F1-score of 0.8332, while AlexNet delivered superior performance in activity recognition with an accuracy of 0.8571, a precision of 0.8442, a recall of 0.8599, and an F1-score of 0.8463. The relatively small dataset size likely favors simpler architectures such as AlexNet. These findings highlight the effectiveness of joint motion features for person identification and emphasize the suitability of simpler classifier architectures for activity classification when working with small datasets.
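
To make the described pipeline concrete, the following is a minimal sketch of how joint angles could be derived from pose keypoints and fed to a small 1D-CNN classifier. The keypoint indices, tensor shapes, and layer sizes are illustrative assumptions and do not reproduce the paper's actual feature extraction or architectures.

```python
# Hypothetical sketch: joint-angle series from 2D keypoints, classified by a tiny 1D-CNN.
# All names, indices, and layer sizes are assumptions for illustration only.
import numpy as np
import torch
import torch.nn as nn

def joint_angle(a, b, c):
    """Angle (radians) at joint b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos_ang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos_ang, -1.0, 1.0))

# Assume a pose estimator yields (T, J, 2) keypoints per video; here J=17 (COCO-like).
T, J = 128, 17
keypoints = np.random.rand(T, J, 2).astype(np.float32)  # placeholder data

# Example feature: right-elbow angle over time (indices are illustrative).
shoulder, elbow, wrist = 6, 8, 10
angle_series = np.array([
    joint_angle(keypoints[t, shoulder], keypoints[t, elbow], keypoints[t, wrist])
    for t in range(T)
], dtype=np.float32)

class Tiny1DCNN(nn.Module):
    """Minimal 1D-CNN over a single-channel angle series (illustrative sizes)."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, T)
        return self.net(x)

model = Tiny1DCNN(n_classes=6)                      # six activities in the study
x = torch.from_numpy(angle_series).view(1, 1, -1)   # one sequence, one channel
logits = model(x)                                   # (1, 6) activity scores
print(logits.shape)
```

In practice, several such angle and position channels would be stacked along the channel dimension before classification; the single-channel example above is kept deliberately small.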