P.K. Nikolyuk, Y.V. Myshkivska, M.I. Ovchar
Èlektron. model. 2026, 48(2):115-128
ABSTRACT
The development and implementation of a real-time multimodal object recognition system based on deep convolutional neural networks are presented. The main focus is on integrating data from RGB cameras, infrared sensors, and depth sensors, which makes the system robust to changes in lighting and to the partial loss of sensor information. The proposed architecture is based on a modified You Only Look Once (YOLO) model and includes modules for preprocessing, noise filtering, adaptive feature fusion (late fusion), and optimization for embedded systems. YOLO is an approach to object recognition in images in which the entire scene is analyzed in a single pass of the neural network. Unlike traditional methods, which first detect regions of interest and then classify objects, YOLO performs localization and classification simultaneously, providing high speed and efficiency, especially in real time. Experimental results confirm the effectiveness of the proposed approach: in complex conditions, the system classifies objects observed by unmanned aerial vehicles (UAVs) more accurately than monomodal approaches. The developed solution is promising for use in military applications, rescue operations, autonomous transport, and video surveillance.
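The adaptive late-fusion step described in the abstract can be illustrated with a minimal sketch. The softmax gating scheme, the `late_fusion` function, and the feature shapes below are illustrative assumptions, not the authors' implementation; the sketch only shows how weighted fusion can tolerate the dropout of one modality (here, the depth sensor):

```python
import numpy as np

def late_fusion(features, gate_logits):
    """Adaptively fuse per-modality feature vectors (e.g. RGB, IR, depth).

    Illustrative assumption: each modality contributes a feature vector
    weighted by a softmax over learnable gate logits. A missing modality
    (None) is simply excluded and the remaining weights are renormalized,
    which models robustness to the loss of a sensor stream.
    """
    kept = [i for i, f in enumerate(features) if f is not None]
    logits = np.array([gate_logits[i] for i in kept], dtype=float)
    # Numerically stable softmax over the surviving modalities only.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    fused = sum(w * features[i] for w, i in zip(weights, kept))
    return fused, weights

# Toy per-modality feature vectors (hypothetical values).
rgb = np.array([0.9, 0.1, 0.0])
ir = np.array([0.2, 0.7, 0.1])
depth = None  # simulate loss of the depth sensor

fused, w = late_fusion([rgb, ir, depth], gate_logits=[1.0, 0.0, 0.0])
```

In a real detector the gate logits would be produced by a small learned subnetwork and the fused features passed to the YOLO detection head; here they are fixed constants purely for illustration.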
KEYWORDS
multimodal recognition, UAV, YOLO, convolutional neural networks, Object Relation Module.
REFERENCES
- Ying, S. et al. (2023), “Real-Time Segmentation of Artificial Targets Using a Dual-Modal Efficient Attention Fusion Network”, Remote Sensing, Vol. 15(18), 4398. DOI: https://doi.org/10.3390/rs15184398
- Rui, C. et al. (2024), “Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning”, Drones, Vol. 8(9), 451. DOI: https://doi.org/10.3390/drones8090451
- Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M. (2023), “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: https://doi.org/10.1109/CVPR52729.2023.00721
- Zhou, Z. et al. (2023), “Object Detection in Drone Video with Temporal Attention Gated Recurrent Unit Based on Transformer”, Drones, Vol. 7(7), 466. DOI: https://doi.org/10.3390/drones7070466
- Gupta, A., Fernando, X. (2022), “Simultaneous Localization and Mapping (SLAM) and Data Fusion in Unmanned Aerial Vehicles: Recent Advances and Challenges”, Drones, Vol. 6(4), 85. DOI: https://doi.org/10.3390/drones6040085
- Jiang, Y., Zhang, X., Li, Z., Wang, Y. (2021), “Multimodal sensor fusion for object detection in dynamic environments using drones”, URL: https://arxiv.org/abs/2110.12638
- Cai, Y., Qin, T., Ou, Y., Wei, R. (2023), “Intelligent Systems in Motion: A Comprehensive Review on Multi-Sensor Fusion and Information Processing From Sensing to Navigation in Path Planning”, International Journal on Semantic Web and Information Systems, Vol. 19(1), pp. 1-35. DOI: https://doi.org/10.4018/IJSWIS.333056
- Wojtkowiak, A., Skoczylas, J., Brudnowski, T. (2023), “YOLOv5 Drone Detection Using Multimodal Data Registered by the Vicon System”, Sensors, Vol. 23(14), 6396. DOI: https://doi.org/10.3390/s23146396
- Xu, S. et al. (2025), “A Method for Airborne Small-Target Detection with a Multimodal Fusion Framework Integrating Photometric Perception and Cross-Attention Mechanisms”, Remote Sensing, Vol. 17(7), 1118. DOI: https://doi.org/10.3390/rs17071118
- Li, S., Wang, H. et al. (2020), “An adaptive data fusion strategy for fault diagnosis based on the convolutional neural network”, Measurement, Vol. 165, 108122. DOI: https://doi.org/10.1016/j.measurement.2020.108122
- Sarlin, P., Honkavaara, E., Vilkko, M., Hakala, T., Markelin, L. (2022), “Deep learning with RGB and thermal images onboard a drone for search and rescue operations”, Journal of Field Robotics, Vol. 39(7), pp. 1063-1086. DOI: https://doi.org/10.1002/rob.22082
- Hu, R., Li, Z., Yang, J. (2018), “Relation Networks for Object Detection”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3588-3597. URL: https://cutt.ly/srTW49wS
- Ovchar, M., “Multimodal Object Detection”, GitHub, URL: https://cutt.ly/prTW5i3W
- Valada, A., Vertens, J., Dhall, A., Burgard, W. (2019), “Self-supervised model adaptation for multimodal semantic segmentation”, International Journal of Computer Vision, Vol. 128, pp. 1239-1285. DOI: https://doi.org/10.1007/s11263-019-01188-y
- Jiang, Z., Zhao, L., Li, S., Jia, Y. (2020), “Real-time object detection method based on improved YOLOv4-tiny”, arXiv preprint. DOI: https://doi.org/10.48550/arXiv.2011.04244