P.K. Nikolyuk, Y.V. Myshkivska, M.I. Ovchar
Èlektron. model. 2026, 48(2):115-128
ABSTRACT
The development and implementation of a real-time multimodal object recognition system based on deep convolutional neural networks are presented. The main focus is on integrating data from RGB cameras, infrared sensors, and depth sensors, which makes the system robust to changes in lighting and to the partial loss of sensor information. The proposed architecture is based on a modified You Only Look Once (YOLO) model and includes modules for preprocessing, noise filtering, adaptive feature fusion (late fusion), and optimization for embedded systems. YOLO is an approach to object recognition in images in which the entire scene is analyzed in a single pass of the neural network. Unlike traditional methods, which first detect regions of interest and then classify objects, YOLO performs localization and classification simultaneously, providing high speed and efficiency, especially in real time. Experimental results confirm the effectiveness of the proposed approach: in complex conditions, the system classifies objects observed by unmanned aerial vehicles (UAVs) more accurately than monomodal approaches. The developed solution is promising for use in military applications, rescue operations, autonomous transport, and video surveillance.
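The adaptive late-fusion step described in the abstract can be illustrated with a minimal sketch. The softmax gating scheme, the `late_fusion` function, and the feature shapes below are illustrative assumptions, not the authors' implementation; the sketch only shows how weighted fusion can tolerate the dropout of one modality (here, the depth sensor):

```python
import numpy as np

def late_fusion(features, gate_logits):
    """Adaptively fuse per-modality feature vectors (e.g. RGB, IR, depth).

    Illustrative assumption: each modality contributes a feature vector
    weighted by a softmax over learnable gate logits. A missing modality
    (None) is simply excluded and the remaining weights are renormalized,
    which models robustness to the loss of a sensor stream.
    """
    kept = [i for i, f in enumerate(features) if f is not None]
    logits = np.array([gate_logits[i] for i in kept], dtype=float)
    # Numerically stable softmax over the surviving modalities only.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    fused = sum(w * features[i] for w, i in zip(weights, kept))
    return fused, weights

# Toy per-modality feature vectors (hypothetical values).
rgb = np.array([0.9, 0.1, 0.0])
ir = np.array([0.2, 0.7, 0.1])
depth = None  # simulate loss of the depth sensor

fused, w = late_fusion([rgb, ir, depth], gate_logits=[1.0, 0.0, 0.0])
```

In a real detector the gate logits would be produced by a small learned subnetwork and the fused features passed to the YOLO detection head; here they are fixed constants purely for illustration.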
KEYWORDS
multimodal recognition, UAV, YOLO, convolutional neural networks, Object Relation Module.
REFERENCES
- Ying, S. et al. (2023), “Real-Time Segmentation of Artificial Targets Using a Dual-Modal Efficient Attention Fusion Network”, Remote Sensing, Vol. 15(18), 4398. DOI: https://doi.org/10.3390/rs15184398
- Rui, C. et al. (2024), “Drone-Based Visible-Thermal Object Detection with Transformers and Prompt Tuning”, Drones, Vol. 8(9), 451. DOI: https://doi.org/10.3390/drones8090451
- Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M. (2023), “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: https://doi.org/10.1109/CVPR52729.2023.00721
- Zhou, Z. et al. (2023), “Object Detection in Drone Video with Temporal Attention Gated Recurrent Unit Based on Transformer”, Drones, Vol. 7(7), 466. DOI: https://doi.org/10.3390/drones7070466
- Gupta, A., Fernando, X. (2022), “Simultaneous Localization and Mapping (SLAM) and Data Fusion in Unmanned Aerial Vehicles: Recent Advances and Challenges”, Drones, Vol. 6(4), 85. DOI: https://doi.org/10.3390/drones6040085
- Jiang, Y., Zhang, X., Li, Z., Wang, Y. (2021), “Multimodal sensor fusion for object detection in dynamic environments using drones”, URL: https://arxiv.org/abs/2110.12638
- Cai, Y., Qin, T., Ou, Y., Wei, R. (2023), “Intelligent Systems in Motion: A Comprehensive Review on Multi-Sensor Fusion and Information Processing From Sensing to Navigation in Path Planning”, International Journal on Semantic Web and Information Systems, Vol. 19(1), pp. 1-35. DOI: https://doi.org/10.4018/IJSWIS.333056
- Wojtkowiak, A., Skoczylas, J., Brudnowski, T. (2023), “YOLOv5 Drone Detection Using Multimodal Data Registered by the Vicon System”, Sensors, Vol. 23(14), 6396. DOI: https://doi.org/10.3390/s23146396
- Xu, S. et al. (2025), “A Method for Airborne Small-Target Detection with a Multimodal Fusion Framework Integrating Photometric Perception and Cross-Attention Mechanisms”, Remote Sensing, Vol. 17(7), 1118. DOI: https://doi.org/10.3390/rs17071118
- Li, S., Wang, H. et al. (2020), “An adaptive data fusion strategy for fault diagnosis based on the convolutional neural network”, Measurement, Vol. 165, 108122. DOI: https://doi.org/10.1016/j.measurement.2020.108122
- Sarlin, P., Honkavaara, E., Vilkko, M., Hakala, T., Markelin, L. (2022), “Deep learning with RGB and thermal images onboard a drone for search and rescue operations”, Journal of Field Robotics, Vol. 39(7), pp. 1063-1086. DOI: https://doi.org/10.1002/rob.22082
- Hu, R., Li, Z., Yang, J. (2018), “Relation Networks for Object Detection”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3588-3597. URL: https://cutt.ly/srTW49wS
- Ovchar, M., “Multimodal Object Detection”, GitHub, URL: https://cutt.ly/prTW5i3W
- Valada, A., Vertens, J., Dhall, A., Burgard, W. (2019), “Self-supervised model adaptation for multimodal semantic segmentation”, International Journal of Computer Vision, Vol. 128, pp. 1239-1285. DOI: https://doi.org/10.1007/s11263-019-01188-y
- Jiang, Z., Zhao, L., Li, S., Jia, Y. (2020), “Real-time object detection method based on improved YOLOv4-tiny”, arXiv preprint. DOI: https://doi.org/10.48550/arXiv.2011.04244