Multimodal Object Recognition System Based on Modified YOLO Architecture

P.K. Nikolyuk, Y.V. Myshkivska, M.I. Ovchar

Èlektron. model. 2026, 48(2):115-128

ABSTRACT

This paper presents the development and implementation of a real-time multimodal object recognition system based on deep convolutional neural networks. The main focus is on integrating data from RGB cameras, infrared sensors, and depth sensors, which makes the system robust to changes in lighting and to the loss of data from individual sensors. The proposed architecture is based on a modified You Only Look Once (YOLO) model and includes modules for preprocessing, noise filtering, adaptive feature fusion (late fusion), and optimization for embedded systems. YOLO is an approach to object recognition in images in which the entire scene is analyzed in a single pass of the neural network. Unlike traditional methods, which first detect regions of interest and then classify objects, YOLO performs localization and classification simultaneously, which gives it high speed and efficiency, especially in real-time settings. Experimental results confirm the effectiveness of the proposed approach: when objects are classified from unmanned aerial vehicles (UAVs) in complex conditions, the system achieves higher accuracy than monomodal approaches. The developed solution is promising for military applications, rescue operations, autonomous transport, and video surveillance.
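The abstract does not state the fusion rule itself, so the following is only a minimal sketch of what adaptive late fusion with sensor dropout could look like. It assumes that each modality produces its own per-class scores and that fusion is a reliability-weighted average, renormalized over whichever sensors are available. The function name `late_fusion`, the modality keys, and all weights are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def late_fusion(
    scores: Dict[str, Optional[List[float]]],
    reliability: Dict[str, float],
) -> List[float]:
    """Fuse per-modality class scores with reliability weights.

    A modality whose scores are None (sensor dropout) is skipped, and the
    remaining weights are renormalized so that they still sum to 1.
    This weighting scheme is an illustrative assumption, not the paper's.
    """
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        raise ValueError("no modality produced scores")

    total = sum(reliability[m] for m in available)
    if total <= 0:
        raise ValueError("reliability weights must sum to a positive value")

    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for modality, class_scores in available.items():
        weight = reliability[modality] / total  # renormalized weight
        for i, p in enumerate(class_scores):
            fused[i] += weight * p
    return fused


# Hypothetical night-time scores for the classes [person, vehicle]:
# RGB is uninformative, infrared is confident, the depth sensor has failed.
scores = {"rgb": [0.5, 0.5], "ir": [0.9, 0.1], "depth": None}
reliability = {"rgb": 0.2, "ir": 0.6, "depth": 0.2}
fused = late_fusion(scores, reliability)
print(fused)
```

With these made-up inputs the fused scores come out at roughly 0.8 for the first class and 0.2 for the second: the infrared branch dominates because it carries the largest weight among the sensors still available, which is the kind of behaviour the abstract describes as robustness to lighting changes and sensor loss.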


KEYWORDS

multimodal recognition, UAV, YOLO, convolutional neural networks, Object Relation Module.
