Object Detection Using Computer Vision
Keywords:
Object Detection, Computer Vision, Deep Learning
Abstract
Object detection has become an important technology in artificial intelligence and computer vision. It is now used in many industries, including self-driving cars, medical image analysis, security systems, and manufacturing. This paper presents a detailed study of object detection methods in three main parts. The first part covers the basic theory and concepts of object detection. We explain how object detection differs from image classification, describe the main components of detection systems, and discuss how performance is measured using average precision (AP) and mean average precision (mAP). This part also reviews how detection models have developed over time, from traditional methods to modern one-stage and two-stage approaches, and provides practical examples using Faster R-CNN. The second part focuses on the YOLO (You Only Look Once) algorithm, a representative single-stage detector that is popular because it runs fast while maintaining good accuracy. We explain how YOLO works, including how it divides an image into a grid and predicts bounding boxes and class labels simultaneously, and we provide detailed code examples for detecting objects in still images using YOLOv5. The last part extends the discussion to video object detection, which introduces new challenges for real-time processing. We discuss techniques for improving performance, such as model quantization, pruning, and knowledge distillation, as well as combining object tracking with detection using DeepSORT. We demonstrate a practical application by detecting and tracking cars in traffic videos. These demonstrations illustrate how pretrained object detection models, particularly Faster R-CNN and YOLOv5, can be applied to still images and video frames for instructional purposes. The article does not aim to provide a systematic benchmark comparison; rather, it presents practical implementation examples for readers who are beginning to use object detection techniques.
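The AP and mAP metrics mentioned above can be made concrete with a short sketch. The Python below is a minimal illustration, not the article's own code. It assumes boxes in (x1, y1, x2, y2) pixel format, a single IoU threshold of 0.5 (the PASCAL VOC convention), and all-point interpolation of the precision-recall curve. COCO-style mAP instead averages AP over IoU thresholds from 0.50 to 0.95 and uses 101-point interpolation, so numbers from this sketch are not directly comparable to COCO results. The function names `iou`, `average_precision`, and `mean_average_precision` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection over Union (IoU) of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: Sequence[Tuple[int, float, Box]],  # (image_id, score, box), one class
    ground_truths: Sequence[Tuple[int, Box]],      # (image_id, box), same class
    iou_threshold: float = 0.5,
) -> float:
    """AP for one class: greedy matching by score, all-point interpolation."""
    if not ground_truths:
        return 0.0
    matched = [False] * len(ground_truths)
    tp: List[int] = []
    # Visit detections from most to least confident.
    for img_id, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, (gt_img, gt_box) in enumerate(ground_truths):
            if gt_img == img_id and not matched[j]:
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[best_j] = True
            tp.append(1)  # true positive
        else:
            tp.append(0)  # false positive: no unmatched ground truth overlaps enough

    # Precision and recall after each detection in ranked order.
    precisions, recalls, cum_tp = [], [], 0
    for k, hit in enumerate(tp, start=1):
        cum_tp += hit
        precisions.append(cum_tp / k)
        recalls.append(cum_tp / len(ground_truths))

    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """mAP: the mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0


if __name__ == "__main__":
    gts = [(0, (0, 0, 10, 10)), (1, (0, 0, 10, 10))]
    dets = [
        (0, 0.9, (0, 0, 10, 10)),    # exact match -> TP
        (1, 0.8, (20, 20, 30, 30)),  # no overlap  -> FP
        (1, 0.7, (1, 1, 10, 10)),    # IoU = 0.81  -> TP
    ]
    print(round(average_precision(dets, gts), 3))  # 0.833
```

Because detections are matched greedily in descending score order, a lower-scoring duplicate of an object that has already been matched counts as a false positive. This is how AP penalizes duplicate boxes.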
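YOLO's grid-based formulation can be illustrated with the box encoding described in the original YOLO paper (Redmon et al., 2016). The image is divided into an S × S grid, and the cell containing an object's center is responsible for predicting that object. Each cell outputs B boxes (x, y, w, h, confidence) plus C class probabilities, which gives an S × S × (B·5 + C) output tensor: 7 × 7 × 30 for S = 7, B = 2, C = 20 on PASCAL VOC. The Python sketch below shows only this encoding and its inverse. It is a deliberate simplification that ignores the network and the loss function, including the square-root treatment of width and height. It also omits the anchor boxes and multi-scale heads used by later versions such as YOLOv5. All function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def responsible_cell(cx: float, cy: float, img_w: int, img_h: int, S: int = 7) -> Tuple[int, int]:
    """(row, col) of the S x S grid cell that contains the point (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp points on the right/bottom edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def encode_box(box: Box, img_w: int, img_h: int, S: int = 7):
    """YOLOv1-style target: center offsets relative to the responsible cell,
    width and height relative to the whole image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    row, col = responsible_cell(cx, cy, img_w, img_h, S)
    x = cx / img_w * S - col  # offset within the cell, in [0, 1]
    y = cy / img_h * S - row
    w = (x2 - x1) / img_w     # fraction of the image width
    h = (y2 - y1) / img_h     # fraction of the image height
    return row, col, (x, y, w, h)


def decode_box(row: int, col: int, x: float, y: float, w: float, h: float,
               img_w: int, img_h: int, S: int = 7) -> Box:
    """Inverse of encode_box: map a cell-relative prediction back to pixels."""
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


def output_shape(S: int = 7, B: int = 2, C: int = 20) -> Tuple[int, int, int]:
    """Each cell predicts B boxes of 5 numbers (x, y, w, h, confidence) + C class scores."""
    return (S, S, B * 5 + C)


if __name__ == "__main__":
    print(output_shape())  # (7, 7, 30)
    row, col, target = encode_box((100, 150, 200, 250), img_w=448, img_h=448)
    print(row, col)        # 3 2
    print(decode_box(row, col, *target, img_w=448, img_h=448))
```

Every box and class score is produced by a single forward pass over this fixed grid, with no separate region-proposal stage. This is the source of the speed advantage that the article attributes to single-stage detectors.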
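The tracking-by-detection idea behind DeepSORT can be illustrated with a much simpler association step. The sketch below uses a substitute technique and is not DeepSORT. It greedily matches each frame's detections to existing tracks by IoU alone, assigns new track IDs to unmatched detections, and deletes tracks that stay unmatched for too many frames. DeepSORT (Wojke et al., 2017) additionally predicts each track's motion with a Kalman filter, compares CNN appearance embeddings, and solves the assignment with the Hungarian algorithm, which makes it far more robust to occlusions and crossing vehicles. The class name `IoUTracker` and its parameters (`iou_threshold`, `max_missed`) are hypothetical.

```python
from itertools import count
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class IoUTracker:
    """Greedy IoU tracker: an illustrative stand-in, not DeepSORT."""

    def __init__(self, iou_threshold: float = 0.3, max_missed: int = 5):
        self.iou_threshold = iou_threshold
        self.max_missed = max_missed      # frames a track may go unmatched
        self.tracks: Dict[int, Box] = {}  # track ID -> last matched box
        self.missed: Dict[int, int] = {}  # track ID -> consecutive misses
        self._next_id = count(1)

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Associate one frame's detections; return tracks matched or created this frame."""
        # Score every (track, detection) pair, highest IoU first.
        pairs = sorted(
            ((iou(t_box, d_box), tid, di)
             for tid, t_box in self.tracks.items()
             for di, d_box in enumerate(detections)),
            reverse=True,
        )
        used_tracks: Set[int] = set()
        used_dets: Set[int] = set()
        for score, tid, di in pairs:
            if score < self.iou_threshold:
                break  # all remaining pairs overlap even less
            if tid in used_tracks or di in used_dets:
                continue
            self.tracks[tid] = detections[di]
            self.missed[tid] = 0
            used_tracks.add(tid)
            used_dets.add(di)

        # Age unmatched tracks and drop those missing for too long.
        for tid in list(self.tracks):
            if tid not in used_tracks:
                self.missed[tid] += 1
                if self.missed[tid] > self.max_missed:
                    del self.tracks[tid], self.missed[tid]

        # Every unmatched detection starts a new track.
        for di, d_box in enumerate(detections):
            if di not in used_dets:
                tid = next(self._next_id)
                self.tracks[tid] = d_box
                self.missed[tid] = 0

        return {tid: box for tid, box in self.tracks.items() if self.missed[tid] == 0}


if __name__ == "__main__":
    tracker = IoUTracker()
    print(tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)]))  # two new IDs are created
    print(tracker.update([(52, 50, 62, 60), (1, 0, 11, 10)]))  # the same IDs follow the moved boxes
```

Matching uses only each track's last observed box. A fast-moving car whose boxes stop overlapping between frames would therefore receive a new ID, and DeepSORT's motion prediction and appearance features are designed to address this kind of failure.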
References
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP) (pp. 3464–3468). IEEE. https://doi.org/10.1109/ICIP.2016.7533003
Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv. https://arxiv.org/abs/2004.10934
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision – ECCV 2020 (Lecture Notes in Computer Science, Vol. 12346, pp. 213–229). Springer. https://doi.org/10.1007/978-3-030-58452-8_13
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587). IEEE. https://doi.org/10.1109/CVPR.2014.81
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). IEEE. https://doi.org/10.1109/CVPR.2016.90
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv. https://arxiv.org/abs/1704.04861
levan92. (n.d.). deep_sort_realtime [Computer software]. GitHub. Retrieved June 16, 2026, from https://github.com/levan92/deep_sort_realtime
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125). IEEE.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988). IEEE. https://doi.org/10.1109/ICCV.2017.324
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision – ECCV 2014 (Lecture Notes in Computer Science, Vol. 8693, pp. 740–755). Springer. https://doi.org/10.1007/978-3-319-10602-1_48
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2020). Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2), 261–318. https://doi.org/10.1007/s11263-019-01247-4
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer vision – ECCV 2016 (Lecture Notes in Computer Science, Vol. 9905, pp. 21–37). Springer. https://doi.org/10.1007/978-3-319-46448-0_2
OpenCV. (n.d.). OpenCV documentation. Retrieved June 16, 2026, from https://docs.opencv.org/4.x/
Park, J., Lee, J., & Jeong, J. (2024). LyFormer based object detection in reel package X-ray images of semiconductor component. Journal of King Saud University – Computer and Information Sciences, 36(1), Article 101859. https://doi.org/10.1016/j.jksuci.2023.101859
PyTorch. (n.d.). fasterrcnn_resnet50_fpn. Torchvision documentation. Retrieved June 16, 2026, from https://docs.pytorch.org/vision/stable/models/generated/torchvision.models.detection.fasterrcnn_resnet50_fpn.html
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788). IEEE. https://doi.org/10.1109/CVPR.2016.91
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7263–7271). IEEE. https://doi.org/10.1109/CVPR.2017.690
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv. https://arxiv.org/abs/1804.02767
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 28, pp. 91–99). Curran Associates, Inc.
Rosebrock, A. (2016). Intersection over Union (IoU) for object detection. PyImageSearch. https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). IEEE. https://doi.org/10.1109/CVPR.2018.00474
Shah, D. (2022, March 7). Mean average precision (mAP) explained: Everything you need to know. V7 Labs. https://www.v7labs.com/blog/mean-average-precision
Sokolan, R. (n.d.). Car and truck traffic on the highway in Europe, Poland – Summer day [Stock video]. Vecteezy. Retrieved June 16, 2026, from https://www.vecteezy.com/video/7957364-car-and-truck-traffic-on-the-highway-in-europe-poland-summer-day
Tan, M., Pang, R., & Le, Q. V. (2020). EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790). IEEE. https://doi.org/10.1109/CVPR42600.2020.01079
Ultralytics. (n.d.). Loading YOLOv5 from PyTorch Hub. Ultralytics Docs. Retrieved June 16, 2026, from https://docs.ultralytics.com/yolov5/tutorials/pytorch_hub_model_loading
Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., & Yeh, I.-H. (2020). CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1571–1580). IEEE Computer Society. https://doi.org/10.1109/CVPRW50498.2020.00203
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP) (pp. 3645–3649). IEEE. https://doi.org/10.1109/ICIP.2017.8296962
Yohanandan, S. (2020, June 9). mAP (mean average precision) might confuse you! Medium. https://medium.com/data-science/map-mean-average-precision-might-confuse-you-5956f1bfa9e2
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). ByteTrack: Multi-object tracking by associating every detection box. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer vision – ECCV 2022 (Lecture Notes in Computer Science, Vol. 13682, pp. 1–21). Springer. https://doi.org/10.1007/978-3-031-20047-2_1
License
Copyright (c) 2026 Jiramate Rujikornhirun

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All published content in JRM is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).