Object Detection Basics
Overview
Accurately localizing objects in images is an extremely useful technology that enables a variety of applications, for example regulating traffic.
This project provides a brief history of object detection and presents three of the most common detection algorithm families in detail.
Topics covered
The covered topics include:
The different goals of object recognition, with object detection as one such goal
The evolution of datasets for evaluating object recognition models
An explanation of evaluation metrics for object detection models, such as Average Precision and Average Recall at different object scales (see the IoU sketch after this list)
An introduction to detection before deep learning, starting with the definition of a feature in computer vision. On this basis, the traditional feature extractors Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) are explained (see the HOG sketch after this list)
An introduction to the convolutional neural network (CNN) architecture (see the minimal CNN sketch after this list)
Detailed explanations of the Faster R-CNN, You Only Look Once (YOLO), and Detection Transformer (DETR) object detection architectures (see the detector inference sketch after this list)
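The detection metrics above are built on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following minimal Python sketch (the function name and the example boxes are illustrative, not taken from the project) shows how IoU is computed; a prediction counts as a true positive when its IoU with a ground-truth box exceeds a threshold, and COCO-style Average Precision averages over thresholds from 0.5 to 0.95.

    def iou(box_a, box_b):
        # Boxes are [x1, y1, x2, y2] with x2 > x1 and y2 > y1.
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143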
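As a companion to the HOG topic, here is a hedged sketch of extracting a HOG descriptor with scikit-image; the image size and parameter values are illustrative choices (they mirror the classic 128x64 pedestrian window of Dalal & Triggs) and are not taken from the project.

    import numpy as np
    from skimage.feature import hog

    image = np.random.rand(128, 64)  # stand-in for a grayscale 128x64 detection window
    descriptor = hog(
        image,
        orientations=9,              # gradient-orientation bins per cell
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),      # blocks over which cell histograms are normalized
        feature_vector=True,
    )
    print(descriptor.shape)          # (3780,) for this configuration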
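To make the CNN topic concrete, the sketch below defines a deliberately tiny network in PyTorch showing the convolution / pooling / fully connected pattern; the layer sizes and class count are arbitrary examples, not the project's model.

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local filters
                nn.ReLU(),
                nn.MaxPool2d(2),                             # halve spatial resolution
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)

        def forward(self, x):
            x = self.features(x)                  # (N, 32, 8, 8) for 32x32 RGB inputs
            return self.classifier(x.flatten(1))

    logits = TinyCNN()(torch.randn(1, 3, 32, 32))
    print(logits.shape)                           # torch.Size([1, 10])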
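Finally, a hedged inference sketch for one of the detector families discussed (Faster R-CNN), using torchvision's off-the-shelf COCO-pretrained implementation rather than any code from this project; the random tensor stands in for a real image scaled to [0, 1], and weights="DEFAULT" assumes torchvision 0.13 or newer.

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained detector
    model.eval()

    image = torch.rand(3, 480, 640)                     # stand-in for an RGB image in [0, 1]
    with torch.no_grad():
        prediction = model([image])[0]                  # dict with boxes, labels, scores

    for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
        if score > 0.5:                                 # keep only confident detections
            print(label.item(), round(score.item(), 2), box.tolist())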
Project links
References
Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection. ArXiv:2004.10934 [Cs, Eess]. http://arxiv.org/abs/2004.10934
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. ArXiv:2005.12872 [Cs]. http://arxiv.org/abs/2005.12872
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619. https://doi.org/10.1109/34.1000236
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1, 886–893. https://doi.org/10.1109/CVPR.2005.177
Doshi, K. (2020, December 13). Transformers Explained Visually. https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1), 98–136. https://doi.org/10.1007/s11263-014-0733-5
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. ArXiv:1311.2524 [Cs]. http://arxiv.org/abs/1311.2524
Girshick, R. (2015). Fast R-CNN. ArXiv:1504.08083 [Cs]. http://arxiv.org/abs/1504.08083
Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830. https://doi.org/10.1109/TPAMI.2015.2465908
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv:1502.03167 [Cs]. http://arxiv.org/abs/1502.03167
Jocher, G. (2020). YOLOv5. https://github.com/ultralytics/yolov5
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems 2.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2015). Microsoft COCO: Common Objects in Context. ArXiv:1405.0312 [Cs]. http://arxiv.org/abs/1405.0312
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature Pyramid Networks for Object Detection. ArXiv:1612.03144 [Cs]. http://arxiv.org/abs/1612.03144
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018). Focal Loss for Dense Object Detection. ArXiv:1708.02002 [Cs]. http://arxiv.org/abs/1708.02002
Li, E. Y. (2019, December 30). Dive Really Deep into YOLO v3: A Beginner’s Guide. https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. ArXiv:1411.4038 [Cs]. http://arxiv.org/abs/1411.4038
Long, X., Deng, K., Wang, G., Zhang, Y., Dang, Q., Gao, Y., Shen, H., Ren, J., Han, S., Ding, E., & Wen, S. (2020). PP-YOLO: An Effective and Efficient Implementation of Object Detector. ArXiv:2007.12099 [Cs]. http://arxiv.org/abs/2007.12099
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. ArXiv:1506.02640 [Cs]. http://arxiv.org/abs/1506.02640
Redmon, J., & Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger. ArXiv:1612.08242 [Cs]. http://arxiv.org/abs/1612.08242
Redmon, J., & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. ArXiv:1804.02767 [Cs]. http://arxiv.org/abs/1804.02767
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. ArXiv:1506.01497 [Cs]. http://arxiv.org/abs/1506.01497
Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv:1409.1556 [Cs]. http://arxiv.org/abs/1409.1556
Shi, J., & Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep Neural Networks for Object Detection. Advances in Neural Information Processing Systems 26. https://papers.nips.cc/paper/2013/file/f7cade80b7cc92b991cf4d2806d6bd78-Paper.pdf
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. CVPR 2011, 1521–1528. https://doi.org/10.1109/CVPR.2011.5995347
Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective Search for Object Recognition. International Journal of Computer Vision, 104(2), 154–171. https://doi.org/10.1007/s11263-013-0620-5
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv:1706.03762 [Cs]. http://arxiv.org/abs/1706.03762
Wertheimer, M. (1938). Laws of organization in perceptual forms (partial translation). In W. D. Ellis (Ed.), A Sourcebook of Gestalt Psychology (pp. 71–88). Harcourt, Brace and Company.
Zeiler, M. D., & Fergus, R. (2013). Visualizing and Understanding Convolutional Networks. ArXiv:1311.2901 [Cs]. http://arxiv.org/abs/1311.2901