Face detection, because of its vast array of applications, is one of the most active research areas in computer vision. In this book, we review various approaches to face detection developed in the past decade, with more emphasis on boosting-based learning algorithms. We then present a series of algorithms that are empowered by the statistical view of boosting and the concept of multiple instance learning. We start by describing a boosting learning framework that is capable of handling billions of training examples. It differs from traditional bootstrapping schemes in that no intermediate thresholds need to be set during training, yet the total number of negative examples used for feature selection remains constant and focused (on the poorly performing ones). A multiple instance pruning scheme is then adopted to set the intermediate thresholds after boosting learning. This algorithm generates detectors that are both fast and accurate.
We then present two multiple instance learning schemes for face detection: multiple instance learning boosting (MILBoost) and winner-take-all multiple category boosting (WTA-McBoost). MILBoost addresses the uncertainty in accurately pinpointing the location of the object being detected, while WTA-McBoost addresses the uncertainty in determining the most appropriate subcategory label for multiview object detection. Both schemes can resolve the ambiguity of the labeling process and reduce outliers during training, which leads to improved detector performance.
In many applications, a detector trained with generic data sets may not perform optimally in a new environment. We propose detector adaptation as a promising solution to this problem. We present an adaptation scheme based on the Taylor expansion of the boosting learning objective function, and we propose to store the second-order statistics of the generic training data for future adaptation. We show that with a small amount of labeled data in the new environment, the detector's performance can be greatly improved.
We also present two interesting applications where boosting learning was applied successfully. The first application is face verification for filtering and ranking image/video search results on celebrities. We present boosted multi-task learning (MTL), yet another boosting learning algorithm that extends MILBoost with a graphical model. Since the number of training images available for each celebrity may be limited, learning individual classifiers for each person may cause overfitting. MTL jointly learns classifiers for multiple people by sharing a few boosting classifiers in order to avoid overfitting. The second application addresses the need for speaker detection in conference rooms. The goal is to find who is speaking, given a microphone array and a panoramic video of the room. We show that by combining audio and visual features in a boosting framework, we can determine the speaker's position very accurately. Finally, we offer our thoughts on future directions for face detection.
Table of Contents: A Brief Survey of the Face Detection Literature / Cascade-based Real-Time Face Detection / Multiple Instance Learning for Face Detection / Detector Adaptation / Other Applications / Conclusions and Future Work
About the Author(s)
Cha Zhang, Microsoft Research
Cha Zhang is a Researcher in the Communication and Collaboration Systems Group at Microsoft Research (Redmond, WA). He received the B.S. and M.S. degrees from Tsinghua University, Beijing, China in 1998 and 2000, respectively, both in Electronic Engineering, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in 2004. His current research focuses on applying various machine learning and computer graphics/computer vision techniques to multimedia applications, in particular, multimedia teleconferencing. During his graduate studies at CMU, he worked on various multimedia-related projects including sampling and compression of image-based rendering data, 3D model database retrieval and active learning for database annotation, peer-to-peer networking, etc. Dr. Zhang has published more than 40 technical papers and holds 10+ U.S. patents. He won the best paper award at ICME 2007, the top 10% award at MMSP 2009, and the best student paper award at ICME 2010. He co-authored a book titled Light Field Sampling, published by Morgan and Claypool in 2006. Dr. Zhang is a Senior Member of IEEE. He was the Publicity Chair for International Packet Video Workshop in 2002, the Program Co-Chair for the first Immersive Telecommunication Conference (IMMERSCOM) in 2007, the Steering Committee Co-Chair and Publicity Chair for IMMERSCOM 2009, the Program Co-Chair for the ACM Workshop on Media Data Integration (in conjunction with ACM Multimedia 2009), and the Poster & Demo Chair for ICME 2011. He served as a TPC member for many conferences including ACM Multimedia, CVPR, ICCV, ECCV, MMSP, ICME, ICPR, ICWL, etc. He served as an Associate Editor for Journal of Distance Education Technologies, IPSJ Transactions on Computer Vision and Applications, and ICST Transactions on Immersive Telecommunications. He was a guest editor for Advances in Multimedia, Special Issue on Multimedia Immersive Technologies and Networking.
Zhengyou Zhang, Microsoft Research
Zhengyou Zhang received the B.S. degree in electronic engineering from the University of Zhejiang, Hangzhou, China, in 1985, the M.S. degree in computer science from the University of Nancy, Nancy, France, in 1987, and the Ph.D. degree in computer science and the Doctorate of Science from the University of Paris XI, Paris, France, in 1990 and 1994, respectively. He is a Principal Researcher with Microsoft Research, Redmond, WA, USA, and he manages the multimodal collaboration research team. Before joining Microsoft Research in March 1998, he was with INRIA (French National Institute for Research in Computer Science and Control), France, for 11 years and was a Senior Research Scientist from 1991. In 1996-1997, he spent a one-year sabbatical as an Invited Researcher with the Advanced Telecommunications Research Institute International (ATR), Kyoto, Japan. He has published over 200 papers in refereed international journals and conferences, and he has coauthored the following books: 3-D Dynamic Scene Analysis: A Stereo Based Approach (Springer-Verlag, 1992); Epipolar Geometry in Stereo, Motion and Object Recognition (Kluwer, 1996); Computer Vision (Chinese Academy of Sciences, 1998, 2003, in Chinese); and Face Geometry and Appearance Modeling (Cambridge University Press, 2010, to appear). He has given a number of keynotes at international conferences. Dr. Zhang is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), the Founding Editor-in-Chief of the IEEE Transactions on Autonomous Mental Development, an Associate Editor of the International Journal of Computer Vision, and an Associate Editor of Machine Vision and Applications. He served as an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence from 2000 to 2004 and of the IEEE Transactions on Multimedia from 2004 to 2009, among others. He has been on the program committees for numerous international conferences in the areas of autonomous mental development, computer vision, signal processing, multimedia, and human-computer interaction. He is a Program Co-Chair of the International Conference on Multimedia and Expo (ICME), July 2010, a Program Co-Chair of the ACM International Conference on Multimedia (ACM MM), October 2010, and a Program Co-Chair of the ACM International Conference on Multimodal Interfaces (ICMI), November 2010.