In this book, we aim to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. More specifically, we cover the issues involved in the design of three separate systems that are common to web-scale search engines: the web crawling, indexing, and query processing systems. We present the performance challenges encountered in these systems and review a wide range of design alternatives employed as solutions to these challenges, focusing specifically on algorithmic and architectural optimizations. We discuss the available optimizations at different computational granularities, ranging from a single computer node to a collection of data centers. We provide some hints to both practitioners and theoreticians in the field about the way large-scale web search engines operate and the design choices they adopt. Moreover, we survey the efficiency literature, providing pointers to a large number of relatively important research papers. Finally, we discuss some open research problems in the context of search engine efficiency.
Table of Contents
The Web Crawling System
The Indexing System
The Query Processing System
About the Author(s)
B. Barla Cambazoglu, Yahoo! Labs
B. Barla Cambazoglu received his B.S., M.S., and Ph.D. degrees, all in computer engineering, from the Computer Engineering Department of Bilkent University in 1997, 2000, and 2006, respectively. After getting his Ph.D. degree, he worked as a postdoctoral researcher at Bilkent University for a short period of time. In 2006, he joined the Biomedical Informatics Department of the Ohio State University as a postdoctoral researcher. In 2008, he joined Yahoo Labs as a postdoctoral researcher. He became a research scientist and then a senior research scientist at the same institution in 2010 and 2012, respectively. Between 2013 and 2015, he was a senior manager, heading the web retrieval group in Yahoo Labs Barcelona. His main research interests are distributed information retrieval and web search efficiency. In 2010, 2011, 2014, and 2015, he co-organized the LSDS-IR workshop. He was the proceedings chair for WSDM'09 and the poster and proceedings chair for ECIR'12. He served as an area chair for SIGIR'13 and SIGIR'14. He regularly serves on the program committees of the SIGIR, WWW, and KDD conferences. He has published many papers in prestigious journals, including IEEE TPDS, JPDC, JASIST, Inf. Syst., ACM TWEB, and IP&M, and has presented papers and tutorials at top-tier conferences such as SIGIR, CIKM, WSDM, WWW, and KDD.
Ricardo Baeza-Yates, Yahoo! Labs
Ricardo Baeza-Yates has been VP of Research and Chief Research Scientist at Yahoo Labs, based in Sunnyvale, California, since August 2014. Before that, he founded and led the labs in Barcelona and Santiago de Chile from 2006 to 2015. Between 2008 and 2012 he also oversaw the Haifa lab. In addition, he is a part-time Professor at the Department of Information and Communication Technologies of the Universitat Pompeu Fabra in Barcelona, Spain, where in 2005 he was an ICREA research professor. Until 2004 he was a Professor, and before that founder and Director, of the Center for Web Research at the Department of Computing Science of the University of Chile (from which he is currently on a leave of absence). In 1989, he obtained a Ph.D. in computer science from the University of Waterloo, Canada. Before that, he obtained two master's degrees (M.Sc. CS and M.Eng. EE) and an electronic engineering degree from the University of Chile in Santiago. He is co-author of the best-selling textbook Modern Information Retrieval, published in 1999 by Addison-Wesley, with a second enlarged edition in 2011 that won the ASIST 2012 Book of the Year award. He is also co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991, and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992. In addition, he is the author or co-author of more than 500 other publications. From 2002 to 2004, he was an elected member of the board of governors of the IEEE Computer Society, and in 2012 he was elected to the ACM Council. He received the Organization of American States award for young researchers in exact sciences (1993), the Graham Medal for innovation in computing given by the University of Waterloo to distinguished alumni (2007), the CLEI Latin American distinction for contributions to CS in the region (2009), and the National Award of the Chilean Association of Engineers (2010), among other distinctions.
In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences, and since 2010 he has been a founding member of the Chilean Academy of Engineering. In 2009 he was named an ACM Fellow and in 2011 an IEEE Fellow.