The time-worn aphorism "close only counts in horseshoes and hand-grenades" is clearly inadequate. Close also counts in golf, shuffleboard, archery, darts, curling, and other games of accuracy in which hitting the precise center of the target isn't to be expected every time, or in which we can expect to be driven from the target by skilled opponents.
This lecture is devoted not to sports, but to efficient algorithms for determining pairs of closely related web pages, and to a few other situations in which we have found that inexact matching is good enough: where proximity suffices. We will not, however, attempt to be comprehensive in the investigation of probabilistic algorithms, approximation algorithms, or even techniques for organizing the discovery of nearest neighbors. We are more concerned with finding nearby neighbors; if they are not particularly close by, we are not particularly interested.
In thinking of when approximation is sufficient, remember the oft-told joke about two campers sitting around after dinner. They hear noises coming towards them. One of them reaches for a pair of running shoes and starts to don them. The second notes that even with running shoes, they cannot hope to outrun a bear, to which the first replies that the bear will most likely be satiated after catching the slower of them. We seek problems in which we don't need to be faster than the bear, just faster than the others fleeing the bear.
Table of Contents
Foreword to the First Edition
Comparing Web Pages for Similarity: An Overview
A Personal History of Web Search
Uniform Sampling after Alta Vista
Why Weight (and How)?
A Few Applications
Forks in the Road: Flajolet and Slightly Biased Sampling
About the Author(s)
Mark S. Manasse
Mark Manasse was a Principal Researcher at Microsoft Research, which he joined in 2001 while writing the first edition of this book, and where he performed the research presented in the additional chapters of this second edition. From 1985 until he joined Microsoft, Mark was a researcher at Compaq's Systems Research Center in Palo Alto, California (previously Digital Equipment Corporation, subsequently Hewlett-Packard and now extinct). Mark worked at Microsoft until late 2014. He is now a Principal Architect (working on infrastructure security) at Salesforce, which he thanks for their support while writing the final chapter of this second edition.
Mark Manasse works in a variety of theory-related areas of distributed computer systems research. He was the inventor of MilliCent; as such, Wired Magazine dubbed him "the guru of micropayments," and he was co-chair of the microcommerce working group for the World Wide Web Consortium. Mark has worked on Web search technologies; with Andrei Broder, Steve Glassman, and Geoff Zweig, his work on syntactic similarity was awarded best paper at the Sixth International World Wide Web Conference. Mark was a member of the design committee for the Inter-Client Communications Manual for the X Window System. Mark's work on on-line algorithms helped to establish this field, and his papers on the subject remain among his most often cited. Mark organized, ran, and developed much of the code for some of the earliest uses of the Internet in distributed computations when he and Arjen Lenstra factored many large integers, the most noteworthy being the first factorization of a "hard" 100-digit number, and the factorization of the ninth Fermat number; for several years thereafter, Mark's license plate read "IDIDF9," leaving most other drivers puzzled. Mark holds U.S. patents in three of the previously mentioned areas. His doctorate was earned at the University of Wisconsin in Mathematical Logic in 1982, and he spent the following three years at Bell Labs and the University of Chicago. Mark's projects after joining Microsoft included Koh-i-Noor, PageTurner, Dryad, a minor role in Penny Black, and work in various unnamed projects. Additionally, Mark worked on aspects of deduplication with product groups in MSN Search (now Bing) and with the Windows Server group on aspects of file systems and storage, starting with Windows Server 2003, and continuing through Windows 8.
In 1994, Newsweek described Severe Tire Damage (the band Mark helped found and for which he played bass) as "lesser-known" than the Rolling Stones, following STD's unauthorized appearance as the opening act in a multicast performance headlined by the Stones. The band is content with that.