Data Cleaning

A Practical Perspective

Venkatesh Ganti, Anish Das Sarma
ISBN: 9781608456772 | PDF ISBN: 9781608456789
Copyright © 2013 | 85 Pages | Publication Date: 09/01/2013

DOI: 10.2200/S00523ED1V01Y201307DTM036

Data warehouses consolidate the various activities of a business and often form the backbone for generating reports that support important business decisions. Errors creep into data for a variety of reasons, including mistakes during input data collection and mistakes made while merging data collected independently across different databases. These errors often result in erroneous downstream reports and can negatively affect business decisions. Therefore, a critical challenge in maintaining a large data warehouse is ensuring that the quality of its data remains high. The process of maintaining high data quality is commonly referred to as data cleaning.

In this book, we first discuss the goals of data cleaning. These goals are often not well defined and can call for different solutions in different scenarios. To clarify them, we abstract out a common set of data cleaning tasks that often need to be addressed, which allows us to develop solutions for these common tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform, which involves building customizable operators that serve as building blocks for common solutions. This is similar to the approach of relational algebra for query processing, where a basic set of operators can be composed to build complex queries. Finally, we discuss the development of custom scripts that leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
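To make the operator-centric idea concrete, here is a minimal Python sketch of how a customizable similarity function, a similarity join operator, and a relational-style selection step might be composed into a small record matching script. It is an illustration under stated assumptions, not the book's implementation: the names `jaccard`, `similarity_join`, and `best_match`, the sample data, and the 0.5 threshold are all hypothetical, and the join is a naive nested loop rather than an optimized algorithm.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str, float]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def similarity_join(
    left: Iterable[str],
    right: Iterable[str],
    sim: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Hypothetical operator: return all (left, right, score) pairs whose
    similarity is at least `threshold`. Naive nested loop, for clarity only."""
    right_rows = list(right)
    result: List[Pair] = []
    for l in left:
        for r in right_rows:
            score = sim(l, r)
            if score >= threshold:
                result.append((l, r, score))
    return result


def best_match(pairs: Iterable[Pair]) -> Dict[str, Tuple[str, float]]:
    """Relational-style step: keep the highest-scoring match per left record."""
    best: Dict[str, Tuple[str, float]] = {}
    for l, r, score in pairs:
        if l not in best or score > best[l][1]:
            best[l] = (r, score)
    return best


# Illustrative (made-up) input records and reference table.
customers = ["Acme Corp Seattle", "Globex Inc"]
reference = ["Acme Corp Seattle WA", "Globex Inc", "Initech LLC"]

# The "script": compose the similarity join with the selection step.
pairs = similarity_join(customers, reference, jaccard, threshold=0.5)
matches = best_match(pairs)
print(matches)
```

Because the similarity function is passed in as a parameter, the same join operator could be reused with a different measure (for example, one based on edit distance) without changing the rest of the script, which is the kind of customizability the operator-centric approach aims for.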

Table of Contents

Technological Approaches
Similarity Functions
Operator: Similarity Join
Operator: Clustering
Operator: Parsing
Task: Record Matching
Task: Deduplication
Data Cleaning Scripts
Authors' Biographies

About the Author(s)

Venkatesh Ganti, Alation Inc.

Anish Das Sarma, Google Inc.

Related Series

Data Mining and Knowledge Discovery

