22 Aug 2016

A Progressive Approach for Duplicate Detection with Map Reduce

  • Department of Information Technology, RSET, India.

Duplicate detection is a crucial step for data quality and data integration. Cloud infrastructure is a popular paradigm that enables efficient parallel processing of data-intensive and computationally intensive tasks, such as duplicate detection on very large datasets. Many cloud computing applications use a programming model called MapReduce, which supports the parallel execution of data-intensive computing tasks in cluster environments with up to thousands of nodes. A method called Progressive Sorted Neighbourhood allows most duplicate pairs to be identified as early as possible in the detection process. To reduce the typically high execution times, this paper investigates how progressive sorted neighbourhood for data-intensive duplicate detection can be realized in a cloud infrastructure using MapReduce. The paper focuses on using the MapReduce programming model to achieve a highly efficient duplicate detection implementation for very large datasets.
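As a rough, illustrative sketch of how such a pipeline might fit the MapReduce model, the following Python code simulates the three stages on a single machine: a map phase emits a sorting key per record, a shuffle/sort stage orders records by that key, and a reduce phase slides a progressively widening comparison window over the sorted list. The toy records, the name-based sorting key, the SequenceMatcher similarity measure, and the window and threshold values are all assumptions made for illustration, not details taken from the paper; a real deployment would distribute these phases across a cluster.

    from difflib import SequenceMatcher

    # Toy records; all values below are illustrative assumptions, not data
    # or parameters from the paper.
    RECORDS = [
        {"id": 1, "name": "John Smith"},
        {"id": 2, "name": "Jon Smith"},
        {"id": 3, "name": "Mary Jones"},
        {"id": 4, "name": "Marie Jones"},
        {"id": 5, "name": "Peter Brown"},
    ]

    def map_phase(record):
        # Emit (sorting key, record); the shuffle stage sorts by this key
        # so that likely duplicates end up close together.
        yield (record["name"].lower(), record)

    def similarity(a, b):
        # Simple string similarity; a real system would use a tuned measure.
        return SequenceMatcher(None, a["name"], b["name"]).ratio()

    def reduce_phase(sorted_records, max_distance=3, threshold=0.8):
        # Progressive sorted neighbourhood: compare neighbours at rank
        # distance 1 first, then 2, ..., so promising pairs come first.
        for distance in range(1, max_distance + 1):
            for i in range(len(sorted_records) - distance):
                a, b = sorted_records[i], sorted_records[i + distance]
                if similarity(a, b) >= threshold:
                    yield (a["id"], b["id"], distance)

    # Simulate the MapReduce data flow on a single machine.
    keyed = [kv for r in RECORDS for kv in map_phase(r)]
    keyed.sort(key=lambda kv: kv[0])          # shuffle/sort stage
    ordered = [r for _, r in keyed]

    for a_id, b_id, dist in reduce_phase(ordered):
        print(f"candidate duplicates: {a_id} and {b_id} (distance {dist})")

Because all pairs at rank distance 1 are compared before any pair at distance 2, the most likely duplicates are reported first, which is the progressive property the abstract describes.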


Cite this article as: [Manju V J and Chinchu Krishna S. (2016); A Progressive Approach for Duplicate Detection with Map Reduce. Int. J. of Adv. Res. 4 (Aug), 52-55] (ISSN 2320-5407). www.journalijar.com


Corresponding Author: Manju V J