Dedupe Duplicates using Fuzzy / Proximity search
Last year I wrote a post about finding similar accounts for Dynamics CRM, which generated a lot of interest in the community. Understandably so, as duplicate accounts are a very common problem raised in nearly every CRM project. CRM's duplicate detection capabilities are only basic: they just do partial matching and cannot do any fuzzy or proximity matching.
Even with the latest and most advanced weaponry in CRM's armoury, i.e. Relevance Search, it is not yet at the point where it could tell that the following accounts are in fact the same companies.
Account | Potential Duplicate | Reason
Waste Management | Waset Manaegment | Typo
Public Storage Co. | Storage Public Co. | Wrong order
Scotts Miracle-Gro | Scott Miracles Gro | Plural
Melbourne University | Melbourne Univ. | Short form
I decided to improve and generalise my code a bit, so that it can be used not only for CRM but for any general requirement where you need to find duplicates based on proximity. I am going to share the code and approach in this blog.
Approach
This proximity search is based on string matching algorithms that use Edit Distance, i.e. the number of single-character edits needed to turn one string into another. The program starts by finding exact matches first; if it cannot find an exact match, it widens the search filter to find partial and proximity matches (i.e. words in the same neighbourhood, ordered in a different way, etc.).
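To make that concrete, here is a minimal sketch of how the individual scorers in the fuzzywuzzy library (the library the full script below uses) handle the kinds of variations listed in the table above; the scores in the comments are approximate:

from fuzzywuzzy import fuzz

# TYPO: plain edit-distance ratio still scores close misspellings highly
# (around 88 here, above the 80 cutoff the script uses)
print(fuzz.ratio("Waste Management", "Waset Manaegment"))

# WRONG ORDER: token_sort_ratio sorts the words before comparing,
# so reordered names score a perfect 100
print(fuzz.token_sort_ratio("Public Storage Co.", "Storage Public Co."))

# SHORT FORM: partial_ratio matches the shorter string against the best
# substring of the longer one, so a prefix like this scores 100
print(fuzz.partial_ratio("Melbourne University", "Melbourne Univ"))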
Results
I have also attached the original files that I used during my testing, i.e. the file containing the duplicates and the results file (where duplicates were found). Below is a brief snapshot of the results from my test run.
Company | Duplicate Found
Kimberly-Clark | Kimberly Clark
San disk | SanDisk
Macy’s | Macy
Starwood Hotels & Resorts | Starwood Hotels And Resorts
Expeditors Washington | Expeditors International of Washington
There were some false positives in the results as well, so you may need to adjust the thresholds of the algorithm to suit your data.
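The place to tune is the block of threshold constants defined near the top of the script in the last section. For example (an illustrative tweak, not the value the script ships with), raising the edit-distance cutoff makes the first filter stricter:

# A HIGHER CUTOFF MEANS ONLY CLOSER EDIT-DISTANCE MATCHES COUNT, WHICH
# REDUCES FALSE POSITIVES BUT MAY MISS SOME GENUINE DUPLICATES.
# THE SCRIPT BELOW SHIPS WITH 80; THE OTHER THREE CUTOFFS ALREADY SIT AT
# THE MAXIMUM SCORE OF 100.
FULL_MATCHING_THRESHOLD = 90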
How to use
You've got a list of companies and you want to know which of them are duplicates. This is what you need to do.
1. Export the list into a CSV file.
2. Point the code to your file.
3. Run the code and it generates a new file, results.csv, with a new column called Duplicate (see the example below the list).
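For instance, given a small hypothetical CompaniesShort.csv like this (the script skips the header row and reads the remaining lines as company names):

Company
Kimberly-Clark
Kimberly Clark
Acme Corp

the generated results.csv would look along these lines, with an empty Duplicate value for companies where no match was found:

Company,Duplicate
Kimberly-Clark,Kimberly Clark
Kimberly Clark,Kimberly-Clark
Acme Corp,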
Complete source code
Python is a beautiful language and does big things in just a few lines of code. Just install Python on your desktop and run the following file. No frills, no servers, no deployment. Too easy.
# ProximitySearch.py
# AUTHOR - MANNY GREWAL 2017 (https://mannygrewal.wordpress.com)
# THIS CODE WILL DO A FUZZY SEARCH FOR A SEARCH TERM INSIDE A DATABASE. THE PRECEDENCE OF SEARCH STARTS WITH THE
# NARROWEST FILTER, WHICH SLOWLY WIDENS UP. THE IDEA IS TO GET TO PERFECT MATCHES BEFORE NEAR MATCHES. EACH
# FILTER HAS ITS OWN THRESHOLD CUTOFF.

# IMPORT THE PACKAGES NEEDED
import os

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# DEFINE AND CONFIGURE THE SCORE CUTOFFS (0-100) FOR EACH FILTER
FULL_MATCHING_THRESHOLD = 80
PARTIAL_MATCHING_THRESHOLD = 100
SORT_MATCHING_THRESHOLD = 100
TOKEN_MATCHING_THRESHOLD = 100
MAX_MATCHES = 1

# READ THE CURRENT DATABASE (THE HEADER ROW OF THE CSV IS SKIPPED)
companies_db = "<local path of your CSV file>/CompaniesShort.csv"
pwd = os.getcwd()
os.chdir(os.path.dirname(companies_db))
current_db_dataframe = pd.read_csv(os.path.basename(companies_db), skiprows=1,
                                   index_col=False, names=['Company'])
os.chdir(pwd)

def find_matches(matchThis):
    # COMPARE AGAINST EVERY OTHER COMPANY IN THE FILE, EXCLUDING THE ROW ITSELF
    rows = current_db_dataframe['Company'].values.tolist()
    rows.remove(matchThis)
    # START WITH THE NARROWEST FILTER (PLAIN EDIT DISTANCE) AND ONLY WIDEN TO
    # PARTIAL, TOKEN-SET AND TOKEN-SORT MATCHING IF NOTHING HAS BEEN FOUND YET
    matches = process.extractBests(matchThis, rows, scorer=fuzz.ratio,
                                   score_cutoff=FULL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.partial_ratio,
                                       score_cutoff=PARTIAL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_set_ratio,
                                       score_cutoff=TOKEN_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_sort_ratio,
                                       score_cutoff=SORT_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    return matches[0][0] if len(matches) > 0 else None

# APPLY THE MATCHER TO THE COMPANY COLUMN AND WRITE OUT THE RESULTS
current_db_dataframe['Duplicate'] = current_db_dataframe['Company'].apply(find_matches)
current_db_dataframe.to_csv("results.csv", index=False)
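A quick usage note: besides Python itself, the script needs the pandas and fuzzywuzzy packages it imports, and fuzzywuzzy runs noticeably faster when python-Levenshtein is installed alongside it. A typical run from the command line looks like:

pip install pandas fuzzywuzzy python-Levenshtein
python ProximitySearch.py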

This was originally posted here.