Dedupe Duplicates using Fuzzy / Proximity search
Last year I wrote a post about finding similar accounts for Dynamics CRM, which generated a lot of interest in the community. Understandably so, as duplicate accounts are a very common problem raised in nearly every CRM project. CRM's duplicate detection capabilities are only basic: they just do partial matching and cannot do any fuzzy or proximity matching.
Even with the latest and most advanced weaponry in CRM's armoury, i.e. Relevance Search, it is not yet at the point where it could tell that the following accounts are in fact the same companies.
Account | Potential Duplicate | Reason
Waste Management | Waset Manaegment | Typo
Public Storage Co. | Storage Public Co. | Wrong order
Scotts Miracle-Gro | Scott Miracles Gro | Plural
Melbourne University | Melbourne Univ. | Short form
I decided to improve and generalise my code a bit, so that it can be used not only for CRM but for any general requirement where you need to find duplicates based on proximity. I am going to share the code and approach in this blog.
Approach
This proximity search is based on string matching algorithms that use Edit Distance, i.e. the number of single-character edits needed to turn one string into another. The program starts by finding exact matches first; if it cannot find an exact match, it widens the search filter to find partial and proximity matches (i.e. words in the same neighbourhood, ordered in a different way, etc.).
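To make that concrete, here is a minimal sketch of how the individual scorers in the fuzzywuzzy library (the library the full script below uses) handle the kinds of variations listed in the table above; the scores in the comments are approximate:

from fuzzywuzzy import fuzz

# TYPO: plain edit-distance ratio still scores close misspellings highly
# (around 88 here, above the 80 cutoff the script uses)
print(fuzz.ratio("Waste Management", "Waset Manaegment"))

# WRONG ORDER: token_sort_ratio sorts the words before comparing,
# so reordered names score a perfect 100
print(fuzz.token_sort_ratio("Public Storage Co.", "Storage Public Co."))

# SHORT FORM: partial_ratio matches the shorter string against the best
# substring of the longer one, so a prefix like this scores 100
print(fuzz.partial_ratio("Melbourne University", "Melbourne Univ"))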
Results
I have also attached the original files that I used during my testing, i.e. the file containing the duplicates and the results file (where duplicates were found). Below is a brief snapshot of the results from my test run.
Company | Duplicate Found
Kimberly-Clark | Kimberly Clark
San disk | SanDisk
Macy’s | Macy
Starwood Hotels & Resorts | Starwood Hotels And Resorts
Expeditors Washington | Expeditors International of Washington
There were some false positives in the results as well, so you may need to adjust the thresholds of the algorithm to suit your data.
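The place to tune is the block of threshold constants defined near the top of the script in the last section. For example (an illustrative tweak, not the value the script ships with), raising the edit-distance cutoff makes the first filter stricter:

# A HIGHER CUTOFF MEANS ONLY CLOSER EDIT-DISTANCE MATCHES COUNT, WHICH
# REDUCES FALSE POSITIVES BUT MAY MISS SOME GENUINE DUPLICATES.
# THE SCRIPT BELOW SHIPS WITH 80; THE OTHER THREE CUTOFFS ALREADY SIT AT
# THE MAXIMUM SCORE OF 100.
FULL_MATCHING_THRESHOLD = 90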
How to use
You've got a list of companies and you want to know which of them are duplicates. This is what you need to do.
1. Export the list into a CSV file.
2. Point the code to your file.
3. Run the code and it generates a new file, results.csv, with a new column called Duplicate (see the example below the list).
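For instance, given a small hypothetical CompaniesShort.csv like this (the script skips the header row and reads the remaining lines as company names):

Company
Kimberly-Clark
Kimberly Clark
Acme Corp

the generated results.csv would look along these lines, with an empty Duplicate value for companies where no match was found:

Company,Duplicate
Kimberly-Clark,Kimberly Clark
Kimberly Clark,Kimberly-Clark
Acme Corp,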
Complete source code
Python is a beautiful language and does big things in just a few lines of code. Just install Python on your desktop and run the following file. No frills, no servers, no deployment. Too easy.
# ProximitySearch.py
# AUTHOR - MANNY GREWAL 2017 (https://mannygrewal.wordpress.com)
# THIS CODE WILL DO A FUZZY SEARCH FOR A SEARCH TERM INSIDE A DATABASE. THE PRECEDENCE OF SEARCH STARTS WITH THE
# NARROWEST FILTER, WHICH SLOWLY WIDENS UP. THE IDEA IS TO GET TO PERFECT MATCHES BEFORE NEAR MATCHES. EACH
# FILTER HAS ITS OWN THRESHOLD CUTOFF.

# IMPORT THE PACKAGES NEEDED
import os

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# DEFINE AND CONFIGURE THE SCORE CUTOFFS (0-100) FOR EACH FILTER
FULL_MATCHING_THRESHOLD = 80
PARTIAL_MATCHING_THRESHOLD = 100
SORT_MATCHING_THRESHOLD = 100
TOKEN_MATCHING_THRESHOLD = 100
MAX_MATCHES = 1

# READ THE CURRENT DATABASE (THE HEADER ROW OF THE CSV IS SKIPPED)
companies_db = "<local path of your CSV file>/CompaniesShort.csv"
pwd = os.getcwd()
os.chdir(os.path.dirname(companies_db))
current_db_dataframe = pd.read_csv(os.path.basename(companies_db), skiprows=1,
                                   index_col=False, names=['Company'])
os.chdir(pwd)

def find_matches(matchThis):
    # COMPARE AGAINST EVERY OTHER COMPANY IN THE FILE, EXCLUDING THE ROW ITSELF
    rows = current_db_dataframe['Company'].values.tolist()
    rows.remove(matchThis)
    # START WITH THE NARROWEST FILTER (PLAIN EDIT DISTANCE) AND ONLY WIDEN TO
    # PARTIAL, TOKEN-SET AND TOKEN-SORT MATCHING IF NOTHING HAS BEEN FOUND YET
    matches = process.extractBests(matchThis, rows, scorer=fuzz.ratio,
                                   score_cutoff=FULL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.partial_ratio,
                                       score_cutoff=PARTIAL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_set_ratio,
                                       score_cutoff=TOKEN_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.token_sort_ratio,
                                       score_cutoff=SORT_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    return matches[0][0] if len(matches) > 0 else None

# APPLY THE MATCHER TO THE COMPANY COLUMN AND WRITE OUT THE RESULTS
current_db_dataframe['Duplicate'] = current_db_dataframe['Company'].apply(find_matches)
current_db_dataframe.to_csv("results.csv", index=False)
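A quick usage note: besides Python itself, the script needs the pandas and fuzzywuzzy packages it imports, and fuzzywuzzy runs noticeably faster when python-Levenshtein is installed alongside it. A typical run from the command line looks like:

pip install pandas fuzzywuzzy python-Levenshtein
python ProximitySearch.py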

This was originally posted here.