Using Phonetics for Duplication Checking

  • Comments 1
When it comes to adding new entities within Dynamics (whether they’re names, addresses, product-masters or project-files) you want to ensure that your operators don’t end up adding the same or similar data (with subtle spelling variations). This causes reporting-problems and time often needs to be scheduled to perform de-duplication and recoding.
 

·         In some situations (like the legal industry) it is vital that this problem is caught before entity-creation (rather than after) otherwise a legal “conflict-of-interest” may occur.

 
Soundex” & “Difference” are a couple of very useful pattern matching algorithms within SQL Server. Unfortunately, there are not many Dynamics-houses that use them for entity duplication checking, simply because there is no natural implementation within X++.
 

·         Soundex: Converts a string into a four-character phonetic code. More information about the [Soundex] function can be found here: [http://msdn.microsoft.com/en-us/library/ms187384.aspx]

·         Difference: Compares two soundex codes and returns an integer representing the difference between them. More information about the [Difference] function can be found here: [http://msdn.microsoft.com/en-us/library/ms188753.aspx]

 
The whole purpose of these functions is to find “similar-sounding” names in your data (or names that have been mistyped).
 
For example: the english name “Scarborough” is sometimes mistyped as “Scarbrough” (one less “O”)… or a german name like “Lauhornz” could have been (lazily) typed in as “Lawhorns” (by someone who doesn’t know the correct pronounciation) … or more simply “Conells” (single “N” double “L”) could have already exist as “Connels” (double “N” single “L”)... even data taken “over-the-phone” or via “voice-recognition-systems” could be subject to silent lettering misinterpretations (e.g. “Fisher” should have been corrected to “Fischer”)… the list goes on!
 
The difficulty with “wild-card” fuzzy matching in Dynamics is that you must try word-portions when searching for potential duplicates. With “Lawhorns” you would have to replace the “w” with a “u” and possibly the “s” with a “z” to cover all pronounciation variants. You would also need an understanding of the way vowels & consonants sound in German. This is not a trivial thing to code into an algorithm and inevitably this becomes a human-verification task. Equally, with “Connells” a human-operator would have to check “CON*”. If this brought back too many (or irrelevant) results then they would have to fine tune the search by trying “CONEL*” and then even “CONNEL*” to cover all possible spelling possibilities. This requires “x” number of searches and is both time-consuming and frustrating for operators (often leading to duplicate data being entered regardless).
 
Making the job easier to spot potential matches is the key and an automated pattern matching algorithm that operates on “phonetic” sounds as opposed to fuzzy matching on “word-fragments” is the answer.
 
I’m going to start work in follow-up articles on a prototype that will recreate this functionality within Dynamics and demonstrate its usefulness when combined with things like entity-inception and voice recognition.
 
 
REGARDS
 
 
  • Greate job Khalid!

    I am looking forward to your next post because this is a topic also on our roadmap....

    Regards,

    Paul