Similar names

 Oracle  Comments Off on Similar names
Dec 232011
 

We have the following task: Table1 contains the clients of the biggest grocery store in Munich, and Table2 all the fans of Bayern (Munich). The question is how many Bayern fans visit the grocery.
Let’s assume we have the columns First Name (FNAME), Last Name (LNAME), CITY, COUNTRY.

The easiest solution is:

select *

  from table2 t2

 where exists (select 1

          from table1 t1

         where t1.fname = t2.fname

           and t1.lname = t2.lname

           and t1.city = t2.city

           and t1.country = t2.country)

But this will give us the result if all names are written correctly. Which, unfortunately, happens rarely. For example Mr. Schmitt from the fans may be written as Schmidt in the grocery; Mrs Pavlovich may be Pavlovic; John Smith is not the same as John Smit, etc. We need something better.

There is one SQL function which returns a character string containing the phonetic representation a string. The function is soundex. The algorithm behind it is relatively simple and can be found here. Let’s try it:

select *

  from table2 t2

 where exists (select 1

          from table1 t1

         where soundex(t1.fname) = soundex(t2.fname)

           and soundex(t1.lname) = soundex(t2.lname)

           and t1.city = t2.city

           and t1.country = t2.country)

(to make the things simple, i have assumed the CITY and COUNTRY are written fine. We can use the same transformation on them)

But this gives us too many false positives. I do not think Marko Mjagkov is the same as Marica Moisejevs, but soundex returns the same – M620 M221. We need something else

There is one nice, relatively new package, named utl_match. In this package there are functions using two different algorithms facing this problem. The first one is developed in Russia by Vladimir Levenshtein in 1965 (the function EDIT_DISTANCE). The other seems to be developed in US, by Matthew Jaro and William Winkler, published in 1990. Let’s see what we can do using the latter:

select *

  from table2 t2

 where exists (select 1

          from table1 t1

         where utl_match.jaro_winkler(t1.fname, t2.fname) > 0.9

           and utl_match.jaro_winkler(t1.lname, t2.lname) > 0.9

           and t1.city = t2.city

           and t1.country = t2.country)

Hm, we have another problem – there is no effective way to speedup the query. If the grocery has 10 000 clients in Munich, and the football cub has 500 000 fans in the city, we will execute the function utl_match.jaro_winkler more than 5 000 000 000 times. It will, for example, try to compare Adolf with Gretta, which is nonsense. One cannot index this, or use another infrastructure tricks – only raw computing power will help. So it takes time.

And so, I decided to combine both approaches and came up with this:

select *

  from table2 t2

 where exists (select 1

          from table1 t1

         where soundex(t1.fname) = soundex(t2.fname)

           and soundex(t1.lname) = soundex(t2.lname)

           and utl_match.jaro_winkler(t1.fname, t2.fname) > 0.9

           and utl_match.jaro_winkler(t1.lname, t2.lname) > 0.9

           and t1.city = t2.city

           and t1.country = t2.country)

Here soundex limits the number of checks, although there may be many false positives. One can even do a function-based index on Soundex, or a MView.
Then utl_match.jaro_winkler gives us the most “similar” names. Believe it or not, this works fine – on tables with 20К and 2M rows (all from the same city), it completed in tens of seconds and returns mostly reasonable results. Of course, there may be many coincidences – there are many men in Munich named Hans Shmidt. But it is useful.

P.S. During the research we encountered PETER. We did not test it because it requires using extproc, which can potentially lead to security problem. But the product looks very promising.

 Posted by at 10:47