We have the following task: Table1 contains the clients of the biggest grocery store in Munich, and Table2 all the fans of Bayern (Munich). The question is how many Bayern fans visit the grocery.
Let’s assume we have the columns First Name (FNAME), Last Name (LNAME), CITY, COUNTRY.
The easiest solution is:
select * from table2 t2 where exists (select 1 from table1 t1 where t1.fname = t2.fname and t1.lname = t2.lname and t1.city = t2.city and t1.country = t2.country)
But this will give us the result if all names are written correctly. Which, unfortunately, happens rarely. For example Mr. Schmitt from the fans may be written as Schmidt in the grocery; Mrs Pavlovich may be Pavlovic; John Smith is not the same as John Smit, etc. We need something better.
There is one SQL function which returns a character string containing the phonetic representation a string. The function is
soundex. The algorithm behind it is relatively simple and can be found here. Let’s try it:
select * from table2 t2 where exists (select 1 from table1 t1 where soundex(t1.fname) = soundex(t2.fname) and soundex(t1.lname) = soundex(t2.lname) and t1.city = t2.city and t1.country = t2.country)
(to make the things simple, i have assumed the CITY and COUNTRY are written fine. We can use the same transformation on them)
But this gives us too many false positives. I do not think Marko Mjagkov is the same as Marica Moisejevs, but soundex returns the same – M620 M221. We need something else
There is one nice, relatively new package, named utl_match. In this package there are functions using two different algorithms facing this problem. The first one is developed in Russia by Vladimir Levenshtein in 1965 (the function EDIT_DISTANCE). The other seems to be developed in US, by Matthew Jaro and William Winkler, published in 1990. Let’s see what we can do using the latter:
select * from table2 t2 where exists (select 1 from table1 t1 where utl_match.jaro_winkler(t1.fname, t2.fname) > 0.9 and utl_match.jaro_winkler(t1.lname, t2.lname) > 0.9 and t1.city = t2.city and t1.country = t2.country)
Hm, we have another problem – there is no effective way to speedup the query. If the grocery has 10 000 clients in Munich, and the football cub has 500 000 fans in the city, we will execute the function utl_match.jaro_winkler more than 5 000 000 000 times. It will, for example, try to compare Adolf with Gretta, which is nonsense. One cannot index this, or use another infrastructure tricks – only raw computing power will help. So it takes time.
And so, I decided to combine both approaches and came up with this:
select * from table2 t2 where exists (select 1 from table1 t1 where soundex(t1.fname) = soundex(t2.fname) and soundex(t1.lname) = soundex(t2.lname) and utl_match.jaro_winkler(t1.fname, t2.fname) > 0.9 and utl_match.jaro_winkler(t1.lname, t2.lname) > 0.9 and t1.city = t2.city and t1.country = t2.country)
soundex limits the number of checks, although there may be many false positives. One can even do a function-based index on Soundex, or a MView.
utl_match.jaro_winkler gives us the most “similar” names. Believe it or not, this works fine – on tables with 20К and 2M rows (all from the same city), it completed in tens of seconds and returns mostly reasonable results. Of course, there may be many coincidences – there are many men in Munich named Hans Shmidt. But it is useful.