Great article. I was wondering if you could offer any perspective into the relative performance hits of adding each technique you mention. Will any of them wreak havoc on a system when processing a large data set?
Off the top of my head – not at the moment. But I would think the baseline name – to – name operation is somewhat analagous to a poor sorting algorithm in terms of efficiency. Nearly every matching algorithm will somewhat result in the same poor performance in terms of speed.
The only way I can think to beat it would be to use clustered indices which would at least guarantee – for example – that trying to match the name smith would at least jump within the second list to the appropriate cells right away as opposed to performing a complete table scan.
If the operation were taking place with CURSORS then you could code it somewhat like you would in normal programming – i.e. looping only through string beginning with “S” for instance. But it’s still dependent on having the comparison list sorted.
The real problem occurs if you don’t mitigate “fluff matches”. For example, lets say you’re using SOUNDEX and 2 @Smith@s in the source list and have 5 @Smith@s, 1 Smythe and 1 more Smyth in the comparison list. Both of the original @Smith@s will match up to every other Smith, Smythe and Smyth resulting in 14 matches as opposed to just two. The point of this rambling is that as the list becomes bigger, the “fluff matches” increase which could dramatically increase the data requirement to store the results.
RE: Performance Context by VnutZ :: NR10 :: Show
Off the top of my head – not at the moment. But I would think the baseline name – to – name operation is somewhat analagous to a poor sorting algorithm in terms of efficiency. Nearly every matching algorithm will somewhat result in the same poor performance in terms of speed.
The only way I can think to beat it would be to use clustered indices which would at least guarantee – for example – that trying to match the name
smithwould at least jump within the second list to the appropriate cells right away as opposed to performing a complete table scan.If the operation were taking place with CURSORS then you could code it somewhat like you would in normal programming – i.e. looping only through string beginning with “S” for instance. But it’s still dependent on having the comparison list sorted.
The real problem occurs if you don’t mitigate “fluff matches”. For example, lets say you’re using SOUNDEX and 2 @Smith@s in the source list and have 5 @Smith@s, 1
Smytheand 1 moreSmythin the comparison list. Both of the original @Smith@s will match up to every otherSmith,SmytheandSmythresulting in 14 matches as opposed to just two. The point of this rambling is that as the list becomes bigger, the “fluff matches” increase which could dramatically increase the data requirement to store the results.