SQL Tips by Namwar Rizvi

May 31, 2007

Finding all misspelled string values quickly

Filed under: Information,Query,string manipulation,tips,TSQL — namwar @ 9:10 PM

When cleansing data imported from a legacy system, we often find that some values are spelled incorrectly and are therefore missing from the results of a query. Finding every misspelled version of a given value is difficult when you have thousands or millions of records.
Fortunately, there is a quick and easy approach: the SOUNDEX function in TSQL. Please note that Soundex is not a guaranteed way of finding all the incorrect versions, but it is one of the quickest and is nearly 90% accurate.
Soundex is an algorithm based on the idea that similar-sounding words should receive the same alphanumeric code. For example, if you have a column called "Color" containing different variations of the same color name, such as Red, Redd and Redh, then Soundex assigns the same code to all of them, and you can find these variations by comparing their Soundex codes. The following full working example illustrates the concept.
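Before running the full example, it can help to look at the codes themselves. The short sketch below simply prints the Soundex code for a few of the spellings used in this post. The column aliases are only labels chosen for illustration, and the codes in the comment follow from the standard Soundex rules (the first letter is kept, vowels and H are dropped, repeated consonants collapse, and the result is padded with zeros to four characters).

```sql
-- Print the Soundex code for each spelling
Select Soundex('Red')   as Red,
       Soundex('Redd')  as Redd,
       Soundex('Reddh') as Reddh,
       Soundex('Blue')  as Blue,
       Soundex('Green') as Green

-- Result: R300, R300, R300, B400, G650
```

All three spellings of Red collapse to R300, which is why an equality test on the code finds them together.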

--Disable SQL Server intermediate messages
Set NoCount On

--Create test Table containing daily data
Declare @m_TestTable table (ItemId int, Color varchar(50))

--Insert some sample values
Insert into @m_TestTable values(1,'Red')
Insert into @m_TestTable values(2,'Reddh')
Insert into @m_TestTable values(3,'Redd')
Insert into @m_TestTable values(4,'Blue')
Insert into @m_TestTable values(5,'Green')
Insert into @m_TestTable values(6,'Dark Red')

--Select all those items which are Red
Select * from @m_TestTable Where Soundex(Color)=Soundex('Red')
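As noted earlier, Soundex is not a guaranteed match. The sketch below tries to make that concrete with a few hypothetical values that are not part of the sample data above: two unrelated words that happen to share the code of 'Red', and two misspellings of 'Red' where the first letter is wrong or missing. The table variable @m_Check and its contents are assumptions made purely for illustration.

```sql
Set NoCount On

-- Hypothetical test values, not taken from the original sample data
Declare @m_Check table (Color varchar(50))

Insert into @m_Check values('Rat')   -- unrelated word
Insert into @m_Check values('Root')  -- unrelated word
Insert into @m_Check values('Ed')    -- 'Red' with the first letter missing
Insert into @m_Check values('Ged')   -- 'Red' with the first letter mistyped

-- Show each value, its code, and whether it would be picked up as 'Red'
Select Color,
       Soundex(Color) as SoundexCode,
       Case When Soundex(Color) = Soundex('Red')
            Then 'match' Else 'no match' End as Result
From @m_Check
```

'Rat' and 'Root' both produce R300 and are reported as matches, while 'Ed' (E300) and 'Ged' (G300) are not, because Soundex always keeps the first letter as is. It is therefore worth reviewing the matched rows by hand before updating any data.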


