Sanjay has asked for the wisdom of the Perl Monks concerning the following question:

I need to search a large text file for similar strings, i.e. detect distorted variants of the same string and get lists like "New York", "Newyork", "Neww York", "Few York", "Ewe Pork", etc.

I would like frequencies: "New York" {67}, "Neww York" {12}, ... (assuming 67 occurrences of "New York" in the file).

I would like the line numbers of each occurrence: ... "Neww York" {2 - 12,144} ... (assuming this string occurs twice, on lines 12 and 144).

Matches that do not start on a word boundary should be handled too: "Anew York" matches "New York". "Anew York" would probably have its own separate list (can we link across lists?)
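
To make the wish list above concrete, here is a minimal sketch in Python of the kind of result I have in mind. It rests on assumptions that are not part of the question: the target string ("New York") is known in advance, candidates are runs of one or two words, spaces and case are ignored when comparing, and similarity is measured by plain Levenshtein edit distance. The names `edit_distance`, `find_variants` and `max_dist` are made up for illustration.

```python
import re
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance, computed with two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def find_variants(lines, target, max_dist=2):
    """Map each variant of `target` to the line numbers where it occurs.

    Assumption: a variant is one word or two adjacent words whose
    letters, with spaces and case ignored, are within `max_dist`
    edits of the target's letters.
    """
    key = "".join(target.lower().split())
    hits = defaultdict(list)
    for lineno, line in enumerate(lines, 1):
        words = re.findall(r"[A-Za-z]+", line)   # drop punctuation
        for n in (1, 2):                         # 1-word and 2-word windows
            for i in range(len(words) - n + 1):
                cand = " ".join(words[i:i + n])
                squeezed = "".join(words[i:i + n]).lower()
                if edit_distance(squeezed, key) <= max_dist:
                    hits[cand].append(lineno)
    return hits


if __name__ == "__main__":
    sample = [
        "I love New York in spring.",
        "Neww York is big",
        "Newyork and Few York",
        "Anew York, New York",
    ]
    for variant, where in find_variants(sample, "New York").items():
        # Same shape as the output asked for: "string" {count - lines}
        print(f'"{variant}" {{{len(where)} - {",".join(map(str, where))}}}')
```

The frequency is simply the length of each line-number list. Because "Anew York" is caught by the same distance test, it ends up as its own entry next to "New York"; linking such entries across lists is not attempted here.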


1. Minimum length of string should be a parameter.

2. How much distortion to tolerate should also be a parameter: if enough distortion is allowed, even "hello" may match "goodbye".

3. Probably would have to be broken down into separate scripts!

4. Flexible on output format
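
Points 1 and 2 could be exposed as two knobs, roughly as in the following Python sketch. It is an illustration under assumptions, not a proposed design: the similarity measure is the ratio from the standard library's `difflib.SequenceMatcher` (one option among many), and the function name `is_match` and the defaults `min_len=4` and `min_ratio=0.8` are made up.

```python
from difflib import SequenceMatcher


def is_match(a, b, min_len=4, min_ratio=0.8):
    """Decide whether two strings count as variants of each other.

    min_len   -- strings shorter than this are never compared (point 1)
    min_ratio -- similarity in [0, 1] required for a match; lowering it
                 tolerates more distortion (point 2)
    """
    if len(a) < min_len or len(b) < min_len:
        return False
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= min_ratio


if __name__ == "__main__":
    for a, b in [("New York", "Neww York"), ("hello", "goodbye")]:
        for threshold in (0.8, 0.1):
            print(a, "|", b, "| min_ratio", threshold, "->",
                  is_match(a, b, min_ratio=threshold))
```

With a strict threshold "hello" and "goodbye" stay apart, but a very permissive one lets them match, which is exactly the risk noted in point 2.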