http://www.perlmonks.org?node_id=75326

tedv has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I have an input file that looks like this:
This is just lines of text here, and also there. Consider
this human readable text; it's full of letters and punctuation.
The input file could be several hundred lines. Now also suppose I have a hash table containing entries like:
    my $table = {
        "lines_of_text" => "foo.html",
        "this"          => "bar.html",
        "its_full"      => "foobar.html",
    };
There could be as many as 5000 entries in this table. For each entry in the hash table, we want to find the first segment of the input that maps to the key and wrap it in the appropriate HTML link. So this:
... just lines of text ...
turns into this:
... just <a href="foo.html">lines of text</a> ...
Of course, we link only the initial "This", not the "this" that starts line 2.

I see only two ways of solving this problem, and both of them are extremely inelegant. I could either write one massive regular expression that ORs together all 5000 keys, or I could walk through the file one character at a time and check whether any key starts at that point. Clearly both of these solutions are unacceptable.

The trickiness in this problem comes from the fact that each key has to match many possible input forms. For example, if I had the key "abc", it should match ABC, AB'C, and A,B'C, but not A B C. I could apply some massive substitution to normalize the input data set, but because some characters are deleted (like commas and apostrophes), there isn't an easy translation between a character index in the original data set and the corresponding index in the translated data set.
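To make the index-translation problem concrete, here is a minimal sketch of one way it might be handled: build the normalized string one character at a time and remember, for every normalized character, where it came from in the original. The normalization rules (lowercase, whitespace runs become "_", every other non-alphanumeric is dropped) are only my guess from the examples above, and the sub names normalize_with_offsets and link_first_matches are made up for illustration. Overlapping matches are resolved by simply keeping the leftmost one, which is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize $text the way the keys appear to be built
# and record, for each normalized character, its offset in the original.
sub normalize_with_offsets {
    my ($text) = @_;
    my $norm = '';
    my @offset;
    for my $i (0 .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if ($ch =~ /[A-Za-z0-9]/) {
            $norm .= lc $ch;
            push @offset, $i;
        }
        elsif ($ch =~ /\s/) {
            # Collapse a whitespace run into a single underscore.
            next if $norm eq '' || substr($norm, -1) eq '_';
            $norm .= '_';
            push @offset, $i;
        }
        # Anything else (comma, apostrophe, ...) is dropped (assumed rule).
    }
    return ($norm, \@offset);
}

# Hypothetical driver: link the first whole-word match of every key.
sub link_first_matches {
    my ($text, $table) = @_;
    my ($norm, $offset) = normalize_with_offsets($text);
    my @hits;
    for my $key (keys %$table) {
        next unless length $key;
        my $pos = -1;
        while (($pos = index($norm, $key, $pos + 1)) >= 0) {
            my $end = $pos + length $key;    # one past the match in $norm
            my $before_ok = $pos == 0 || substr($norm, $pos - 1, 1) eq '_';
            my $after_ok  = $end == length($norm) || substr($norm, $end, 1) eq '_';
            next unless $before_ok && $after_ok;
            # Translate normalized positions back to original offsets.
            push @hits, [ $offset->[$pos], $offset->[$end - 1] + 1, $table->{$key} ];
            last;                            # only the first occurrence
        }
    }
    my $out    = '';
    my $cursor = 0;
    for my $hit (sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] } @hits) {
        my ($start, $stop, $href) = @$hit;
        next if $start < $cursor;            # overlaps an earlier link; skip
        $out .= substr($text, $cursor, $start - $cursor)
              . qq{<a href="$href">}
              . substr($text, $start, $stop - $start)
              . '</a>';
        $cursor = $stop;
    }
    return $out . substr($text, $cursor);
}
```

Calling link_first_matches($input, $table) on the sample data should link "This", "lines of text", and "it's full" while leaving the second "this" alone. This still does one index() scan per key, so it does not remove the 5000-key loop; it only shows that the offsets can be carried along during normalization.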

What ideas do other Monks have for solving this problem?

-Ted