Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Mass Text Replacement

by tedv (Pilgrim)
on Apr 25, 2001 at 04:17 UTC ( #75326=perlquestion: print w/ replies, xml ) Need Help??
tedv has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I have an input file that looks like this:
This is just lines of text here, and also there. Consider this human readable text; it's full of letters and punctuation.
The input file could be several hundred lines. Now also suppose I have a hash table containing entries like:
my $table = { "lines_of_text" => "foo.html", "this" => "bar.html", "its_full" => "foobar.html", }
There could potentially be maybe 5000 entries in this table. For each entry in the hash table, we want to find the first segment of input data that could map to the key and replace it with appropriate html. So this:
... just lines of text ...
turns into this:
... just <a href="foo.html">lines of text</a> ...
Of course, we link the initial This but not the this starting line 2.

I only see two ways of solving this problem, and both of them are extremely inelligant. I could either write a massive regular expression, or'ing together all of the 5000 keys, or I could search through the file one letter at a time and see if any keys started at that point. Clearly both of these solutions are unacceptable.

The trickiness in this problem comes from the fact that the hash table needs to convert many possible input formats. For example, if I had the key "abc", I should translate ABC AB'C and A,B'C but not A B C. I could apply some massive substitution to the input data set, but because some letters are deleted (like comma and apostrophy), there isn't an easy translation between character index in the original data set and character index in the translated data set.

What ideas do other Monks have for solving this problem?

-Ted

Comment on Mass Text Replacement
Select or Download Code
(Ovid - regexes as hash keys)Re: Mass Text Replacement
by Ovid (Cardinal) on Apr 25, 2001 at 04:41 UTC
    I'm not sure I like the following solution, but how about turning your hash keys into regular expressions?
    my $data = <<EOT; This is just lines of text here, and also there. Consider this human readable text; it's full of letters and punctuation. EOT my %table = ( '\b[Ll]ines of text\b' => "foo.html", '\b[Tt]his\b' => "bar.html", '\b[Ii]t\'?s full\b' => "foobar.html" ); $data =~ s/($_)/<a href="$table{$_}">$1<\/a>/g foreach keys %table; print $data;
    The above seems to work, but if you have 5000+ keys, that's going to take a huge amount of time to code and run about as fast as a one-legged dog. There's also the problem that it's easy to write inefficient regexes.

    The other idea that comes to mind is using the String::Approx module to try to match the keys to text. The problem with that is that it will be slower and more error prone :(

    tedv wrote:

    Of course, we link the initial This but not the this starting line 2.
    Why "of course"? Can you explicitly state a rule? Are you only trying to match the first occurrence of each string? If so, take off the /g on the substitution and it will work much faster.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Re: Mass Text Replacement
by dws (Chancellor) on Apr 25, 2001 at 04:54 UTC
    A bit of requirements engineering, lest we propose solutions that won't work given hitherto unstated requirements.

    If your table contains keys "foo" and "foo_bar", how should the input string "foo bar" be handled? If "foo" is matched and replaced first, then the "foo_bar" key will never match. Is this an issue? If so, how should this case be handled? Should the solution attempt to match the longest possible string first (i.e., lookahead past "foo")?

    Likewise, if your table contains keys "foo_bar" and "bar_baz", how should the input string "foo bar baz" be handled? If "foo_bar" is matched and replaced first, then "bar_baz" won't match. Is this an issue? If so, how should this case be handled?

Re (tilly) 1: Mass Text Replacement
by tilly (Archbishop) on Apr 25, 2001 at 04:56 UTC
    I have some hope that the RE engine will be able to do this optimization, but you can create a RE as I did at RE (tilly) 4: SAS log scanner that will match any key, and then do a hash lookup for the substitution. (Make appropriate modifications to allow case insensitivity, and to allow arbitrary punctuation and whitespace where the _ appears. The code there will not work exactly as shown.)

    However that way lies insanity. (Trust me. It does.)

    If you can, you really want to introduce a standard templating solution early and go with that. See, for instance, Template::Toolkit or Text::Template.

Re: Mass Text Replacement
by clintp (Curate) on Apr 25, 2001 at 04:57 UTC
    I've got an answer, but something in your description is puzzling:
    For example, if I had the key "abc", I should translate ABC AB'C and A,B'C but not A B C.
    I dunna quite understand this bit. Your sample keys are things like "lines_of_text". Are you saying that you want to match "lines,of'text" but not "lines of text"?

    The thing that springs to mind is two-parts. First, build a hash keyed like you have it above, with the proper substitution. At the same time, build a regular expression that encompasses all of your possible keys (yes, with lots of |'s). If you clarify what you're trying to match with a good example, it can probably be simplified greatly.

    Use parenthesis with the substituion operator and a /e switch. Like:

    s/(f\W*o\W*o|b\W*a\W*r|b\W*a\W*z|b\W*a\W*t) /($a=$1)=~s/\W+//g,$table{$a}/xe
    So this would match variations of "foo", "bar", etc... but look them up in %table by the original (without the non-word character) names.

    If speed is a problem, consider using eval instead of a regular expression with alternations.

Re: Mass Text Replacement
by chipmunk (Parson) on Apr 25, 2001 at 06:02 UTC
    Here's a solution I came up with. I don't know that it's elegant, and I'm not sure how efficient it is. It gets the job done, though, so you may find it helpful.

    The code works by checking each word in the string, and seeing if it could be the start of a match. In cases where the same word could start several matches, the longer matches are tried first. When a match is found, the link is inserted, and the search continues at the next word after the new link.

    Enjoy!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://75326]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (8)
As of 2014-11-27 10:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (183 votes), past polls