Thanks for the additional detail.

Is it the intention that each of these substitutions replaces one word with another word? Because the use of .* in many of the patterns means that's not what is actually happening. For example it looks like the intention is to replace the text "one two coworker three four" with the text "one two work three four", but it will actually be replaced with "one work " because the pattern \s.*work.* will match from the first space to the end of the line.
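To make the difference concrete, here is a minimal sketch. The sub names are mine, and the first pattern is only modelled on the one quoted above, with " work " assumed as its replacement text:

```perl
use strict;
use warnings;

# Pattern modelled on the one discussed above: .* is greedy and can
# cross spaces, so the match runs from the first space to end of line.
sub greedy_replace {
    my ($text) = @_;
    $text =~ s/\s.*work.*/ work /;
    return $text;
}

# Word-scoped alternative: \w* cannot cross a space, so only the word
# containing "work" is replaced.
sub word_replace {
    my ($text) = @_;
    $text =~ s/\w*work\w*/work/g;
    return $text;
}

my $line = "one two coworker three four";
print "[", greedy_replace($line), "]\n";   # [one work ]
print "[", word_replace($line),   "]\n";   # [one two work three four]
```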

Assuming that the intention is to replace one word with another word, that could look something like this:

    # substitute whole word only
    my %w1 = qw{
        going go    getting get    goes go      knew know
        trying try  tried try      told tell    coming come
        saying say  men man        women woman  took take
        lying lie   dying die      made make
    };

    # substitute on prefix
    my %w2 = qw{
        need need   talk talk   tak take   used use   using use
    };

    # substitute on substring
    my %w3 = qw{
        mean mean      work work      read read    allow allow
        gave give      bought buy     want want    hear hear
        came come      destr destroy  paid pay     selve self
        cities city    fight fight    creat create makin make
        includ include
    };

    my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
    my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
    my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;

    # then in the loop
    s/[[:punct:]]/ /g;
    tr/[0-9]//d;
    s/w(as|ere)/be/gi;
    s{$re1}{ $w1{lc $1} }g;
    s{$re2}{ $w2{lc $1} }g;
    s{$re3}{ $w3{lc $1} }g;
    print $OUT "$_\n";
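As a quick check of how the three tiers behave, here is a cut-down, self-contained sketch. It uses deliberately tiny hashes, and the sub and variable names are my own. It also uses /e so that the replacement is just the hash value, without the surrounding spaces used above:

```perl
use strict;
use warnings;

# Tiny illustrative tables, not the full lists.
my %whole  = qw{ going go men man };          # whole word only
my %prefix = qw{ talk talk tak take };        # match on prefix
my %substr = qw{ work work destr destroy };   # match anywhere in the word

# reverse sort puts longer keys before their own prefixes ("talk" before "tak").
my $re_whole  = qr{\b(@{[ join '|', reverse sort keys %whole  ]})\b}i;
my $re_prefix = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re_substr = qr{\w*?(@{[ join '|', reverse sort keys %substr ]})\w*}i;

sub normalise_words {
    my ($s) = @_;
    $s =~ s{$re_whole}{$whole{lc $1}}ge;
    $s =~ s{$re_prefix}{$prefix{lc $1}}ge;
    $s =~ s{$re_substr}{$substr{lc $1}}ge;
    return $s;
}

print normalise_words("Men going taking coworkers destroying"), "\n";
# man go take work destroy
```

Each key is compiled into one alternation per tier, so every line needs three substitutions regardless of how many words are in the tables.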

If the input is always ASCII, the initial cleanup for punctuation and digits could potentially be something like s/[^a-z ]/ /gi or, nearly equivalently, tr/a-zA-Z / /cs (the s flag additionally squeezes each run of replaced characters into a single space), unless you specifically wanted to replace "ABC123D" with the single word "ABCD" rather than the two words "ABC D". However, if the input may be Unicode, you would instead need something like s/[^\w ]/ /g, with no tr equivalent.
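For comparison, a small sketch of the three ASCII cleanup variants on that same example (the sub names are mine, purely for illustration):

```perl
use strict;
use warnings;

# Replace every non-letter, non-space character with a space.
sub clean_subst {
    my ($s) = @_;
    $s =~ s/[^a-z ]/ /gi;
    return $s;
}

# Same idea with tr: /c complements the list, /s squeezes each run of
# replaced characters down to a single space.
sub clean_tr {
    my ($s) = @_;
    $s =~ tr/a-zA-Z / /cs;
    return $s;
}

# The two-step cleanup: punctuation becomes a space, digits are deleted,
# so letters on either side of a digit run are joined.
sub clean_delete_digits {
    my ($s) = @_;
    $s =~ s/[[:punct:]]/ /g;
    $s =~ tr/0-9//d;
    return $s;
}

print "[", clean_subst("ABC123D"),         "]\n";   # [ABC   D]
print "[", clean_tr("ABC123D"),            "]\n";   # [ABC D]
print "[", clean_delete_digits("ABC123D"), "]\n";   # [ABCD]
```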

The standalone substitution for w(as|ere) should probably be two additional entries in one of the existing hashes: currently this substitution is unique in replacing a substring with another substring, so for example it will change "showered" into "shobed".
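A minimal sketch of the difference, assuming "was" and "were" are treated as whole-word entries (the hash and sub names here are illustrative):

```perl
use strict;
use warnings;

# The standalone substitution: matches "was"/"were" anywhere, even mid-word.
sub substring_fix {
    my ($s) = @_;
    $s =~ s/w(as|ere)/be/gi;
    return $s;
}

# Whole-word variant: the same two words as hash entries, anchored with \b.
my %be = qw{ was be were be };
my $re_be = qr{\b(@{[ join '|', reverse sort keys %be ]})\b}i;

sub whole_word_fix {
    my ($s) = @_;
    $s =~ s{$re_be}{$be{lc $1}}ge;
    return $s;
}

print substring_fix("it was showered"),  "\n";   # it be shobed
print whole_word_fix("it was showered"), "\n";   # it be showered
```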

It will also help a bit to move the close $IN out of the loop (though it doesn't actually seem to cause a noticeable slowdown).

The above code runs for me about five times faster than your example perl code, though as described it behaves quite differently.


In reply to Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list by hv
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
