Beefy Boxes and Bandwidth Generously Provided by pair Networks Cowboy Neal with Hat
Perl Monk, Perl Meditation
 
PerlMonks  

Special Variables and Multiple Regexp Matching

by nathaniels (Acolyte)
on Nov 18, 2013 at 00:09 UTC ( #1063022=perlquestion: print w/ replies, xml ) Need Help??
nathaniels has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I'm trying to create an array of regexp matches in context. What I have below clearly doesn't work as the special variables are only initialized to the first match. I've tried using a while loop but I keep getting stuck in infinite loops. What's the best way of doing this?

foreach my $line (@lines) { foreach ( lc ($line) =~ /\b $token \b/g ) { $before_string = $`; $after_string = $'; if ( length $before_string > 55 ) { $before_string = substr( $before_string, -55 ); } if ( length $after_string > 55 ) { $after_string = substr( $after_string, 0, 55 ); } push @contexts, [$before_string, $token, $after_string]; } }

Thanks!

UPDATE: Thank you all! I Have learned a lot!

Comment on Special Variables and Multiple Regexp Matching
Download Code
Re: Special Variables and Multiple Regexp Matching
by LanX (Abbot) on Nov 18, 2013 at 00:30 UTC
    I'm not sure what your goal is...

    We have to guess how '$token' and '$line' look like, and I have no idea why the (undemonstrated) while loop should not terminate.

    (see How (Not) To Ask A Question and How do I post a question effectively?)

    But pos will tell you where the last match paused, the rest should be easily calculated with 'length($token)' (IF token is just a simple string)

    DB<127> $str= "a b " x 4 => "a b a b a b a b " DB<128> $token="a" => "a" DB<129> print pos($str),": $1\n" while $str =~ /($token)/g 1: a 7: a 13: a 19: a

    update

    ) well except the fact that lc always creates a new string which resets the search-position. Simply lowercase once '$line' before you loop!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Special Variables and Multiple Regexp Matching
by GrandFather (Cardinal) on Nov 18, 2013 at 00:39 UTC

    As a general thing providing a focused working stand alone sample script illustrating the problem will get you better answers. Consider:

    use strict; use warnings; my @lines = split /\n/, <<LINES; token token token taken token tiki LINES chomp @lines; for my $line (@lines) { print "$line:\n"; while ($line =~ /token/gi) { print " >$`|$'<\n"; } }

    Prints:

    token token token: >| token token< >token | token< >token token |< taken token tiki: >taken | tiki<

    while (lc($line) =~ /token/g) { loops until killed because each time through the loop lc($line) is re-evaluated so the regular expression restarts, so don't do that. So long as $line isn't changed the /g makes the regular expression carry on from where it left off last time through the loop, which is what you want.

    True laziness is hard work
Re: Special Variables and Multiple Regexp Matching
by Kenosis (Priest) on Nov 18, 2013 at 01:44 UTC

    Your script, as is, produced the expected results with my data. However, using either $` or $' is deprecated, in favor of ${^PREMATCH} and ${^POSTMATCH}, respectively--but you need to use the p option to preserve a copy of the matched string:

    use strict; use warnings; use Data::Dumper; my $token = 'perl'; my @contexts; chomp( my @lines = <DATA> ); foreach my $line (@lines) { while ( $line =~ /\b $token \b/gip ) { my $before_string = ${^PREMATCH}; my $after_string = ${^POSTMATCH}; if ( length $before_string > 5 ) { $before_string = substr( $before_string, -5 ); } if ( length $after_string > 5 ) { $after_string = substr( $after_string, 0, 5 ); } push @contexts, [ $before_string, $token, $after_string ]; } } print Dumper \@contexts; __DATA__ ABCDEFGHIJKLMNOP_PERL_ABCDCEFGHIJKLOMNP 0123456789ABCDEFGHIJKLMNOP_ PERL _ABCDCEFGHIJKLOMN_ PERL _P0123456789 ABCDEFGHIJKLMNOPPERLABCD PERL CEFGHIJKLOMNP 0123456789_ PERL _0123456789

    Output:

    $VAR1 = [ [ 'MNOP_', 'perl', '_ABCD' ], [ 'LOMN_', 'perl', '_P012' ], [ 'LABCD', 'perl', 'CEFGH' ], [ '6789_', 'perl', '_0123' ] ];

    Hope this helps!

      It does! The only problem is that it's still not catching a match that happens twice in a line. So in your sample data, (thank you for supplying some as I clearly neglected to) the output data is missing the second "perl."

        Glad to hear it did!

        Of course, I don't know exactly what you're matching, but did you mean /\b$token\b/ instead of the /\b $token \b/ that you have in your original code, i.e., did you want spaces and word boundaries around the token or just word boundaries?

Re: Special Variables and Multiple Regexp Matching
by kcott (Abbot) on Nov 18, 2013 at 02:13 UTC

    G'day nathaniels,

    I thought I'd draw your attention to the caveats regarding both $` and $'. In perlvar: Variables related to regular expressions, you'll see the same warning for both of these special variables:

    "The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches."

    The documentation shows alternatives. Do note which Perl versions those alternatives are available in: e.g. @- was added in v5.6.0 while ${^PREMATCH} was added in v5.10.0. And, if you wish, you can use Benchmark to compare them.

    -- Ken

Re: Special Variables and Multiple Regexp Matching
by AnomalousMonk (Monsignor) on Nov 18, 2013 at 16:26 UTC

    Guessing a bit at what you mean by "in context", but here's another approach (Update: slightly buggy: see update below):

    >perl -wMstrict -le "my $s = 'a X bar XX c XXX d XXXX e XXXXX foo XXXXXX g'; ;; my @tokens = qw(a bar c d e foo g); my ($token) = map qr{ \b $_ \b }xms, join q{|}, @tokens; my $max = 3; ;; while ($s =~ m{ ( .{0,$max} $token) (?= (.{0,$max})) }xmsg) { print qq{'$1$2'}; } " 'a X ' ' X bar XX' 'XX c XX' 'XX d XX' 'XX e XX' 'XX foo XX' 'XX g'

    Update: This version fixes a bug in the definition of  $token that allowed  'de' to be matched as a token (or part of one, anyway), and also makes match sub-pattern extraction clearer for demonstration purposes.

    >perl -wMstrict -le "my $s = 'a X bar c XXX d XXXX e XXX de XXX foo XXXXXX g h'; ;; my @tokens = qw(a bar c d e foo g h); my ($token) = map qr{ \b (?: $_) \b }xms, join q{|}, @tokens; my $pre_max = 3; my $post_max = 3; my $pre = qr{ .{0,$pre_max}? }xms; ## fixed -- see update below my $post = qr{ .{0,$post_max} }xms; ;; while ($s =~ m{ ($pre) ($token) (?= ($post)) }xmsg) { my ($before, $tok, $after) = ($1, $2, $3); print qq{'$before$tok$after' :$before:$tok:$after:}; } " 'a X ' ::a: X : ' X bar c ' : X :bar: c : ' c XX' : :c: XX: 'XX d XX' :XX :d: XX: 'XX e XX' :XX :e: XX: 'XX foo XX' :XX :foo: XX: 'XX g h' :XX :g: h: ' h' : :h::

    Update: Another bug: The regex
        my $pre = qr{ .{0,$pre_max} }xms;
    misses token 'c' in  'b c' (for tokens 'b' and 'c') if 'c' is within the 'span' of context characters. Replace with
        my $pre = qr{ .{0,$pre_max}? }xms;

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1063022]
Approved by muba
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (15)
As of 2014-04-16 16:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (432 votes), past polls