Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Special Variables and Multiple Regexp Matching

by nathaniels (Acolyte)
on Nov 18, 2013 at 00:09 UTC ( #1063022=perlquestion: print w/replies, xml ) Need Help??
nathaniels has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I'm trying to create an array of regexp matches in context. What I have below clearly doesn't work as the special variables are only initialized to the first match. I've tried using a while loop but I keep getting stuck in infinite loops. What's the best way of doing this?

foreach my $line (@lines) { foreach ( lc ($line) =~ /\b $token \b/g ) { $before_string = $`; $after_string = $'; if ( length $before_string > 55 ) { $before_string = substr( $before_string, -55 ); } if ( length $after_string > 55 ) { $after_string = substr( $after_string, 0, 55 ); } push @contexts, [$before_string, $token, $after_string]; } }


UPDATE: Thank you all! I Have learned a lot!

Replies are listed 'Best First'.
Re: Special Variables and Multiple Regexp Matching
by GrandFather (Sage) on Nov 18, 2013 at 00:39 UTC

    As a general thing providing a focused working stand alone sample script illustrating the problem will get you better answers. Consider:

    use strict; use warnings; my @lines = split /\n/, <<LINES; token token token taken token tiki LINES chomp @lines; for my $line (@lines) { print "$line:\n"; while ($line =~ /token/gi) { print " >$`|$'<\n"; } }


    token token token: >| token token< >token | token< >token token |< taken token tiki: >taken | tiki<

    while (lc($line) =~ /token/g) { loops until killed because each time through the loop lc($line) is re-evaluated so the regular expression restarts, so don't do that. So long as $line isn't changed the /g makes the regular expression carry on from where it left off last time through the loop, which is what you want.

    True laziness is hard work
Re: Special Variables and Multiple Regexp Matching
by Kenosis (Priest) on Nov 18, 2013 at 01:44 UTC

    Your script, as is, produced the expected results with my data. However, using either $` or $' is deprecated, in favor of ${^PREMATCH} and ${^POSTMATCH}, respectively--but you need to use the p option to preserve a copy of the matched string:

    use strict; use warnings; use Data::Dumper; my $token = 'perl'; my @contexts; chomp( my @lines = <DATA> ); foreach my $line (@lines) { while ( $line =~ /\b $token \b/gip ) { my $before_string = ${^PREMATCH}; my $after_string = ${^POSTMATCH}; if ( length $before_string > 5 ) { $before_string = substr( $before_string, -5 ); } if ( length $after_string > 5 ) { $after_string = substr( $after_string, 0, 5 ); } push @contexts, [ $before_string, $token, $after_string ]; } } print Dumper \@contexts; __DATA__ ABCDEFGHIJKLMNOP_PERL_ABCDCEFGHIJKLOMNP 0123456789ABCDEFGHIJKLMNOP_ PERL _ABCDCEFGHIJKLOMN_ PERL _P0123456789 ABCDEFGHIJKLMNOPPERLABCD PERL CEFGHIJKLOMNP 0123456789_ PERL _0123456789


    $VAR1 = [ [ 'MNOP_', 'perl', '_ABCD' ], [ 'LOMN_', 'perl', '_P012' ], [ 'LABCD', 'perl', 'CEFGH' ], [ '6789_', 'perl', '_0123' ] ];

    Hope this helps!

      It does! The only problem is that it's still not catching a match that happens twice in a line. So in your sample data, (thank you for supplying some as I clearly neglected to) the output data is missing the second "perl."

        Glad to hear it did!

        Of course, I don't know exactly what you're matching, but did you mean /\b$token\b/ instead of the /\b $token \b/ that you have in your original code, i.e., did you want spaces and word boundaries around the token or just word boundaries?

Re: Special Variables and Multiple Regexp Matching
by kcott (Chancellor) on Nov 18, 2013 at 02:13 UTC

    G'day nathaniels,

    I thought I'd draw your attention to the caveats regarding both $` and $'. In perlvar: Variables related to regular expressions, you'll see the same warning for both of these special variables:

    "The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches."

    The documentation shows alternatives. Do note which Perl versions those alternatives are available in: e.g. @- was added in v5.6.0 while ${^PREMATCH} was added in v5.10.0. And, if you wish, you can use Benchmark to compare them.

    -- Ken

Re: Special Variables and Multiple Regexp Matching
by LanX (Bishop) on Nov 18, 2013 at 00:30 UTC
    I'm not sure what your goal is...

    We have to guess how '$token' and '$line' look like, and I have no idea why the (undemonstrated) while loop should not terminate.

    (see How (Not) To Ask A Question and How do I post a question effectively?)

    But pos will tell you where the last match paused, the rest should be easily calculated with 'length($token)' (IF token is just a simple string)

    DB<127> $str= "a b " x 4 => "a b a b a b a b " DB<128> $token="a" => "a" DB<129> print pos($str),": $1\n" while $str =~ /($token)/g 1: a 7: a 13: a 19: a


    ) well except the fact that lc always creates a new string which resets the search-position. Simply lowercase once '$line' before you loop!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Special Variables and Multiple Regexp Matching
by AnomalousMonk (Chancellor) on Nov 18, 2013 at 16:26 UTC

    Guessing a bit at what you mean by "in context", but here's another approach (Update: slightly buggy: see update below):

    >perl -wMstrict -le "my $s = 'a X bar XX c XXX d XXXX e XXXXX foo XXXXXX g'; ;; my @tokens = qw(a bar c d e foo g); my ($token) = map qr{ \b $_ \b }xms, join q{|}, @tokens; my $max = 3; ;; while ($s =~ m{ ( .{0,$max} $token) (?= (.{0,$max})) }xmsg) { print qq{'$1$2'}; } " 'a X ' ' X bar XX' 'XX c XX' 'XX d XX' 'XX e XX' 'XX foo XX' 'XX g'

    Update: This version fixes a bug in the definition of  $token that allowed  'de' to be matched as a token (or part of one, anyway), and also makes match sub-pattern extraction clearer for demonstration purposes.

    >perl -wMstrict -le "my $s = 'a X bar c XXX d XXXX e XXX de XXX foo XXXXXX g h'; ;; my @tokens = qw(a bar c d e foo g h); my ($token) = map qr{ \b (?: $_) \b }xms, join q{|}, @tokens; my $pre_max = 3; my $post_max = 3; my $pre = qr{ .{0,$pre_max}? }xms; ## fixed -- see update below my $post = qr{ .{0,$post_max} }xms; ;; while ($s =~ m{ ($pre) ($token) (?= ($post)) }xmsg) { my ($before, $tok, $after) = ($1, $2, $3); print qq{'$before$tok$after' :$before:$tok:$after:}; } " 'a X ' ::a: X : ' X bar c ' : X :bar: c : ' c XX' : :c: XX: 'XX d XX' :XX :d: XX: 'XX e XX' :XX :e: XX: 'XX foo XX' :XX :foo: XX: 'XX g h' :XX :g: h: ' h' : :h::

    Update: Another bug: The regex
        my $pre = qr{ .{0,$pre_max} }xms;
    misses token 'c' in  'b c' (for tokens 'b' and 'c') if 'c' is within the 'span' of context characters. Replace with
        my $pre = qr{ .{0,$pre_max}? }xms;

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1063022]
Approved by muba
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2018-06-18 04:40 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (107 votes). Check out past polls.