PerlMonks  

pattern matching (greedy, non-greedy,...)

by cacophony777 (Initiate)
on Dec 17, 2009 at 00:05 UTC (#813103=perlquestion)
cacophony777 has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I've been trying to write a script to generate some useful output based on a log file, and I ran into the issue mentioned in this thread.

http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg507.html

Specifically, the problem I'm trying to solve involves matching a group of log lines.

For example:

    KEY
    blah blahblah
    KEY
    blah ah other random stuff
    KEY
    blah ha other random stuff
    PATTERN
    asdf
    KEY
    fdas
    PATTERN

I want to match each PATTERN, but have the match also include the most recent preceding KEY (and everything in between). So these are the two matches I'm interested in:

KEY blah ha other random stuff PATTERN

KEY fdas PATTERN

If I do something like KEY.*PATTERN the entire contents get the match, and if I do KEY.*?PATTERN it matches everything from the first KEY to the first PATTERN. I've also tried .*KEY.+?PATTERN which just matches the very last group.
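The three attempts can be reproduced with a small sketch. The sample string and the `try_patterns` helper are assumptions made up for illustration; capture groups are added so that the leading `.*` of the third pattern is not part of the reported text.

```perl
use strict;
use warnings;

# Collect what each of the three attempted patterns captures from a string.
sub try_patterns {
    my ($log) = @_;
    my @greedy = $log =~ /(KEY.*PATTERN)/g;     # first KEY through the last PATTERN
    my @lazy   = $log =~ /(KEY.*?PATTERN)/g;    # earliest remaining KEY to the nearest PATTERN
    my @last   = $log =~ /.*(KEY.+?PATTERN)/g;  # greedy .* skips ahead to the last KEY
    return (greedy => \@greedy, lazy => \@lazy, last => \@last);
}

my %result = try_patterns('KEY a KEY b PATTERN c KEY d PATTERN');
for my $name (qw(greedy lazy last)) {
    print "$name: ", join(' | ', @{ $result{$name} }), "\n";
}
# greedy: KEY a KEY b PATTERN c KEY d PATTERN
# lazy: KEY a KEY b PATTERN | KEY d PATTERN
# last: KEY d PATTERN
```

None of the three yields both 'KEY b PATTERN' and 'KEY d PATTERN', which is the result being asked for.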

If you've got any insight it would be greatly appreciated.

Thanks.

Re: pattern matching (greedy, non-greedy,...)
by AnomalousMonk (Monsignor) on Dec 17, 2009 at 00:44 UTC

    Here's an approach to use if you have all the 'log lines' as a single (possibly quite long) string, as the mention of  KEY.*PATTERN in your OP suggests:

    >perl -wMstrict -le
    "my $s = 'KE a KE bb ccc KE ddd PA ee KE fff xx PA gg KEPA h';
     my $start     = qr{ KE }xms;
     my $not_start = qr{ (?! $start) . }xms;
     my $stop      = qr{ PA }xms;
     my $chunk     = qr{ $start $not_start* $stop }xms;
     print qq{'$s'};
     print map qq{'$_' }, $s =~ m{ $chunk }xmsg;
    "
    'KE a KE bb ccc KE ddd PA ee KE fff xx PA gg KEPA h'
    'KE ddd PA' 'KE fff xx PA' 'KEPA'

    This won't work if you are processing the file line-by-line. I'm working on that as, no doubt, are others.

    Update: ... like toolic.

    It should be mentioned that if you are processing a multi-line file slurp, the  $start and  $stop regexes should be something like  qr{ ^ KE $ }xms and  qr{ ^ PA $ }xms respectively; note the  ^ $ embedded newline anchor metacharacters.
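A minimal sketch of the slurped, multi-line case, assuming the markers are the literal lines KEY and PATTERN. The `chunks` helper and the sample text are illustrative only, and the lazy `*?` quantifier is used here rather than the plain `*` of the one-liner above.

```perl
use strict;
use warnings;

# Return every "nearest KEY line through the next PATTERN line" chunk.
sub chunks {
    my ($text) = @_;
    my $start     = qr{ ^ KEY $ }xms;         # a line consisting only of KEY
    my $stop      = qr{ ^ PATTERN $ }xms;     # a line consisting only of PATTERN
    my $not_start = qr{ (?! $start) . }xms;   # any character that does not begin a start line
    my @found = $text =~ m{ ( $start $not_start*? $stop ) }xmsg;
    return @found;
}

# Hypothetical slurped file contents.
my $text = "KEY\nfirst\nKEY\nsecond\nPATTERN\nnoise\nKEY\nthird\nPATTERN\n";
print "---\n$_\n" for chunks($text);
```

Because of the `^` and `$` anchors under `/m`, a line such as `KEYNOTE` would not count as a start marker in this sketch.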

      Wow, thanks!

      Processing the entire file at once is fine for what I'm trying to do.

      Here's what I had written so far (I just started with Perl so be gentle):

      open (IN, 'input.txt') or die "$!";
      my $lines = do {local $/; <IN>};
      close IN;
      while ($lines =~ s/Key.+?value=(\d+).+?Screen:add.+?value=(\d+).+?Xml:sendRequest.+?value=(\d+).+?Xml:onResponse.+?value=(\d+).+?Xml:processing.+?value=(\d+)//s){
          # then I would use $1 - $5
      }

      I'm not sure yet how to incorporate your solution into what I have, but perhaps I should do some more reading.

      Also, to clarify, the file has multiple lines but the KEY and PATTERN values don't fall on their own line as my original example illustrates. I made it a bit too simplistic. It looks more like:

      BLAH BLAH BLAH KEY blah blah blah BlAH BLAH BLAH ABD KEY blah blah asdf asdf asdf asdf BLAH ASDF PATTERN blah blah

        It looks like you are using a  s/// substitution to repeatedly search from the very start of the string and then snip out already-processed substrings so that you don't encounter them again. It would be so much easier (and faster, if the string/file is huge) to use the  /g modifier on a  m// match and deal with each sub-string as it is found. See Modifiers in perlre, also see perlretut, perlrequick.
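As a rough sketch of that suggestion, the loop below walks a string with `m//g` in scalar context instead of deleting matched text with `s///`. The sample data, the shortened two-field regex, and the `collect_pairs` name are assumptions for illustration, not the poster's real log format.

```perl
use strict;
use warnings;

# Collect the two captured numbers from every record in the text.
sub collect_pairs {
    my ($text) = @_;
    my @pairs;
    # With /g in scalar context, each iteration resumes where the previous
    # match ended, so the string is never modified. /s lets . cross newlines.
    while ($text =~ /Key.+?value=(\d+).+?Screen:add.+?value=(\d+)/gs) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# Hypothetical log excerpt.
my $lines = "Key value=1 Screen:add value=2\nKey value=3 Screen:add value=4\n";
print "$_->[0] $_->[1]\n" for collect_pairs($lines);
# 1 2
# 3 4
```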

        A little whitespace and formatting never hurts, either. See the  /x modifier in the references above.

        Another suggestion is to factor out sub-patterns with a collection of  qr// regex object definitions (see references above). As with code in general, such factoring allows you to better understand and control the final regex. An example of such factoring is in the code of my original reply.
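One way such factoring might look, as a sketch only: the `field` and `parse_records` helpers, the labels, and the sample line are all made up for illustration. Capture groups inside interpolated `qr//` objects are numbered left to right in the final regex, so `$1` and `$2` still work.

```perl
use strict;
use warnings;

my $gap   = qr{ .+? }xms;            # lazily skip intervening text
my $value = qr{ value= (\d+) }xms;   # capture the digits after "value="

# Build "LABEL ... value=N" for a literal label (illustrative helper).
sub field {
    my ($label) = @_;
    return qr{ \Q$label\E $gap $value }xms;
}

my $key_field    = field('Key');
my $screen_field = field('Screen:add');
my $record       = qr{ $key_field $gap $screen_field }xms;

# Return one [key value, screen value] pair per record found.
sub parse_records {
    my ($text) = @_;
    my @records;
    while ($text =~ m{$record}g) {
        push @records, [ $1, $2 ];
    }
    return @records;
}

print "@$_\n" for parse_records('Key id value=10 Screen:add x value=20');
# 10 20
```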

        OTOH, since it looks like you may be trying to parse XML, the best advice might be to not use regexes at all; use one of the many fine XML parser modules from CPAN: see XML::Parser (others will be better able than I to advise you on this).

        ... the file has multiple lines but the KEY and PATTERN values don't fall on their own line as my original example illustrates.

        No matter. Just don't use the  ^ $ embedded newline anchors at the beginnings and ends of your start and stop patterns. (Of course, they can still be used elsewhere.) See discussions of the  /m regex modifier (Modifiers) in perlre and other cited refs. The example string in Re: pattern matching (greedy, non-greedy,...) has no newlines in it at all, anywhere!

        Update: Oops. This reply would have been better as an update to Re^3: pattern matching (greedy, non-greedy,...).
Re: pattern matching (greedy, non-greedy,...)
by toolic (Chancellor) on Dec 17, 2009 at 00:51 UTC
    Here is a solution which avoids the greediness issue by using a state variable and a line buffer:
    use warnings;
    use strict;

    my $flag = 0;
    my @lines;
    while (<DATA>) {
        if (/KEY/) {
            @lines = ();
            $flag  = 1;
        }
        if ($flag) {
            push @lines, $_;
            if (/PATTERN/) {
                print @lines;
                $flag = 0;
            }
        }
    }
    __DATA__
    KEY
    blah blahblah
    KEY
    blah ah other random stuff
    KEY
    blah ha other random stuff
    PATTERN
    asdf
    KEY
    fdas
    PATTERN

    prints:

    KEY
    blah ha other random stuff
    PATTERN
    KEY
    fdas
    PATTERN
      I was gonna suggest using the range op, but it doesn't really help:

      my @lines;
      while (<DATA>) {
          my $is_key = /KEY/;
          @lines = () if $is_key;
          if (my $in_range = $is_key .. /PATTERN/) {
              push @lines, $_;
              print @lines if $in_range =~ /E0\z/;
          }
      }

      If one continues to simplify the above, one gets the parent's code.
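The `/E0\z/` test relies on what the range operator returns in scalar context: the empty string while outside the range, a sequence number while inside it, and the final sequence number with "E0" appended on the line that closes the range. A small sketch to show those values; the `range_values` helper and the input list are made up for illustration.

```perl
use strict;
use warnings;

# Return the scalar-context value of the range operator for each input line.
sub range_values {
    my @input = @_;
    my @values;
    for (@input) {
        my $in_range = /KEY/ .. /PATTERN/;   # flip-flop, tested against $_
        push @values, $in_range;
    }
    return @values;
}

print join(',', map { "[$_]" } range_values('x', 'KEY', 'a', 'PATTERN', 'y')), "\n";
# [],[1],[2],[3E0],[]
```

The same property makes a comparison such as `$hit ne 0+$hit` true only on the closing line, since "3E0" and 3 are numerically equal but differ as strings.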

Re: pattern matching (greedy, non-greedy,...)
by Skeeve (Vicar) on Dec 17, 2009 at 09:57 UTC

    I managed to use the range operator for this:

    #!/bin/env perl
    use strict;
    use warnings;

    my $start= qr/^KEY$/;
    my $end  = qr/^PATTERN$/;
    my $match;
    while (<DATA>) {
        if (my $hit= /$start/ .. /$end/) {
            $match= '' if /$start/;
            $match.= $_;
            print $match if $hit ne 0+$hit;
        }
    }
    __DATA__
    KEY
    blah blahblah
    KEY
    blah ah other random stuff
    KEY
    blah ha other random stuff
    PATTERN
    asdf
    KEY
    fdas
    PATTERN

    I simply reset my buffer when I find the start a second time.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: pattern matching (greedy, non-greedy,...)
by AnomalousMonk (Monsignor) on Dec 17, 2009 at 15:21 UTC

    Here's another thought on the problem. This processes on a buffered line-by-line basis and takes account of the fact that start and stop patterns may appear anywhere. The buffer is prevented from growing without limit while searching for a start pattern, but while searching for a stop pattern, the remainder of the file will be consumed if one is not found.

    A line break may appear within a start or stop pattern as long as it can be clearly specified within the pattern; a line break cannot appear at random within a pattern. If such random line breaks (really, record separators, which are usually newlines) appear, the only way I can think of to deal with them is to delete them, e.g., with a chomp, and treat the whole file as a single, unbroken line.

    The start and stop regexes in the code example are defined using literal character strings, but they may be any regex. Note the definition of the full-pattern regex has changed from
        qr{ $start $not_start* $stop }xms
    to
        qr{ $start $not_start*? $stop }xms
    (addition of  ? lazy quantifier modifier) to prevent the regex including multiple stop sequences, should they be present.
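The difference shows up only when more than one stop sequence follows a start. A minimal sketch, reusing the KE/PA markers from the earlier one-liner; the `chunks_with` helper and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;

my $start     = qr{ KE }xms;
my $stop      = qr{ PA }xms;
my $not_start = qr{ (?! $start) . }xms;

my $greedy = qr{ $start $not_start*  $stop }xms;   # runs on to the last stop before the next start
my $lazy   = qr{ $start $not_start*? $stop }xms;   # ends at the first stop after the start

# Return every chunk that the given regex matches in the string.
sub chunks_with {
    my ($re, $s) = @_;
    my @found = $s =~ m{ ($re) }xmsg;
    return @found;
}

my $s = 'KE a PA b PA c KE d PA';
print 'greedy: ', join(' | ', chunks_with($greedy, $s)), "\n";
print 'lazy:   ', join(' | ', chunks_with($lazy,   $s)), "\n";
# greedy: KE a PA b PA | KE d PA
# lazy:   KE a PA | KE d PA
```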

    Output:

    'STARTblah haother random stuffSTOP' 'STARTSTOP' 'STARTfdasSTOP' 'START yesSTOP' 'START ohyesSTOP' 'STARTSTOP' 'STARTyes yes yes STOP' 'STARTSTOP' 'START yes STOP' 'START yes1 STOP' 'START yes2 STOP' 'START yes3 STOP' 'START yes4yes5 STOP'

Node Type: perlquestion [id://813103]
Approved by GrandFather