regex: extract multiple number of date patterns from certain lines

by Random_Walk (Parson)
on Mar 04, 2009 at 15:49 UTC (#748204, perlquestion)
Random_Walk has asked for the wisdom of the Perl Monks concerning the following question:

I have a large and busy log file, once or twice a day a line will pop up looking something like one of these:

    2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
    2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
    2009-02-19 05:58:29,138 dates processed: 2009-02-18

I need to extract all occurrences of the date pattern /\d{4}-\d{2}-\d{2}/, but only when the line also contains 'dates processed'. The trailing list of dates is of variable length, but there will always be at least one.

Due to the log monitoring tool we have, I need to grab this in a single regex. The log file is on a busy production server, so some degree of efficiency is desired. So far, even before optimising, I have not got it to work; here are (some of) my attempts:

    @res =$_ =~/(\d{4}-\d\d-\d\d)/g
    # gets them all but also from lines without 'dates processed'

    @res = $_ =~/(\d{4}-\d\d-\d\d).*dates processed: (\d{4}-\d\d-\d\d,? ?)*/g
    # only returns first of trailing list

    @res = $_ =~/(\d{4}-\d\d-\d\d).*dates processed: ((\d{4}-\d\d-\d\d)*)/
    # gets the right number of results but the final list all the same value!

Update

    @res = $_ =~/(\d{4}-\d\d-\d\d).*dates processed: ((\d{4}-\d\d-\d\d,? ?)*)
    # almost gets it but now have one too many results in the tail
    # due to the nested braces

Update 2

Highlighted that the cunning part here is that a single regex is needed; a few kind souls are missing this.

</Updates>

Can any of the regex gurus out there give me a hint, please?

Thanks,
R.

Pereant, qui ante nos nostra dixerunt!

Re: regex: extract multiple number of date patterns from certain lines
by Anonymous Monk on Mar 04, 2009 at 15:54 UTC
    There is a comma followed by digits before "dates processed":

        2009-02-19 05:58:29,138 dates processed: 2009-02-18
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        dddd-dd-dd dd:dd:dd,ddd                : dddd-dd-dd
                           ^^^^
                           ||||

      That's no worry, as I throw away the time part anyway. Currently I use a .* for brevity, but in prod code I'll add in \d\d:\d\d:\d\d,\d{3} to match that off to oblivion.

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
Re: regex: extract multiple number of date patterns from certain lines
by Bloodnok (Vicar) on Mar 04, 2009 at 15:58 UTC
        while (<DATA>) {
            @res = /(\d{4}(?:-\d\d){2}).*dates processed: (.*)/;
            warn "@res";
        }
        __DATA__
        2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
        2009-02-19 05:58:29,138 dates processed: 2009-02-18

    returns

        2009-02-02 2009-01-31, 2009-01-29, 2009-01-30 at tst.pl line 3, <DATA> line 1.
        2009-02-18 2009-02-16, 2009-02-17 at tst.pl line 3, <DATA> line 2.
        2009-02-19 2009-02-18 at tst.pl line 3, <DATA> line 3.

    Update:

    Following a change in requirements ;-) ...

        use Data::Dumper;
        while (<DATA>) {
            @res = map { split } /(\d{4}(?:-\d\d){2}).*dates processed: (.*)/;
            warn Dumper \@res;
        }
        __DATA__
        2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
        2009-02-19 05:58:29,138 dates processed: 2009-02-18

    returns

        $VAR1 = [ '2009-02-02', '2009-01-31,', '2009-01-29,', '2009-01-30' ];
        $VAR1 = [ '2009-02-18', '2009-02-16,', '2009-02-17' ];
        $VAR1 = [ '2009-02-19', '2009-02-18' ];
    as required (nearly:-D) ??

    A user level that continues to overstate my experience :-))

      Nice but I did want to bust all the date values out to separate elements of the @res array. Your non capturing braces though give me the clue I think to fix it properly, but I still can't quite get it:

          @res = $_ =~/(\d{4}-\d\d-\d\d).*dates processed: ((:?\d{4}-\d\d-\d\d,? ?)*)/
          # still captures last result twice
          # input
          # 2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
          # output
          # 2009-02-02, 2009-01-31, 2009-01-29, 2009-01-30, 2009-01-30

      Update

      Oops, I am not splitting them with the above either. I fooled myself because my debug testing printed the list out with a join ", ", doh!

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
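
A note on the attempt above: it writes "(:?" where "(?:" was presumably intended. "(:?" opens an ordinary capturing group that begins with an optional colon, and a capturing group under "*" keeps only its last iteration, which would explain the duplicated final date. A minimal sketch of the difference, using the sample line from the thread (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line = '2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30';

# As posted: "(:?" is a capturing group whose first atom is an optional colon.
# Under "*" that group keeps only its final iteration, hence the extra date.
my @typo = $line =~ /(\d{4}-\d\d-\d\d).*dates processed: ((:?\d{4}-\d\d-\d\d,? ?)*)/;

# With "(?:" the inner group is non-capturing, so only two captures remain:
# the timestamp date and the whole trailing list as one string.
my @fixed = $line =~ /(\d{4}-\d\d-\d\d).*dates processed: ((?:\d{4}-\d\d-\d\d,? ?)*)/;

print scalar(@typo),  " captures with the typo\n";
print scalar(@fixed), " captures with the non-capturing group\n";
print "[$_]\n" for @fixed;
```

Even with the typo fixed, the trailing dates still arrive as a single string, which matches the observation in the update above.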
Re: regex: extract multiple number of date patterns from certain lines
by Anonymous Monk on Mar 04, 2009 at 16:00 UTC
    Given your example input, what exactly should @res contain?

      All the date values from the line, preferably each as a separate element of the array and in the same order they occurred on the line:

          2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
          @res = (2009-02-02, 2009-01-31, 2009-01-29, 2009-01-30)

          2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
          @res = (2009-02-18, 2009-02-16, 2009-02-17)

          2009-02-19 05:58:29,138 dates processed: 2009-02-18
          @res = (2009-02-19, 2009-02-18)

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
Re: regex: extract multiple number of date patterns from certain lines
by ikegami (Pope) on Mar 04, 2009 at 16:09 UTC
    The match operator is not particularly well suited to extract this data since the data has two dimensions. One solution:
        while (
            /
                ^
                (\d{4}-\d\d-\d\d) .* dates[ ]processed:[ ]
                ( (?:\d{4}-\d\d-\d\d,[ ])* \d{4}-\d\d-\d\d )
                $
            /xmg    # /x ignores pattern whitespace, so literal spaces are written [ ]
        ) {
            my $on        = $1;
            my $processed = $2;
            my @processed = split(/, /, $processed);
            # Do something with $on and @processed.
        }

    Or if you are dealing with a file handle,

        while (<$fh>) {
            my ($on, $processed) = /
                ^
                (\d{4}-\d\d-\d\d) .* dates[ ]processed:[ ]
                ( (?:\d{4}-\d\d-\d\d,[ ])* \d{4}-\d\d-\d\d )
                $
            /x or next;    # /x ignores pattern whitespace; skip non-matching lines
            my @processed = split(/, /, $processed);
            # Do something with $on and @processed.
        }

    Update: Added file handle version since that's probably what the OP really wants.

      The match operator is not particularly well suited to extract this data since the data has two dimensions.

      Due to the different structure of captures in Perl 6 regexes that doesn't hold true for Perl 6 anymore. Here's a Perl 6 solution that extracts all trailing dates with one regex match:

          use v6;

          my $str = '2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
          2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
          2009-02-19 05:58:29,138 dates processed: 2009-02-18
          ';

          token date { \d**4 '-' \d**2 '-' \d ** 2 };
          regex line { ^^ \N* 'processed:' \s* <date> [','\s* <date>]* \s* \n };

          if $str ~~ m/ ^ <line>+ / {
              for $<line> -> $l {
                  print "Dates in line $l";
                  .say for $l<date>;
              }
          } else {
              say "no match";
          }

      (tested on Rakudo).

        Very nice.

        Unfortunately, I only get to upgrade production to 5.10 in two weeks' time; 6 is going to have to wait a few more weeks, I guess.

        Cheers,
        R.

        Pereant, qui ante nos nostra dixerunt!

      Good $localtime ikegami++ sir,

      I am actually dealing with an existing code base that uses POE::Wheel::FollowTail and checks each new log line against a list of pre-compiled regex patterns, hence the desire to do it in a single regex. When it finds a match it calls the forwarder method on the object that is associated with the matching pattern.

      Among other refs passed to the forwarding object is one to a list of matches from the regex, which normally saves having to split it up all over again. I do get a second bite at the cherry in the object's forwarder method. It would have been nice, though, after matching all those dates, if I could just pass them all through already separated.

      Thanks for looking; at least I now know it is not me making a trivial error.

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        I don't see anything in POE::Wheel::FollowTail about regexps, so I presume it's not a limitation of that module. Why can't your check list contain both regexps and code refs?
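
The suggestion above (a check list holding both regexps and code refs) could look roughly like the following. This is only a sketch under assumptions: the structure of @checks, the "post" key, the match_line helper and the ERROR pattern are all hypothetical and are not taken from the OP's code base or from POE::Wheel::FollowTail.

```perl
use strict;
use warnings;

# Hypothetical check list: each entry pairs a pre-compiled pattern with an
# optional code ref that post-processes the captures before forwarding.
my @checks = (
    {
        re   => qr/^(\d{4}-\d\d-\d\d) .*dates processed: (.+)$/,
        post => sub { my ($on, $list) = @_; return ($on, split /, /, $list) },
    },
    {
        re => qr/ERROR: (.*)/,    # no post step: captures are forwarded as-is
    },
);

# Return the (possibly post-processed) captures of the first matching check,
# or an empty list if nothing matches.
sub match_line {
    my ($line) = @_;
    for my $check (@checks) {
        my @caps = $line =~ $check->{re} or next;
        return $check->{post} ? $check->{post}->(@caps) : @caps;
    }
    return;
}

print join('|', match_line('2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17')), "\n";
```

The regex still gets exactly one shot at each line; the code ref only reshapes what was captured, so the dates reach the forwarder already separated.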
Re: regex: extract multiple number of date patterns from certain lines
by johngg (Abbot) on Mar 04, 2009 at 16:48 UTC

    Does this code produce the results you need? For any line that doesn't match "dates processed:" it pushes an empty array reference onto the @results array just so that you can tell there was a line that didn't match.

        use strict;
        use warnings;

        use Data::Dumper;

        open my $logFH, q{<}, \ <<'EOD' or die qq{open: << HEREDOC: $!\n};
        2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
        Different line here
        2009-02-19 05:58:29,138 dates processed: 2009-02-18
        EOD

        my @results = ();
        while( <$logFH> )
        {
            chomp;
            push( @results, [] ), next unless m{dates processed:};
            my @dates = m{(\d{4}-\d\d-\d\d)}g;
            push @results, \ @dates;
        }

        close $logFH or die qq{close: << HEREDOC: $!\n};

        print Data::Dumper->Dumpxs( [ \ @results ], [ qw{ *results } ] );

    The output.

        @results = (
            [ '2009-02-02', '2009-01-31', '2009-01-29', '2009-01-30' ],
            [ '2009-02-18', '2009-02-16', '2009-02-17' ],
            [],
            [ '2009-02-19', '2009-02-18' ]
        );

    I hope this is useful to you.

    Cheers,

    JohnGG

Re: regex: extract multiple number of date patterns from certain lines
by Marshall (Prior) on Mar 04, 2009 at 19:00 UTC
    I don't know what this first date is or whether you need it. I called it $stamp. I think this does what you want. If you don't need $stamp, then just assign it to undef.
        #!/usr/bin/perl -w
        use strict;

        while (<DATA>)
        {
            next if (!/dates processed/);
            my ($stamp, @dates) = ($_ =~ /(\d+-\d+-\d+)/g);
            # or my (undef, @dates) = ($_ =~ /(\d+-\d+-\d+)/g);
            # of course if that is what you want then change
            # the following line too!
            print "stamp=$stamp, dates are: @dates","\n";
        }
        #prints......
        #stamp=2009-02-19, dates are: 2009-01-31 2009-01-29 2009-01-30
        #stamp=2009-02-18, dates are: 2009-02-16 2009-02-17
        #stamp=2009-02-19, dates are: 2009-02-18

        __DATA__
        2009-02-19 06:03:47,713 SOMETHING WRONG: 2009-01-33, 2009-01-44, 2009-01-33
        2009-02-19 05 58 29 138 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
        2009-02-19 05:58:29,138 dates processed: 2009-02-18

      Hi Marshall

      I do need the initial date. This is for a logfile parser that captures these dates and sends them up the line to a monitoring application, which then compares the timestamp date to the processed dates and raises an alarm if a processed date is too old.

      The crux of the matter is that my log file parser has one shot at each log line with a regex, and the matched parts are then passed up to the next stage. I want to get as much done in the regex as possible/reasonable, partly on the principle of keeping monitoring close to the monitored and partly for pure bloody-minded IT geek fun.

      Sadly, the main constraint of this problem is one line of regex; code is cheating!

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        So, if I understand this correctly, you are saying that my code works, but there is some constraint that it has to be in one single regex? If that's the case, then we are into some obfuscated code problem and this is perhaps the wrong place?

        If we are talking about clarity and performance, then that's different. Fewer lines of Perl code doesn't always equal faster performance. I simplified details such as "must match exactly 4 times" (\d{4} became \d+), etc. This speeds up the regex engine. As far as clarity goes, I would struggle to be more clear (I'm not a guru).

        If you are interested in performance, then measure and test performance (run benchmarks).

        Counting the number of lines of source code is a relatively poor predictor of actual code performance.

        Update: Well, it took just a few seconds to get a negative vote on this post. I was genuinely trying to help with the original problem. I don't understand this requirement for "one line". I think that benchmarking and testing is the right way to go. I would be happy to help in this regard.
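
The benchmarking suggested above can be sketched with the core Benchmark module. The two subs below are illustrative stand-ins for approaches discussed in this thread (a cheap filter followed by a /g match, versus one capture followed by a split); the sample lines, the sub names and the repeat count are assumptions, and the timings will depend on the machine and the real log mix.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample: one matching line and one unrelated line, repeated.
my @lines = (
    '2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30',
    '2009-02-18 06:03:47,713 some other busy log message',
) x 500;

# Filter first, then pull every date out with a /g match.
sub two_step {
    my ($line) = @_;
    return unless $line =~ /dates processed/;
    return $line =~ /(\d{4}-\d\d-\d\d)/g;
}

# One match capturing the stamp and the raw list, then split the list.
sub capture_then_split {
    my ($line) = @_;
    my ($on, $list) = $line =~ /^(\d{4}-\d\d-\d\d).*dates processed: (.*)/ or return;
    return ($on, split /, /, $list);
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    two_step           => sub { for my $l (@lines) { my @d = two_step($l) } },
    capture_then_split => sub { for my $l (@lines) { my @d = capture_then_split($l) } },
});
```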

Re: regex: extract multiple number of date patterns from certain lines
by GrandFather (Cardinal) on Mar 04, 2009 at 22:15 UTC

    Can you use the trailing ,\d on the initial date/time to disambiguate? Consider:

        use strict;
        use warnings;

        while (my $line = <DATA>) {
            my @parts = $line =~ /(\d{4}-\d{2}-\d{2})(?:,\s|$)/g;

            print $line;
            print " ", join ("\n ", @parts), "\n";
        }

        __DATA__
        2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
        2009-02-19 05:58:29,138 dates processed: 2009-02-18

    Prints:

        2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30
         2009-01-31
         2009-01-29
         2009-01-30
        2009-02-18 06:03:47,713 dates processed: 2009-02-16, 2009-02-17
         2009-02-16
         2009-02-17
        2009-02-19 05:58:29,138 dates processed: 2009-02-18
         2009-02-18

    True laziness is hard work
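
One further possibility, not proposed by anyone in the thread, is a single pattern that uses a lookahead plus \G so that a list-context /g match returns every date (timestamp included) only on lines containing the marker. This is a sketch under the assumption that the monitoring code can apply the pattern with /g in list context; the names $dates_re and extract_dates are illustrative.

```perl
use strict;
use warnings;

# The first match must start at the beginning of a line that contains the
# marker (checked by the lookahead). Every later match must resume exactly
# where the previous one ended (\G), so a line without the marker never
# gets a first match and therefore yields nothing.
my $dates_re = qr/(?:^(?=.*dates processed:)|(?!^)\G).*?(\d{4}-\d\d-\d\d)/;

sub extract_dates {
    my ($line) = @_;
    return $line =~ /$dates_re/g;
}

my @samples = (
    '2009-02-02 06:12:57,500 dates processed: 2009-01-31, 2009-01-29, 2009-01-30',
    '2009-02-19 06:03:47,713 SOMETHING WRONG: 2009-01-33, 2009-01-44',
    '2009-02-19 05:58:29,138 dates processed: 2009-02-18',
);

for my $line (@samples) {
    my @res = extract_dates($line);
    print scalar(@res), " date(s): @res\n";
}
```

Because /g cannot be stored inside a qr// object, this only helps if the code that applies the pre-compiled pattern adds /g itself; otherwise a post-processing step is still needed.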

Node Type: perlquestion [id://748204]
Approved by Bloodnok