PerlMonks  

Parsing arbitrarily-formatted timestamps out of log file entries

by mr_flea (Novice)
on Dec 11, 2009 at 22:12 UTC (#812454=perlquestion)
mr_flea has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I recently embarked on a journey to parse log files of arbitrary format. I've mostly got it down now, except for the timestamp format. The only information accessible to the script regarding the timestamp format is the following:

  • The format of the date as passed to strftime. (Such as %H:%M:%S)
  • The date/time is always at the beginning of the line.

The biggest issue hinges upon the fact that I don't know what the exact format is going to be, so the script is going to have to deal with the formatting string itself. Using this information, my original solution was to use the wonderful DateTime::Format::Strptime module. However, disaster soon struck. I discovered that with some lines in the log file, I could not reliably separate the timestamp and the text of the log entry, and thus could not figure out what to pass to Strptime and what to treat as log entry text.

My initial idea for a solution was to generate a regular expression from the date formatting string so that I could separate the date to easily glean the text of each log entry. Here is some sample code for the kind of setup I was planning:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $timeformat = "*%H:%M:%S%% >"; # Example.
    my %replacements = (
        '%' => '\%',
        'a' => '[[:alpha:]]+',
        'H' => '\d{2}',
        'M' => '\d{2}',
        'S' => '\d{2}'
    );
    $timeformat = quotemeta($timeformat);
    $timeformat =~ s/\\\%\\?(.)/$replacements{$1}/eg;
    print ("The regular expression is: $timeformat\n");

However, during the writing of this, I realized that I would also have to deal with locales! (Apparently, some of the formatting tokens are locale-specific.) Additionally, since I'm already writing regex to extract the various values in the datestamp, I might as well also parse it to a DateTime object myself (as speed is a consideration, and Strptime alone is already a little slow), but this too introduces locale issues (weekday names, month names, AM/PM, etc.)
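One possible way to handle the locale-specific tokens is to ask strftime itself for the names and build a regex alternation from them. The following is only a sketch under assumptions: the helper name `locale_alternation` is hypothetical, and it presumes the log was written under the same LC_TIME locale the script runs in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Adopt the LC_TIME locale from the environment (assumption: it matches
# the locale that produced the log file).
setlocale(LC_TIME, '');

# Build a regex alternation from strftime output for one token, e.g.
# '%b' over the twelve months or '%a' over seven consecutive days.
sub locale_alternation {
    my ($token, @tm_lists) = @_;
    my %seen;
    my @names = grep { length($_) && !$seen{$_}++ }
                map  { strftime($token, @$_) } @tm_lists;
    # Longest names first so a short name cannot shadow a longer one.
    return join '|',
           map  { quotemeta($_) }
           sort { length($b) <=> length($a) } @names;
}

# strftime takes (sec, min, hour, mday, mon, year-1900, ...).
my @months   = map { [0, 0, 0, 1, $_, 109] } 0 .. 11;  # every month of 2009
my @weekdays = map { [0, 0, 0, $_, 0, 109] } 1 .. 7;   # seven consecutive days

my $month_re = locale_alternation('%b', @months);
my $wday_re  = locale_alternation('%a', @weekdays);

print "Month regex:   (?:$month_re)\n";
print "Weekday regex: (?:$wday_re)\n";
```

The resulting strings could replace the generic `[[:alpha:]]+` entries in a token table. Whether that is worth the extra work depends on how often locale-specific tokens actually appear in the formats.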

Surely there is a better way to do this? Wise monks, please release me from my insanity.
Thank you for taking the time to read.

(update)
Sorry, it looks like I've left pretty much everyone confused. In summary, here's the issue: due to the timestamps having many possible formats, I can't figure out how to reliably separate them from the rest of the line. The goal is to extract the data from the line without capturing part of the timestamp and then to parse the timestamp into a DateTime object.

And now for some sample data. (Although, I'm not sure how much help it will be...)
Here are some sample lines of input:

    09:12: 5:14:29-!- {more garbage goes here}
    09:12: 5:14:37
    09:12: 5:14:37

In this data sample, the first timestamp is "09:12: 5:14:29", corresponding to the format "%y:%m:%e:%H:%M". The last two lines have no data, only a timestamp.

Here are some more (with a different timestamp format):

    2008-12-12 00:39 * {more stuff here}
    2008-12-12 01:17 < {data here}
    2008-12-12 01:30
    2008-12-12 01:31

The format in this sample is "%F %H:%M " (with an extra space at the end), and the data for the first line is " * {more stuff here}" (with a space at the beginning). On the second line, the data is "< {data here}". The last two lines have no data, only timestamp.

Since I only care about separating the timestamp and the rest of the line, I don't have to actually parse the varying data formats. I just need to somehow parse and remove the timestamp portion of each line, given the strftime format. In the examples, the strftime formats (which are accessible to the script) can be converted to the two regular expressions below, respectively:

    (\d{2})\:(\d{2})\:([\d\s]\d)\:(\d{2})\:(\d{2})
    (\d{4})\-(\d{2})\-(\d{2})\ (\d{2})\:(\d{2})\ 
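To illustrate how such a regex separates the timestamp from the rest of the line, here is a minimal sketch for the "%F %H:%M " sample. The function name `split_line` is made up for this example, and making the trailing space optional is an assumption so that timestamp-only lines with trimmed whitespace still match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written regex for "%F %H:%M ". It is anchored with \A because the
# timestamp always starts the line. The trailing space is optional here
# (an assumption, see above).
my $stamp_re = qr/\A(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}) ?/;

# Returns (\@timestamp_fields, $rest_of_line), or an empty list when the
# line does not start with a timestamp.
sub split_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ $stamp_re;
    my @fields = ($1, $2, $3, $4, $5);
    my $rest   = substr $line, $+[0];   # $+[0] is the offset where the match ended
    return (\@fields, $rest);
}

for my $line ('2008-12-12 00:39 * {more stuff here}', '2008-12-12 01:30') {
    my ($fields, $rest) = split_line($line);
    printf "stamp=[%s] data=[%s]\n", join(' ', @$fields), $rest;
}
```

Taking the remainder with `substr` and `$+[0]` means the data portion is never matched by the regex at all, so nothing in the log text can confuse the timestamp pattern.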

Re: Parsing arbitrarily-formatted timestamps out of log file entries
by GrandFather (Cardinal) on Dec 11, 2009 at 23:19 UTC

    How about showing us some sample data (especially the difficult edge cases)?

    Actually, having reread your node several times it's still not clear to me what it is that you are trying to achieve. Your title implies that you need to parse log files, but the node content looks more like you need to generate log file parsers based on some sort of meta description. It's sorta important to know what it is you want to achieve and sample data along with a little sample code would help a lot.


    True laziness is hard work
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by chuckbutler (Vicar) on Dec 11, 2009 at 23:44 UTC

    Do not underestimate using unpack for extracting a date and/or time sequence from a record. Also, split the record, using split, into manageable fields. Normally, there is a TAB or a COLON separating the fields. A Google of “unix log file format” may also shed some light. Good luck.
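As a rough illustration of the unpack suggestion: it only applies when the strftime format always renders to the same width, which is true for "%y:%m:%e:%H:%M" (14 characters) but not for formats containing names such as %a or %B. The helper name `cut_fixed_width` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "%y:%m:%e:%H:%M" always renders as 14 characters ("09:12: 5:14:29"),
# so a fixed-width template can cut it off without any regex.
# A14 = 14-character field, A* = the remainder of the line.
# Note that the A format strips trailing spaces from each field.
sub cut_fixed_width {
    my ($line) = @_;
    chomp $line;
    return unpack 'A14 A*', $line;
}

my ($stamp, $rest) = cut_fixed_width("09:12: 5:14:29-!- {more garbage goes here}\n");
print "stamp=[$stamp] data=[$rest]\n";
```

This does no validation, so a line that does not begin with a timestamp is cut at the same position regardless of its content.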

Re: Parsing arbitrarily-formatted timestamps out of log file entries
by Anonymous Monk on Dec 12, 2009 at 02:53 UTC
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by Ieronim (Friar) on Dec 12, 2009 at 20:03 UTC
    The simplest way is to split the input string and call Date::Manip::ParseDate repeatedly on the resulting array, as it does exactly what you need, i.e. it removes the part containing the date from the beginning of the input without any knowledge of the format.

    Example:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Date::Manip qw(); # I strongly recommend version 5.54
        our $TZ = 'GMT'; # to avoid error message
        while (<DATA>) {
            s/(-!-)/ $1/; #to deal with strange comment format
            my @line = split /(\s+)/;
            1 while (Date::Manip::ParseDate(\@line));
            $_ = (join "", @line) || "\n";
            print;
        }
        __DATA__
        09:12: 5:14:29-!- {more garbage goes here}
        09:12: 5:14:37
        09:12: 5:14:37
        2008-12-12 00:39 * {more stuff here}
        2008-12-12 01:17 < {data here}
        2008-12-12 01:30
        2008-12-12 01:31

    However, you need to test this method on all types of your data to check if it 'eats' all your bizarre date formats.
Re: Parsing arbitrarily-formatted timestamps out of log file entries
by mr_flea (Novice) on Dec 12, 2009 at 23:01 UTC

    I ended up removing the dates by generating a regex from the strftime format to match them, using this:

        use Carp qw(croak);

        sub timestamp2regex {
            my $exp = shift;
            # Compound tokens that expand to other strftime tokens.
            my %metareplacements = (
                'D' => '%m/%d/%y',
                'F' => '%Y-%m-%d',
                'r' => '%I:%M:%S %p',
                'R' => '%H:%M',
                'T' => '%H:%M:%S'
            );
            # Simple tokens and the regex fragment that matches each one.
            my %replacements = (
                'a' => '[[:alpha:]]+',
                'A' => '[[:alpha:]]+',
                'b' => '[[:alpha:]]+',
                'B' => '[[:alpha:]]+',
                'd' => '\d{2}',
                'e' => '[\d\s]\d',
                'g' => '\d{2}',
                'G' => '\d{4}',
                'h' => '[[:alpha:]]+',
                'H' => '\d{2}',
                'I' => '\d{2}',
                'j' => '\d{3}',
                'k' => '[\d\s]\d',
                'l' => '[\d\s]\d',
                'm' => '\d{2}',
                'M' => '\d{2}',
                'p' => '[A-Za-z.]{2,}',
                'P' => '[A-Za-z.]{2,}',
                's' => '\d+',
                'S' => '\d{2}',
                't' => '\t',
                'u' => '\d',
                'U' => '\d{2}',
                'V' => '\d{2}',
                'w' => '\d',
                'W' => '\d{2}',
                'y' => '\d{2}',
                'Y' => '\d{4}',
                'z' => '[+-]\d{4}',
                'Z' => '[[:alpha:]]*',
                '%' => '\%'
            );
            # Escape literal characters first; "%X" then reads "\%X" (or "\%\%").
            $exp = quotemeta($exp);
            $exp =~ s/\\\%\\?(.)/
                if (defined $metareplacements{$1}) {
                    timestamp2regex($metareplacements{$1});
                } elsif (defined $replacements{$1}) {
                    $replacements{$1};
                } else {
                    croak "Unsupported or unrecognized timestamp format token: \%$1.";
                }/eg;
            return $exp;
        }

    (This turned out to be much easier to write than I expected, after I gave up with locales.)

    This isn't completely ideal, because it doesn't accept anything locale-related (it will croak on %c, %E, %O, %x, and %X), but I don't think those are actually going to be used. After writing this, I discovered Regexp::Common::time, which appears to be exactly what I was after (and somewhat what this code does), but it's much longer than my code, and I'm not sure if it handles certain things (like non-English AM/PM) as well as my code does. If I run into any locale issues with mine, though, I'll probably switch to that.
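Since the original goal also included turning the timestamp into a date object, here is a hedged sketch of how a generated regex could capture the fields by name and convert them with the core Time::Local module. The token table, `format2capture_regex`, and `parse_line` are hypothetical names covering only six numeric tokens, and the timestamps are assumed to be in UTC. The resulting epoch value could then be passed to something like `DateTime->from_epoch(epoch => ...)`.

```perl
#!/usr/bin/perl
use 5.010;    # named captures (%+) and the // operator
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical token table: each token maps to a named capture group.
my %capture = (
    Y => '(?<year>\d{4})', m => '(?<mon>\d{2})', d => '(?<mday>\d{2})',
    H => '(?<hour>\d{2})', M => '(?<min>\d{2})', S => '(?<sec>\d{2})',
);

sub format2capture_regex {
    my ($format) = @_;
    my $re = quotemeta $format;
    # After quotemeta, "%Y" reads "\%Y"; swap each known token for its group.
    $re =~ s{\\%(\w)}{ $capture{$1} // die "unsupported token %$1\n" }ge;
    return qr/\A$re/;
}

# Returns ($epoch, $rest_of_line), or an empty list if nothing matched.
sub parse_line {
    my ($regex, $line) = @_;
    return unless $line =~ $regex;
    my %t    = %+;                    # copy the named captures
    my $rest = substr $line, $+[0];   # text after the timestamp
    # Fields missing from the format fall back to defaults (assumption).
    my $epoch = timegm($t{sec} // 0, $t{min} // 0, $t{hour} // 0,
                       $t{mday} // 1, ($t{mon} // 1) - 1, $t{year} // 1970);
    return ($epoch, $rest);
}

my $regex = format2capture_regex('%Y-%m-%d %H:%M ');
my ($epoch, $rest) = parse_line($regex, '2008-12-12 01:17 < {data here}');
print scalar(gmtime $epoch), " | $rest\n";
```

Doing the capture and the split in a single match avoids running a separate strptime pass over each line, which was one of the speed concerns raised in the question.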
