Hello, I recently embarked on a journey to parse log files of arbitrary format. I've mostly got it down now, except for the timestamp format. The only information accessible to the script regarding the timestamp format is the following:

The format of the date as passed to strftime (such as %H:%M:%S).
The date/time is always at the beginning of the line.

The biggest issue hinges upon the fact that I don't know what the exact format is going to be, so the script is going to have to deal with the formatting string itself. Using this information, my original solution was to use the wonderful DateTime::Format::Strptime module. However, disaster soon struck. I discovered that with some lines in the log file, I could not reliably separate the timestamp and the text of the log entry, and thus could not figure out what to pass to Strptime and what to treat as log entry text.

My initial idea for a solution was to generate a regular expression from the date formatting string so that I could separate the date to easily glean the text of each log entry. Here is some sample code for the kind of setup I was planning:

#!/usr/bin/perl
use strict;
use warnings;

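my $timeformat = "*%H:%M:%S%% >"; # Example.

# Regex fragment for each supported strftime conversion specifier.
my %replacements = (
    '%' => '\%',
    'a' => '[[:alpha:]]+',
    'H' => '\d{2}',
    'M' => '\d{2}',
    'S' => '\d{2}',
);

# Escape the literal characters first; quotemeta also turns each "%" into "\%".
$timeformat = quotemeta($timeformat);
# Replace each escaped specifier ("\%H", or "\%\%" for a literal percent sign),
# and fail loudly on a specifier that has no entry in the table.
$timeformat =~ s{\\%\\?(.)}{ $replacements{$1} // die "Unsupported specifier: %$1\n" }eg;

print "The regular expression is: $timeformat\n";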

However, while writing this, I realized that I would also have to deal with locales! (Apparently, some of the formatting tokens are locale-specific.) Additionally, since I'm already writing a regex to extract the various values in the timestamp, I might as well parse it into a DateTime object myself (speed is a consideration, and Strptime alone is already a little slow), but this too introduces locale issues (weekday names, month names, AM/PM, etc.).
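One possible way to cope with the locale-specific tokens is to ask strftime itself which names it produces and build an alternation from them. The following is only a sketch: the helper name names_regex and the %locale_replacements table are made up for illustration, and it assumes the log was written under the "C" locale (a real script would set whichever locale produced the log).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Assumption: the log was written under the "C" locale. A real script would
# set whatever locale produced the log file instead.
setlocale(LC_TIME, 'C');

# Hypothetical helper: collect every name strftime emits for one specifier
# over the given dates and join the names into a single alternation.
sub names_regex {
    my ($specifier, @dates) = @_;    # each date is [mday, mon]
    my %seen;
    for my $date (@dates) {
        my ($mday, $mon) = @$date;
        # Year 100 means 2000; POSIX::strftime works out the weekday itself.
        $seen{ strftime($specifier, 0, 0, 0, $mday, $mon, 100) } = 1;
    }
    # Longest names first so a short name cannot shadow a longer one.
    my @names = sort { length($b) <=> length($a) or $a cmp $b } keys %seen;
    return '(?:' . join('|', map { quotemeta($_) } @names) . ')';
}

my %locale_replacements = (
    'a' => names_regex('%a', map { [$_, 0] } 1 .. 7),     # seven consecutive days
    'b' => names_regex('%b', map { [1, $_] } 0 .. 11),    # first day of each month
);

print "$_ => $locale_replacements{$_}\n" for sort keys %locale_replacements;
```

The same approach should extend to %A, %B and %p, at the cost of rebuilding the table whenever the locale changes.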

Surely there is a better way to do this? Wise monks, please release me from my insanity.
Thank you for taking the time to read.

(update)
Sorry, it looks like I've left pretty much everyone confused. In summary, here's the issue: due to the timestamps having many possible formats, I can't figure out how to reliably separate them from the rest of the line. The goal is to extract the data from the line without capturing part of the timestamp and then to parse the timestamp into a DateTime object.

And now for some sample data. (Although, I'm not sure how much help it will be...)
Here are some sample lines of input:

09:12: 5:14:29-!- {more garbage goes here}
09:12: 5:14:37
09:12: 5:14:37

In this data sample, the first timestamp is "09:12: 5:14:29" corresponding to the format "%y:%m:%e:%H:%M", and its data is "-!- {more garbage goes here}". The other two lines contain only a timestamp and no data.

Here are some more (with a different timestamp format):

2008-12-12 00:39  * {more stuff here}
2008-12-12 01:17 < {data here}
2008-12-12 01:30 
2008-12-12 01:31

The format in this sample is "%F %H:%M " (with an extra space at the end), and the data for the first line is " * {more stuff here}" (with a space at the beginning). On the second line, the data is "< {data here}". The last two lines have no data, only a timestamp.
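To make the intended split concrete, here is a small sketch for this one format only. The pattern is written by hand, split_line is a hypothetical helper, and it assumes %F expands to %Y-%m-%d. It also makes the trailing space optional at the end of a line, on the assumption that lines without data may have lost it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written pattern for "%F %H:%M " (assumption: %F means %Y-%m-%d).
# The trailing space is optional only at the very end of the line.
my $timestamp_re = qr/\A(\d{4}-\d{2}-\d{2} \d{2}:\d{2})(?: |\z)/;

# Hypothetical helper: returns (timestamp, data), or an empty list when the
# line does not start with a timestamp.
sub split_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ $timestamp_re;
    # $+[0] is the offset just past the matched timestamp.
    return ($1, substr($line, $+[0]));
}

for my $line (
    '2008-12-12 00:39  * {more stuff here}',
    '2008-12-12 01:17 < {data here}',
    '2008-12-12 01:31',
) {
    my ($timestamp, $data) = split_line($line);
    print "[$timestamp] [$data]\n";
}
```

Anchoring with \A matters here: the timestamp is always at the beginning of the line, so anything the pattern does not consume is log entry text.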

Since I only care about separating the timestamp and the rest of the line, I don't have to actually parse the varying data formats. I just need to somehow parse and remove the timestamp portion of each line, given the strftime format. In the examples, the strftime formats (which are accessible to the script) can be converted to the two regular expressions below, respectively:

(\d{2})\:(\d{2})\:([\d\s]\d)\:(\d{2})\:(\d{2})
(\d{4})\-(\d{2})\-(\d{2})\ (\d{2})\:(\d{2})[ ]
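The conversion from a strftime format to a capturing regex could be automated roughly as follows. This is a sketch under stated assumptions, not a complete solution: compile_format, parse_line, the %specifiers table and the field names are all invented for illustration, only a handful of numeric specifiers are covered, and locale-dependent tokens are left out entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each supported specifier maps to [regex fragment, names of captured fields].
# Assumption: only these specifiers occur; anything else is rejected.
my %specifiers = (
    'Y' => ['(\d{4})',   'year'],
    'y' => ['(\d{2})',   'year2'],
    'm' => ['(\d{2})',   'month'],
    'd' => ['(\d{2})',   'day'],
    'e' => ['([\d ]\d)', 'day'],
    'H' => ['(\d{2})',   'hour'],
    'M' => ['(\d{2})',   'minute'],
    'S' => ['(\d{2})',   'second'],
    'F' => ['(\d{4})-(\d{2})-(\d{2})', 'year', 'month', 'day'],
    '%' => ['%'],
);

# Hypothetical helper: compile a strftime format into an anchored regex
# plus the list of field names in capture order.
sub compile_format {
    my ($format) = @_;
    my ($pattern, @fields) = ('');
    # Walk the format: either a "%x" specifier or a run of literal text.
    while ($format =~ /\G(?:%(.)|([^%]+))/gcs) {
        if (defined $1) {
            my $spec = $specifiers{$1} or die "Unsupported specifier: %$1\n";
            my ($fragment, @names) = @$spec;
            $pattern .= $fragment;
            push @fields, @names;
        }
        else {
            $pattern .= quotemeta($2);
        }
    }
    die "Dangling % at end of format\n"
        if (pos($format) // 0) < length $format;
    return (qr/\A$pattern/, \@fields);
}

# Hypothetical helper: returns (hash ref of fields, rest of line), or an
# empty list if the line does not start with a timestamp.
sub parse_line {
    my ($regex, $fields, $line) = @_;
    my @values = $line =~ $regex or return;
    my $rest = substr($line, $+[0]);    # everything after the timestamp
    s/^ // for @values;                 # %e pads single digits with a space
    my %time;
    @time{@$fields} = @values;
    return (\%time, $rest);
}

my ($regex, $fields) = compile_format('%y:%m:%e:%H:%M');
my ($time, $rest) =
    parse_line($regex, $fields, '09:12: 5:14:29-!- {more garbage goes here}');
print "day=$time->{day} hour=$time->{hour} rest=$rest\n";
```

The hash of fields could then be handed to a DateTime constructor, which would avoid running Strptime on every line. Whether that is actually faster would need measuring.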

In reply to Parsing arbitrarily-formatted timestamps out of log file entries by mr_flea

	PerlMonks