good chemistry is complicated, and a little bit messy -LW |
|
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
Hello, I recently embarked on a journey to parse log files of arbitrary format. I've mostly got it down now, except for the timestamp format. The only information accessible to the script regarding the timestamp format is the following:
The biggest issue hinges upon the fact that I don't know what the exact format is going to be, so the script is going to have to deal with the formatting string itself. Using this information, my original solution was to use the wonderful DateTime::Format::Strptime module. However, disaster soon struck. I discovered that with some lines in the log file, I could not reliably separate the timestamp and the text of the log entry, and thus could not figure out what to pass to Strptime and what to treat as log entry text. My initial idea for a solution was to generate a regular expression from the date formatting string so that I could separate the date to easily glean the text of each log entry. Here is some sample code for the kind of setup I was planning:
However, during the writing of this, I realized that I would also have to deal with locales! (Apparently, some of the formatting tokens are locale-specific.) Additionally, since I'm already writing regex to extract the various values in the datestamp, I might as well also parse it to a DateTime object myself (as speed is a consideration, and Strptime alone is already a little slow), but this too introduces locale issues (weekday names, month names, AM/PM, etc.)
Surely there is a better way to do this? Wise monks, please release me from my insanity. (update) And now for some sample data. (Although, I'm not sure how much help it will be...)
In this data sample, the first timestamp is "09:12: 5:14:29" corresponding to the format "%y:%m:%e:%H:%M". The second two lines have no data. Here are some more (with a different timestamp format):
The format in this sample is "%F %H:%M " (with an extra space at the end), and the data for the first line is " * {more stuff here}" (with a space at the beginning). On the second line, the data is "< {data here}". The last two lines have no data, only timestamp. Since I only care about separating the timestamp and the rest of the line, I don't have to actually parse the varying data formats. I just need to somehow parse and remove the timestamp portion of each line, given the strftime format. In the examples, the strftime formats (which are accessible to the script) can be converted to the two regular expressions below, respectively:
|
|