PerlMonks  

Parsing a file in "chunks"

by vxp (Pilgrim)
on Aug 04, 2010 at 20:20 UTC ( #852963=perlquestion )
vxp has asked for the wisdom of the Perl Monks concerning the following question:

I have a Tomcat log file that I need to parse. It looks like a regular Tomcat log:

    [datestamp] [DEBUG] something here
    [datestamp] [DEBUG] something else here
    [datestamp] [ERROR] blah blah
    <chunk of crap>
    [datestamp] [INFO] bla bla bla

What I need to do is parse this thing, treating each [datestamp] as the start of a 'chunk' and the next [datestamp] as the end of that chunk (and, obviously, the beginning of a new one). In essence, I need to check whether the <chunk of crap> between [datestamp #1] and [datestamp #2] contains a specific string. Any ideas?

Edit: real snippet of the log file to be parsed:

    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.666 2010] [ERROR] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] Got an error executing query "fetchJobQueueTime" chunk "5".
    com.boylesoftware.cb2.BLException: Database error in the DAO.
        at com.boylesoftware.cb2.DAO.executeUpdate_Internal(DAO.java:2483) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.executeNamedQuery(DAO.java:2634) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetch_Internal(DAO.java:2524) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:2721) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:3826) [cb2ms.jar:na]
        at com.somecompany.insidetrack.tt.project.ProjConsoleBLO.fetchJobQueueTime(ProjConsoleBLO.java:527) [ProjConsoleBLO.class:na]
        at com.somecompany.insidetrack.tt.presentation.workflow.JobOverviewListPE.init(JobOverviewListPE.java:221) [JobOverviewListPE.class:na]
        at com.boylesoftware.cb2.presentation.servlet.ShowPageAction.execute(ShowPageAction.java:141) [cb2ms.jar:na]
        at com.boylesoftware.cb2.presentation.servlet.CB2Action.execute(CB2Action.java:205) [cb2ms.jar:na]
        at org.apache.struts.action.RequestProcessor.processActionPerform(Unknown Source) [struts.jar:1.1]
        at java.lang.Thread.run(Thread.java:619) [na:1.6.0_16]
    Caused by: msjava.dbpool.DBPoolSQLException: The column prefix '#projJob' does not match with a table name or alias name used in the query. Either the table is not specified in the FROM clause or it has a correlation name which must be used instead.
    (DataSource: insidetrack-db, Type: SYBASE)
    [Wed Aug 04 00:10:40.666 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] executing chunk "6" with a timeout of 360 seconds

The code would need to go through these "chunks", find the "DBPoolSQLException", and associate it with the datestamp it appears under ([Wed Aug 04 00:10:40.666 2010]).

Re: Parsing a file in "chunks"
by toolic (Chancellor) on Aug 04, 2010 at 20:28 UTC
    UPDATE: The OP was changed in a major way after my original reply. This no longer applies...

    It seems like you just want to ignore your datestamp lines:

    use strict;
    use warnings;

    while (<DATA>) {
        next if /\[datestamp\]/;    # skip the datestamp lines
        print if /crap/;            # print the remaining lines that match
    }

    __DATA__
    [datestamp] [DEBUG] something here
    [datestamp] [DEBUG] something else here
    [datestamp] [ERROR] blah blah
    <chunk of crap>
    [datestamp] [INFO] bla bla bla

      No, I can't ignore them. They are needed so that the script can say something like "there were 17 sqlexception errors within the past 15 mins" or "only 1 suchandsuch error within the past 7 minutes from now", etc.
Re: Parsing a file in "chunks"
by zek152 (Pilgrim) on Aug 04, 2010 at 20:46 UTC

    I don't have time to write out the full code, but my approach would be as follows:

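    use strict;
    use warnings;

    # open the log (file name taken from the command line)
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";

    my $chunk_string      = "";
    my $current_datestamp = "";

    # loop through
    while (my $line = <$fh>) {
        # do any formatting (chomp etc)
        if ($line =~ /^(\[datestamp\])/) {    # appropriate regex here
            if ($chunk_string =~ /whateveryouarelookingfor/) {
                # you found it in the chunk that just ended
                # (it belongs to $current_datestamp)
            }
            $current_datestamp = $1;                 # update the datestamp
            $line =~ s/^\Q$current_datestamp\E//;    # ignore the date
            $chunk_string = "";                      # reset the chunk
        }
        $chunk_string .= $line;    # add the line to the current chunk
    }

    # no datestamp follows the last chunk, so check it after the loop
    if ($chunk_string =~ /whateveryouarelookingfor/) {
        # you found it in the final chunk
    }
    close $fh;
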
Re: Parsing a file in "chunks"
by oko1 (Deacon) on Aug 05, 2010 at 00:22 UTC

    Since you haven't told us how you plan to define which chunk of the log you want to examine, I'm going to leave that part of the coding alone (except to mention that parsing datestamps isn't all that easy - I'm speaking from personal experience here. :) Splitting up the logfile into datestamp-delimited chunks, though, isn't all that tough:

    #!/usr/bin/perl -w
    use strict;

    my ($ts, %store);
    while (<DATA>){
        $ts = $1 if /^(\[[^\]]+])/;
        $store{$ts} = $1 if /(DBPoolSQLException:.*)/;
    }

    for (keys %store){
        # Qualify this with whatever time-stamp matching you want
        print "$_: $store{$_}\n";
    }

    __END__
    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.666 2010] [ERROR] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] Got an error executing query "fetchJobQueueTime" chunk "5".
    com.boylesoftware.cb2.BLException: Database error in the DAO.
        at com.boylesoftware.cb2.DAO.executeUpdate_Internal(DAO.java:2483) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.executeNamedQuery(DAO.java:2634) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetch_Internal(DAO.java:2524) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:2721) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:3826) [cb2ms.jar:na]
        at com.somecompany.insidetrack.tt.project.ProjConsoleBLO.fetchJobQueueTime(ProjConsoleBLO.java:527) [ProjConsoleBLO.class:na]
        at com.somecompany.insidetrack.tt.presentation.workflow.JobOverviewListPE.init(JobOverviewListPE.java:221) [JobOverviewListPE.class:na]
        at com.boylesoftware.cb2.presentation.servlet.ShowPageAction.execute(ShowPageAction.java:141) [cb2ms.jar:na]
        at com.boylesoftware.cb2.presentation.servlet.CB2Action.execute(CB2Action.java:205) [cb2ms.jar:na]
        at org.apache.struts.action.RequestProcessor.processActionPerform(Unknown Source) [struts.jar:1.1]
        at java.lang.Thread.run(Thread.java:619) [na:1.6.0_16]
    Caused by: msjava.dbpool.DBPoolSQLException: The column prefix '#projJob' does not match with a table name or alias name used in the query. Either the table is not specified in the FROM clause or it has a correlation name which must be used instead.
    (DataSource: insidetrack-db, Type: SYBASE)
    [Wed Aug 04 00:10:40.666 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] executing chunk "6" with a timeout of 360 seconds

    Running the above produces the following:

    [Wed Aug 04 00:10:40.666 2010]: DBPoolSQLException: The column prefix '#projJob' does not match with a table name or alias name used in the query. Either the table is not specified in the FROM clause or it has a correlation name which must be used instead.

    In other words, for any chunk that contains a 'DBPoolSQLException', the exception message is stored in the hash under that chunk's datestamp (one message per datestamp) and displayed. Again, you should qualify which chunks you want to analyze before you collect the data; otherwise, you'll be flooded with reports.
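
    One possible way to do the time-stamp qualification, as a rough sketch rather than a finished solution. It assumes the datestamps always look like "[Wed Aug 04 00:10:40.666 2010]", drops the milliseconds, and treats the stamps as UTC (a log written in local time would need an offset adjustment). The helper names stamp_to_epoch and count_recent, and the shortened sample chunks, are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;    # core module since Perl 5.10

# Convert a log datestamp such as "[Wed Aug 04 00:10:40.666 2010]" to epoch
# seconds. Assumption: the stamp is treated as UTC and the milliseconds are
# simply dropped.
sub stamp_to_epoch {
    my ($stamp) = @_;
    (my $clean = $stamp) =~ s/^\[|\]$//g;    # strip the square brackets
    $clean =~ s/\.\d+//;                     # drop the ".666" fraction
    return Time::Piece->strptime($clean, '%a %b %d %H:%M:%S %Y')->epoch;
}

# Count the chunks whose text matches $pattern and whose datestamp lies
# within the $window seconds before $now.
# $chunks is a reference to a list of [stamp, text] pairs.
sub count_recent {
    my ($chunks, $pattern, $now, $window) = @_;
    my $count = 0;
    for my $chunk (@$chunks) {
        my ($stamp, $text) = @$chunk;
        next unless $text =~ $pattern;
        my $age = $now - stamp_to_epoch($stamp);
        $count++ if $age >= 0 && $age <= $window;
    }
    return $count;
}

# Hypothetical, shortened chunks standing in for the parsed log.
my @chunks = (
    [ '[Wed Aug 04 00:10:40.591 2010]', "[DEBUG] Retrieve cached object\n" ],
    [ '[Wed Aug 04 00:10:40.666 2010]',
      "[ERROR] Got an error\nCaused by: msjava.dbpool.DBPoolSQLException: sample\n" ],
    [ '[Tue Aug 03 22:00:00.000 2010]',
      "[ERROR] Got an error\nCaused by: msjava.dbpool.DBPoolSQLException: sample\n" ],
);

# A fixed "now" keeps the example repeatable; a real script would use time().
my $now = stamp_to_epoch('[Wed Aug 04 00:20:00.000 2010]');
my $n   = count_recent(\@chunks, qr/DBPoolSQLException/, $now, 15 * 60);
print "$n DBPoolSQLException error(s) within the past 15 minutes\n";
```

    The [stamp, text] pairs could be filled by any of the chunk-splitting loops in this thread; only the comparison against the time window is new here.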


    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
Re: Parsing a file in "chunks"
by ahmad (Hermit) on Aug 05, 2010 at 00:26 UTC

    Here's my try. You may adjust it as you want:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;
    my $CURRENT_DATE = '';

    while (<DATA>) {
        chomp;
        if ( $_ !~ m/^\[(Sat|Sun|Mon|Tue|Wed|Thu|Fri)/ ) {
            ## keep adding lines into the same array element
            ${ $data{$CURRENT_DATE} }[-1] .= $_;
        }
        else {
            # We've got a new date?
            my ($date) = $_ =~ m/^\[(.*?)\]/;
            $CURRENT_DATE = $date;

            # if that key already exists, then we'll push elements
            if ( exists $data{$CURRENT_DATE} ) {
                push @{ $data{$CURRENT_DATE} }, $_;
            }
            else {
                # otherwise, we'll create a key that has an array as its value
                $data{$CURRENT_DATE} = [$_];
            }
        }
    }

    # count only the [ERROR] entries logged under this datestamp
    print "There were ",
        scalar( grep { /\[ERROR\]/ } @{ $data{'Wed Aug 04 00:10:40.666 2010'} } ),
        " Errors on Wed Aug 04 00:10:40.666 2010\n";

    __DATA__
    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.591 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.BLOBJT.cache]: [208973459] Retrieve cached object 'segmentProp.map' for segment 142
    [Wed Aug 04 00:10:40.666 2010] [ERROR] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] Got an error executing query "fetchJobQueueTime" chunk "5".
    com.boylesoftware.cb2.BLException: Database error in the DAO.
        at com.boylesoftware.cb2.DAO.executeUpdate_Internal(DAO.java:2483) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.executeNamedQuery(DAO.java:2634) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetch_Internal(DAO.java:2524) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:2721) [cb2ms.jar:na]
        at com.boylesoftware.cb2.DAO.fetchWithNamedParams(DAO.java:3826) [cb2ms.jar:na]
        at com.somecompany.insidetrack.tt.project.ProjConsoleBLO.fetchJobQueueTime(ProjConsoleBLO.java:527) [ProjConsoleBLO.class:na]
        at com.somecompany.insidetrack.tt.presentation.workflow.JobOverviewListPE.init(JobOverviewListPE.java:221) [JobOverviewListPE.class:na]
        at com.boylesoftware.cb2.presentation.servlet.ShowPageAction.execute(ShowPageAction.java:141) [cb2ms.jar:na]
        at com.boylesoftware.cb2.presentation.servlet.CB2Action.execute(CB2Action.java:205) [cb2ms.jar:na]
        at org.apache.struts.action.RequestProcessor.processActionPerform(Unknown Source) [struts.jar:1.1]
        at java.lang.Thread.run(Thread.java:619) [na:1.6.0_16]
    Caused by: msjava.dbpool.DBPoolSQLException: The column prefix '#projJob' does not match with a table name or alias name used in the query. Either the table is not specified in the FROM clause or it has a correlation name which must be used instead.
    (DataSource: insidetrack-db, Type: SYBASE)
    [Wed Aug 04 00:10:40.666 2010] [DEBUG] [TP-Processor19, time=1280895040587, uri=/ibdsupport/workflow/webapp/workflow/coordinator/gotoCoordinatorNew.cb2] [com.boylesoftware.cb2.DAOBJT]: [208973534] executing chunk "6" with a timeout of 360 seconds
Re: Parsing a file in "chunks"
by aquarium (Curate) on Aug 05, 2010 at 02:55 UTC
    First of all, I hope you already have the "manager" application for Tomcat, which is a useful basic monitoring tool. Also, look at the most filtered log available (the DB.log), so that you don't parse a more general and much more active log. The Tomcat logs should be set up that way. You might also run into the problem (at some stage) of the Tomcat logs being rotated.
    Another way to tackle the original problem is to join every continuation line, with a space, onto the preceding line that starts with a date. That shouldn't be too difficult with a regex. Then you merely grep for your DBPool or SQLException or whatever.
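
    The join-then-grep idea might look roughly like the sketch below. It assumes that every record starts with a "[Day Mon ..." datestamp and that any other line is a continuation; the function name join_records and the shortened sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join every line that does not start with a "[Day Mon ..." datestamp onto
# the preceding datestamped line, separated by a single space.
sub join_records {
    my (@lines) = @_;
    my @records;
    for my $line (@lines) {
        chomp(my $text = $line);
        if ($text =~ /^\[(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) / || !@records) {
            push @records, $text;        # a new record starts here
        }
        else {
            $text =~ s/^\s+//;           # trim stack-trace indentation
            $records[-1] .= " $text";    # continuation of the last record
        }
    }
    return @records;
}

# Hypothetical, shortened sample standing in for the real log.
my @lines = (
    "[Wed Aug 04 00:10:40.591 2010] [DEBUG] Retrieve cached object\n",
    "[Wed Aug 04 00:10:40.666 2010] [ERROR] Got an error executing query\n",
    "com.boylesoftware.cb2.BLException: Database error in the DAO.\n",
    "    at com.boylesoftware.cb2.DAO.fetch_Internal(DAO.java:2524)\n",
    "Caused by: msjava.dbpool.DBPoolSQLException: sample message\n",
    "[Wed Aug 04 00:10:40.666 2010] [DEBUG] executing chunk \"6\"\n",
);

my @records = join_records(@lines);

# Each record is now a single line that still begins with its datestamp,
# so a plain grep is enough.
my @hits = grep { /DBPoolSQLException/ } @records;
print "$_\n" for @hits;
```

    Because each joined record keeps its leading datestamp, the matching lines can be fed straight into whatever time filtering is needed afterwards.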
    the hardest line to type correctly is: stty erase ^H
Re: Parsing a file in "chunks"
by repellent (Priest) on Aug 05, 2010 at 04:53 UTC
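    use warnings;
    use strict;

    my $i = -1;
    my @chunks;

    while (<DATA>) {
        # a line starting with "[datestamp] [LEVEL]" begins a new chunk
        ++$i if /^\[.+?\] \[[A-Z]+\]/;

        # append the line to the current chunk (skip anything before the first one)
        $chunks[$i] .= $_ if $i >= 0;
    }

    # print every chunk that mentions the exception
    print grep { /DBPoolSQLException/ } @chunks;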
