PerlMonks  

Re: Working with a very large log file (parsing data out)

by BrowserUk (Pope)
on Feb 20, 2013 at 07:50 UTC (#1019728)


in reply to Working with a very large log file (parsing data out)

If the file wasn't so large, I could just do something like:

    cat logfile.log | awk {'print $4'} | sort | uniq -c

However, reading a 1.5TB file into memory just isn't going to work :)

That command chain ought to work as is -- even with a very large file -- because each process in the chain (except sort) processes the file data line by line. And although sort needs to process the entire file, it knows how to spill intermediate results to temporary files, which avoids memory exhaustion.

I'm not saying it will be fast. But it should work.
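As a rough sketch of the point above, the same chain can be written without the extra cat and with sort's temporary directory made explicit. The function name count_field4 and its optional second argument are illustrative assumptions, not part of the original command. -T is the option GNU and BSD sort use to choose where temporary files go, and LC_ALL=C is only a common speed-up that also switches the ordering to plain byte order.

```shell
# Hypothetical helper (not from the original post):
# count how often each value of whitespace-separated field 4 occurs.
# Usage: count_field4 LOGFILE [SCRATCH_DIR]
count_field4() {
    logfile=$1
    # Where sort may write its temporary files; falls back to $TMPDIR or /tmp.
    scratch=${2:-${TMPDIR:-/tmp}}

    # awk and uniq stream line by line; only sort has to see all of the data,
    # and it spills to files under $scratch when the data exceeds its buffer.
    LC_ALL=C awk '{ print $4 }' "$logfile" \
        | LC_ALL=C sort -T "$scratch" \
        | uniq -c
}

# Example call (paths are placeholders):
# count_field4 logfile.log /path/to/big/scratch > resultsFile
```

The scratch directory needs enough free space for sort's temporary files, which in the worst case is on the order of the extracted column's total size.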

However, something like this should also do the trick and be substantially faster (~60 hours; originally misstated as ~1.25 hours):

perl -anle"++$h{ $F[ 4 ] } }{ print qq[$h{ $_ } $_] for sort keys %h" theLogFile > resultsFile

Update: You probably need $F[3] instead. awk's field numbers are one-based ($0 is the whole line), while Perl's @F array is zero-based, so awk's $4 corresponds to $F[3].
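For comparison, here is a minimal sketch of the same counting idea (tally each value in a hash, then sort only the distinct keys) written in awk rather than Perl. It is an illustration, not the one-liner above: the function name count_field4_hash is made up, and it assumes the number of distinct field values is small enough to fit in memory. Note also that the one-liner's double quotes suit a Windows command prompt; in a POSIX shell the script would need single quotes so that the shell does not expand $h and $F.

```shell
# Hypothetical helper (not from the original post):
# hash-based count of whitespace-separated field 4, one pass over the file.
# Usage: count_field4_hash LOGFILE
count_field4_hash() {
    # count[] holds one entry per distinct value, so memory grows with the
    # number of distinct values, not with the size of the file.
    awk '{ count[$4]++ }
         END { for (k in count) print count[k], k }' "$1" \
        | LC_ALL=C sort -k2   # sort the small result set by value
}

# Example call (file names are placeholders):
# count_field4_hash theLogFile > resultsFile
```

Because only the distinct values are sorted at the end, this avoids sorting every line of the input, which is where the speed-up over the sort | uniq -c chain comes from.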


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Working with a very large log file (parsing data out)
by tmharish (Friar) on Feb 20, 2013 at 08:18 UTC
    ... and be substantially faster (~1.25 hours)

    How did you figure the time?

      By running it on a 5.4GB logfile -- that took 12.5 minutes -- and then scaling: 1.5TB / 5.4GB = ~285; 285 * 12.5 minutes = ~3555 minutes; 3555 / 60 = ~59.2 hours; + (a bit for contingency) = 75.

      And then making the mistake of treating that 75 as minutes instead of hours (75 minutes = 1.25 hours)!

      Thank you for the heads up, I'll correct the above!
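The scaling above can be checked with a few lines of shell. This is only a sketch: estimate_hours is a made-up name, the estimate assumes run time grows linearly with file size, and 1.5TB is taken as 1536GB (binary units), which is an assumption about how the figure was derived.

```shell
# Hypothetical helper (not from the original post):
# linear extrapolation of run time from a timed sample.
# Usage: estimate_hours SAMPLE_GB SAMPLE_MINUTES TOTAL_GB
estimate_hours() {
    awk -v sample_gb="$1" -v sample_min="$2" -v total_gb="$3" 'BEGIN {
        minutes = total_gb / sample_gb * sample_min   # scale the sample run
        printf "%.1f\n", minutes / 60                 # convert minutes to hours
    }'
}

# 5.4GB took 12.5 minutes; extrapolate to 1.5TB (assumed to be 1536GB).
estimate_hours 5.4 12.5 1536
```

This gives a little over 59 hours, in line with the corrected figure; the result is in hours, not minutes.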


