PerlMonks  

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

by BrowserUk (Pope)
on Jun 17, 2015 at 04:35 UTC (#1130745)


in reply to Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

Apache log file formats (there are several) consist of space-delimited, variable-width fields:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

(The last two quoted strings may not be present.)

Every one of those fields, except possibly the date and status code, can vary in length; so pack and unpack, which are primarily designed for fixed-length fields and binary work, are the wrong tools for the job.

split is the more applicable tool for your task:

my @bits = split ' ', $line;
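For illustration, here is a minimal sketch of what that split produces for the sample line above. The variable names and the way the pieces are rejoined are my own assumptions, not a recommended parser; the point is only that the bracketed date and the quoted strings arrive spread over several elements.

```perl
use strict;
use warnings;

# Sample combined-format line from earlier in this post.
my $line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
         . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
         . '"http://www.example.com/start.html" '
         . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

# split ' ' (a single-space string) skips leading whitespace and
# splits on runs of whitespace.
my @bits = split ' ', $line;

# The first three fields come out whole; the date and the request
# each span several elements and have to be rejoined.
my ( $host, $logname, $user ) = @bits[ 0 .. 2 ];
my $time    = join ' ', @bits[ 3, 4 ];
my $request = join ' ', @bits[ 5 .. 7 ];
my ( $status, $size ) = @bits[ 8, 9 ];

print "$host | $user | $time | $request | $status | $size\n";
```

Anything after element 9 (referer and user agent, when present) still needs the same kind of rejoining, because the user-agent string contains spaces of its own.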

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^2: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by wrog (Friar) on Jun 17, 2015 at 04:58 UTC
    split is fine as long as the character you're splitting on does not occur in your various fields.

    If it can, which means there has to be a scheme for quoting such fields or escaping such characters, then we're pretty much beyond what split can do and at the point where you need to be building the Regular Expression From Hell (which can be plenty fast if you do it right, but you have to do it right) or using Text::CSV or somesuch.
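    To make the regex route concrete, here is a rough sketch. It assumes the Combined Log Format, assumes that a quote inside a quoted field is escaped with a backslash, and assumes host, logname and user contain no whitespace. The names $quoted, $log_re and %rec are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical building block: a double-quoted string in which
# backslash escapes are allowed; captures the text between the quotes.
my $quoted = qr/"((?:[^"\\]|\\.)*)"/;

# Hypothetical pattern for one Combined Log Format line. The trailing
# referer and user agent are optional.
my $log_re = qr/
    ^ (\S+) \s+                       # host
      (\S+) \s+                       # logname
      (\S+) \s+                       # user
      \[ ([^\]]+) \] \s+              # timestamp
      $quoted \s+                     # request line
      (\d{3}) \s+                     # status
      (\d+|-)                         # size
      (?: \s+ $quoted \s+ $quoted )?  # referer, user agent
    \s* \z
/x;

my $line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
         . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
         . '"http://www.example.com/start.html" '
         . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

my %rec;
if ( my @f = $line =~ $log_re ) {
    # Optional groups that did not take part in the match are undef.
    @rec{qw(host logname user time request status size referer agent)} = @f;
    print "$_: ", $rec{$_} // '(absent)', "\n" for sort keys %rec;
}
else {
    warn "unparsable line: $line\n";
}
```

    Unlike a bare split, this keeps the request line and the user agent in one piece each, at the price of a pattern that has to be right for every log format variant you feed it.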

      ...we're pretty much beyond what split can do...

      Mh, we know the format:

      use Data::Dump;
      use feature qw(say);

      my $line = qq(127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509);
      my @bits = split /\s/, $line;
      dd \@bits;

      say qq(Host: $bits[0]);
      say qq(Logname: $bits[1]);
      say qq(User: $bits[2]);
      say qq(Time: $bits[3] $bits[4]);
      say qq(Request: $bits[5] $bits[6] $bits[7]);
      say qq(Status: $bits[8]);
      say qq(Size: $bits[9]);

      __END__

      monks>apache.pl
      [
        "127.0.0.1",
        "-",
        "-",
        "[22/Apr/2015:13:35:04",
        "+1000]",
        "\"GET",
        "/bin/admin.pl",
        "HTTP/1.1\"",
        401,
        509,
      ]
      Host: 127.0.0.1
      Logname: -
      User: -
      Time: [22/Apr/2015:13:35:04 +1000]
      Request: "GET /bin/admin.pl HTTP/1.1"
      Status: 401
      Size: 509

      Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        Thanks for your reply!

        A quick question dealing with the internal workings of what you wrote:

        I understand that split can take any expression as its pattern and then operate on the scalar, but what are the more nuanced differences, particularly in memory usage and processing speed, if any, between using split and a general pattern match?

        Thanks!
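        One way to approach the speed half of this question is to measure it on your own data. The sketch below uses the core Benchmark module; the subroutine names by_split and by_match are invented for the example, and it says nothing about memory usage.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] '
         . '"GET /bin/admin.pl HTTP/1.1" 401 509';

# Two ways to pull the same whitespace-separated tokens out of a line.
sub by_split { my @f = split ' ', $line;  return scalar @f }
sub by_match { my @f = $line =~ /(\S+)/g; return scalar @f }

# Sanity check: both approaches must agree before timing means anything.
die "field counts differ\n" unless by_split() == by_match();

# A negative count asks Benchmark to run each sub for at least that
# many CPU seconds, then print a comparison table of rates.
cmpthese( -1, { split => \&by_split, match => \&by_match } );
```

        The table it prints depends on the Perl version and the machine, so treat any single run as a hint, not a verdict.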

        I'm not sure I'd want to bet my life that none of logname, user or the request URI can have spaces in them.

      He did ask for a learning exercise; not a pre-solved solution.

      Plus, chances are that he'll need to break the composite fields down further anyway, before he can do any analysis or storage.


        "...He did ask for a learning exercise..."

        Yes, but I didn't reply to the OP.

        "...break the composite fields down..."

        Yes, sure. Perhaps like this:

        karls-mac-mini:monks karl$ perl -E 'say split /[\[\]]/, qq([22/Apr/2015:13:35:04 +1000])'
        22/Apr/2015:13:35:04 +1000
        karls-mac-mini:monks karl$ perl -E 'say split /"/, qq("GET /bin/admin.pl HTTP/1.1")'
        GET /bin/admin.pl HTTP/1.1
        karls-mac-mini:monks karl$ perl -E 'say join "\t", split /\s/, qq(GET /bin/admin.pl HTTP/1.1)'
        GET /bin/admin.pl HTTP/1.1
        # etc.

        I just wanted to show wrog that a solution that only uses split is possible.

        Whether this is desirable is another question. I guess some may call it abuse.

        Edit: Better wording.

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
