PerlMonks  

Re^13: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

by BrowserUk (Pope)
on Jun 18, 2015 at 23:33 UTC (#1131088)


in reply to Re^12: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
in thread Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

quick glance of the source

You didn't look closely enough.

This regex:

q{([^\s]*)\s+([^\s]*)\s+([^\s]*)\s+\[(([^: ]+):([^ ]+) ([-+0-9]+))\]\s+"(([^\s]+) ([^\s]+)( ([^\s"]*))?)"\s+([^\s]*)\s+([^\s]*)};

Won't match "-", because it expects and requires at least two space delimited fields within the quotes; and allows for a third.
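A quick way to check that claim is to run the pattern against the 408 line quoted later in this thread. This is only a sketch: it assumes the "\s ++" in the listing is a line-wrap marker and the module really reads "\s+"; the "frank" line is the stock sample from the Apache docs, not from this thread; and matches_clf is a made-up helper name.

```perl
use strict;
use warnings;

# Pattern as quoted above, assuming "\s ++" is a wrapped "\s+".
my $re = q{([^\s]*)\s+([^\s]*)\s+([^\s]*)\s+\[(([^: ]+):([^ ]+) ([-+0-9]+))\]\s+"(([^\s]+) ([^\s]+)( ([^\s"]*))?)"\s+([^\s]*)\s+([^\s]*)};

# Hypothetical helper: 1 if the line matches the quoted pattern, else 0.
sub matches_clf {
    my ($line) = @_;
    return $line =~ /$re/ ? 1 : 0;
}

my $ok  = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326';
my $bad = '10.54.33.35 - - [18/Jun/2015:09:05:55 -0700] "-" 408 0';

# Report which of the two sample lines the pattern accepts.
printf "%-8s %s\n", ( matches_clf($_) ? 'match' : 'no match' ), $_ for $ok, $bad;
```

The second line fails because the part between the quotes has to contain at least one space, and "-" has none.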

Note also that both ID fields are expected to match [^\s]* (I guess he's not aware of \S; and it should at least be + not *; which could be an indication of his Perl experience).
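To make the \S and + versus * point concrete, here is a small sketch. The sample strings are mine, picked only to show the difference; nothing here comes from the module itself.

```perl
use strict;
use warnings;

# [^\s] and \S are the same character class: anything but whitespace.
my @samples = ( 'frank', '-', '' );

for my $s (@samples) {
    my $star_class = $s =~ /^[^\s]*$/ ? 1 : 0;    # as written in the quoted regex
    my $star_short = $s =~ /^\S*$/    ? 1 : 0;    # same class, shorter spelling
    my $plus       = $s =~ /^\S+$/    ? 1 : 0;    # insists on at least one character
    printf "%-7s [^\\s]*=%d \\S*=%d \\S+=%d\n", "'$s'", $star_class, $star_short, $plus;
}
```

The two * forms agree on every sample; only the + form rejects the empty string.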

So, a "proper parser" would break. Maybe it has a back-up plan for if the regex fails; but equally, it's simple to code a back up plan for the white space split also.
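For what it's worth, one possible back-up plan for the white space split might look like the sketch below. It assumes plain Common Log Format (host, ident, user, bracketed time, quoted request, status, bytes) with no spaces in the first three fields; parse_by_split and the hash keys are names I made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Common Log Format line on whitespace and
# cope with a request field that is just "-".
sub parse_by_split {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f >= 8;    # too short to be a log line

    my %rec;
    @rec{qw(host ident user)} = splice @f, 0, 3;
    $rec{time} = join ' ', splice @f, 0, 2;    # "[date" and "zone]"
    $rec{time} =~ tr/[]//d;                    # drop the brackets
    $rec{bytes}  = pop @f;
    $rec{status} = pop @f;

    # Whatever is left is the quoted request, however many words it has.
    my $request = join ' ', @f;
    $request =~ s/^"|"$//g;
    $rec{request} = $request;

    # Special case: "-" carries no method, URI or protocol.
    @rec{qw(method uri proto)} = $request eq '-' ? () : split( ' ', $request, 3 );
    return \%rec;
}

my $rec = parse_by_split('10.54.33.35 - - [18/Jun/2015:09:05:55 -0700] "-" 408 0');
print "status=$rec->{status} request=$rec->{request}\n";
```

Peeling fields off both ends and treating the middle as the request means the word count inside the quotes stops mattering.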

So let's review:

  1. The OP asked about using pack & unpack, and a couple of early responders posted positive-sounding confirmations.
  2. I countered by informing him that pack & unpack were completely inappropriate for the task; and suggested split as a starting point in his "personal learning experience".
  3. You pop up and, rather than trying to help the OP, you attempt to pick holes in my post; despite that its purpose was to save the OP wasting time with pack & unpack.
  4. So, I reminded you: "He did ask for a learning exercise; not a pre-solved solution.".
  5. So you come back with this guess: "(or if Apache really does go to some pains to make sure spaces never show up in the various log fields -- say by always representing them as + or %20 -- then yay, but I'm not sure this is actually true.)".

    Which is demonstrably wrong!

  6. You retort with: "which says nothing about logname and user,".

    Look at the regex above! Wrong again.

  7. And "nor does it guarantee that the HTTP command field always consists of exactly 3 space-separated components ".

    Also wrong!

  8. So then you throw "10.54.33.35 - - [18/Jun/2015:09:05:55 -0700] "-" 408 0" into the mix.

    And, as I've shown above, that would (without special handling) break most pre-solved solutions; which I'll remind you: the OP explicitly didn't want.

    And which could just as easily be handled by a special case with the split version.

    You know, as a part of the personal learning experience!

    A big part of which might be that having tried it for himself; he'd decide to opt for a pre-solved solution.

    Or he might decide to write his own CPAN module that does it better than any of the existing ones.

    That's his choice.

    All I did was short circuit his learning, by informing him that pack & unpack were definitely the wrong tools to start with.
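As a rough illustration of why pack & unpack are a poor fit here: unpack templates describe fixed-width layouts, and log fields are variable width. The template below is hypothetical, tuned by hand to the first sample line only, and first_three is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical fixed-width template that happens to fit the first line:
# 9 chars of host, skip 1, 1 char of ident, skip 1, 5 chars of user.
my $template = 'A9 x A1 x A5';

# Return the first three "fields" as unpack sees them.
sub first_three { return [ unpack $template, $_[0] ] }

my @lines = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326',
    '10.54.33.35 - - [18/Jun/2015:09:05:55 -0700] "-" 408 0',
);

# The second line has a longer host, so every field after it is misaligned.
print join( '|', @{ first_three($_) } ), "\n" for @lines;
```

As soon as the host is a different length, the same template cuts the fields in the wrong places.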

So, here we are 13 levels deep; and you've become boring. No attempt to help the OP; just banging on about stuff it seems you barely understand.

So, I'm bored and done. 'Twas fun.

Update: I forgot this little gem. You offered this wishy-washy suggestion "or using Text::CSV or somesuch"; but then later suggest that split will break because "which says nothing about logname and user,"; completely oblivious to the fact that if either ID contained spaces; it would break that module also!


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

Replies are listed 'Best First'.
Re^14: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by wrog (Friar) on Jun 19, 2015 at 02:13 UTC

    What a masterpiece of projection (*). Likewise for the self-aggrandizing rewrite of history. Anyway, congratulations on earning my very first downvote.

    Just to make sure I wasn't going completely crazy, I ran Apache::Log::Parser on my server log and it correctly parsed all of the 408 lines just fine, in both fast and strict modes.

    Which means the regexp you cited is evidently not used all of the time. The "proper parser" does not, in fact, break, and the module author (unlike you) evidently did put a fair amount of thought into this -- as was clear enough from my first read of the code.

    Re "You pop up and rather trying to help the op; you attempt to pick holes in my post; despite that its purpose was to save the OP wasting time with pack & unpack." Actually no. I was simply pointing out that split is fine if you're REALLY SURE about the delimiter character. But in this context, there's good reason NOT to be and therefore split is likely to be almost as big a waste of time as pack. If you want to learn how regexps work, then just learn them already. If you don't, then use Text::CSV. Both observations intended to help the OP.

    And nothing you've said has invalidated either of those observations.

    Also note that at NO point did the OP indicate any aversion to using existing packages -- "learning exercise" does not always mean "wanting to write everything from scratch".

    Re: logname and user. I don't specifically know what happens in Apache if the Authn handler passes back a username with a space in it -- quoting it with '%20' or '+' actually would be a good idea -- which I indicate by saying things like "I don't know for sure". But apparently you don't know either; how else to explain why you keep quoting the HTTP and URI specs at me as if they had anything to do with this? Yes, Text::CSV will fail on this case; I'll give you that (oh, no I made a mistake -- the world will end now).
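    To show what is at stake, here is a sketch with an invented line. Whether Apache would ever write an unescaped space into the user field is exactly the open question above, so treat the input as hypothetical; user_field is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical line: what an entry might look like IF an unescaped space
# ever reached the user field. Not taken from a real log.
my $line = '10.0.0.1 - john smith [18/Jun/2015:09:05:55 -0700] "GET / HTTP/1.1" 200 512';

# A plain whitespace split shifts every field after the user by one.
my @by_split = split ' ', $line;
print "split sees field 4 as: $by_split[3]\n";

# Anchoring on the bracketed timestamp keeps the user field in one piece.
sub user_field {
    my ($l) = @_;
    return $l =~ /^\S+ \S+ (.*?) \[[^\]]+\] "/ ? $1 : undef;
}
print "anchored regex sees user as: ", user_field($line), "\n";
```

    Counting fields from the front goes wrong on such a line, while a pattern that anchors on the timestamp does not care how many words the user field has.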

    (*) we'll let the peanut gallery judge who in this thread is deathly afraid of being occasionally wrong about stuff.

      Also note that at NO point did the OP indicate any aversion to using existing packages

      Apparently you didn't even read the title: without All-In-One Modules from CPAN.

      we'll let the peanut gallery judge who in this thread is deathly afraid of being occasionally wrong about stuff.

      Really? Is that what you think? Wrong again. Take a look: http://perlmonks.com/?node_id=3989;BIT=Sorry.;a=browseruk;Wi;M

      I'm just bored with YAA; who's made zero attempt to talk to OP.

      Had you put as much effort into trying to help the OP; as you have trying to prove that split isn't the only way to tackle the problem(*); you'd have achieved something worthwhile.

      *But that was obvious from the start

      As is, all you've done is waste everyone's time.


        wait, I thought you were bored and going away. what happened?
        3. So, I reminded you: "He did ask for a learning exercise; not a pre-solved solution.".

        At no point did I provide a pre-solved solution, I simply indicated directions to pursue, just like you did. Except I provided 2 different ones that were more likely to be useful. Sorry about that; I'll try to do worse next time.

        4. So you come back with this guess: "(or if Apache really does go to some pains to make sure spaces never show up in the various log fields -- say by always representing them as + or %20 -- then yay, but I'm not sure this is actually true.)".
        Which is demonstrably wrong!

        Which inexplicably links to the HTTP/URI specs, which as I have repeatedly pointed out has nothing to do with this. (I'm beginning to wonder if I know more about the innards of Apache than you do)

        5. You retort with: "which says nothing about logname and user,".

        Look at the regex above! Wrong again.

        Right. The regexp which doesn't always get used (which you'd know if you'd bothered to read the code at least as carefully as I did, which I'll admit wasn't all that carefully, but sufficed...).

        6. And "nor does it guarantee that the HTTP command field always consists of exactly 3 space-separated components ".
        Also wrong!

        Oh look, another link to the HTTP spec. Meanwhile, I provide not one but TWO sources of counterexamples (script kiddies, and 408 lines, neither of which are particularly rare) that could lead to the HTTP command in the log file not consisting of 3 space-separated components -- the sorts of things that would screw up a split solution ... that the OP would most likely want to know about (and are hence worth mentioning)

        7. So then you throw "10.54.33.35 - - [18/Jun/2015:09:05:55 -0700] "-" 408 0" into the mix.
        And, as I've shown above, that would (without special handling) break most pre-solved solutions;

        Yet another statement pulled out of your ass that turns out not to be true which then has to be corrected (again Apache::Log::Parser works just fine, thank you very much)

        which I'll remind you: the OP explicitly didn't want.

        Never mind that you are the only one in this thread who's been bringing up pre-solved solutions.

        So who exactly is wasting everybody's time again?

        This is exactly the sort of behavior that you decry in others. Heaven forbid anybody should call you out on your own bullshit.

      (*) we'll let the peanut gallery judge who in this thread is deathly afraid of being occasionally wrong about stuff.

      Hehehe ... what is there to fear about being wrong on the internet?

      Both observations intended to help the OP.

      So how come you haven't offered your observations to the OP?

        Um, I did, actually.

        It's true somebody else came in with the all-regexp solution. I don't do me-too posts.
