
webserver common log format file script

by panagiotis (Initiate)
on Sep 06, 2011 at 11:56 UTC
panagiotis has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys! I am new to Perl scripting, and I have to say I am amazed too! I need a little help with my Perl script. I am trying to data-mine a log file in Common Log Format, and I preprocess it with a Perl script. I use this code to ignore some URLs:

 if ($values[6]!~m /(css|gif|png|jpg|swf|CSS|GIF|PNG|JPG|JS|js|ico|ICO|txt|TXT)$/ && ($values[6]!~m/(cmd.exe|root.exe|default.ida)/) && (($values[8]=~m/(200)/) || ($values[8]=~m/(304)/) || ($values[8]=~m/(302)/) || ($values[8]=~m/(301)/)))

but I don't know how to ignore URLs like this one:

 /index.php?option=com_content&task=category&sectionid=4&id=14&Itemid=28

or this one:

 /icte/viewtopic.php?f=7&t=19&start=0&st=0&sk=t&sd=a&sid=9d1ecb96bad6e27bae6468d3bf482686

Do you have any ideas? Thank you in advance! :)

Re: webserver common log format file script
by moritz (Cardinal) on Sep 06, 2011 at 12:13 UTC
    What makes these URLs harder for you to ignore than the others you are already ignoring?

    Assuming that $values[6] holds the requested URL, the regex isn't restricted to file endings; you can just as well match m{^/icte/viewtopic\.php} or so.
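
    As a minimal sketch of that idea, the snippet below tests a URL against a list of path prefixes. The list @ignored_prefixes and the helper is_ignored are hypothetical names chosen for illustration, and treating everything under /index.php as ignorable is only an assumption about what the question wants.

```perl
use strict;
use warnings;

# Hypothetical list of path prefixes to ignore; adjust it to your own site.
my @ignored_prefixes = ('/index.php', '/icte/viewtopic.php');

# Build one alternation; quotemeta escapes the dots so they match literally.
my $alternation = join '|', map { quotemeta($_) } @ignored_prefixes;

# The path must start the URL and be followed by a query string or the end.
my $ignore_re = qr{\A(?:$alternation)(?:\?|\z)};

# Returns 1 if the URL should be skipped, 0 otherwise.
sub is_ignored {
    my ($url) = @_;
    return $url =~ $ignore_re ? 1 : 0;
}

my @urls = (
    '/index.php?option=com_content&task=category&sectionid=4&id=14&Itemid=28',
    '/icte/viewtopic.php?f=7&t=19&start=0&st=0&sk=t&sd=a&sid=9d1ecb96bad6e27bae6468d3bf482686',
    '/articles/perl.html',
);

for my $url (@urls) {
    printf "%-7s %s\n", (is_ignored($url) ? 'ignore' : 'keep'), $url;
}
```

    In the original condition this would become one more clause, along the lines of !is_ignored($values[6]).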

Re: webserver common log format file script
by ww (Bishop) on Sep 06, 2011 at 12:22 UTC

    We may need a little more help from you before we can help you. Tell us: what are the criteria by which a log entry should be ignored?

    Working that one out, explicitly, may even give you the answer. :-)

    That said, I suggest you replace:

    || ($values[8]=~m/(304)/) || ($values[8]=~m/(302)/) || ($values[8]=~m/(301)/)

    with

    || ($values[8]=~m/(30[124])/)

    Using a character class, which essentially means 'any one of the characters inside the square brackets', will make your code more readable (and hence more maintainable).
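
    A small sketch of the character-class version follows. The helper name is_wanted_status is hypothetical, and the \A and \z anchors are an addition that is not part of the suggestion above: without them, an unanchored pattern such as /(200)/ would also match a field like "2000".

```perl
use strict;
use warnings;

# Accept 200, 301, 302 and 304. The anchors \A and \z force the whole
# field to match, so a value such as "2000" or "1304" is rejected.
sub is_wanted_status {
    my ($status) = @_;
    return $status =~ /\A(?:200|30[124])\z/ ? 1 : 0;
}

for my $status (qw(200 301 302 303 304 404 2000)) {
    printf "%-4s %s\n", $status, (is_wanted_status($status) ? 'keep' : 'skip');
}
```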

Re: webserver common log format file script
by chrestomanci (Priest) on Sep 06, 2011 at 13:16 UTC

    Is this the log file from a well-known web server?

    If so, I would look on CPAN for existing log file parsing code, such as Apache::ParseLog.

    Others have come before you and have solved this problem. There should be no need to re-invent wheels or get caught by all the gotchas that they have found and worked around.

Re: webserver common log format file script
by Marshall (Prior) on Sep 06, 2011 at 14:57 UTC
    The indices you use into @values don't seem to match the standard field layout of Common Log Format. You may be splitting on spaces instead of using a better regex. This parsing problem has already been solved. Chapter 8, "Parsing Web Access Logs", of Perl for Web Site Management is from 2001, but much of it is still relevant. It gives a good explanation of the author's regex, and it also explains how to search CPAN for modules that might be appropriate for your task.
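
    For illustration only, here is one way a regex for a Common Log Format line might look. It is a sketch, not the regex from the book mentioned above; the names $clf_re and parse_clf_line are hypothetical, and the sample line is made up in the usual "host ident authuser [timestamp] "request" status bytes" layout.

```perl
use strict;
use warnings;

# One Common Log Format line:
#   host ident authuser [timestamp] "request" status bytes
my $clf_re = qr{
    \A
    (\S+) \s+            # remote host
    (\S+) \s+            # identity, usually a dash
    (\S+) \s+            # authenticated user, usually a dash
    \[ ([^\]]+) \] \s+   # timestamp, without the surrounding brackets
    " ([^"]*) " \s+      # request line, without the surrounding quotes
    (\d{3}) \s+          # HTTP status code
    (\d+|-)              # response size in bytes, or a dash
}x;

# Returns a hash reference of named fields, or nothing if the line does not match.
sub parse_clf_line {
    my ($line) = @_;
    my ($host, $ident, $user, $time, $request, $status, $bytes) = $line =~ $clf_re
        or return;
    # The request line is normally "METHOD URL PROTOCOL".
    my ($method, $url, $protocol) = split ' ', $request;
    return {
        host   => $host,   time   => $time,   method => $method,
        url    => $url,    status => $status, bytes  => $bytes,
    };
}

my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326';
my $entry  = parse_clf_line($sample);
print "$entry->{url} $entry->{status}\n" if $entry;
```

    With named fields, the filter can test $entry->{url} and $entry->{status} instead of numeric indices.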

    In Perl, it is not necessary to write things like $values[ 8 ]. Instead, Perl has list slices. See "Slices" in http://perldoc.perl.org/perldata.html. In other languages, you have to keep track of things like "the ninth thing", or that index 8 means the status; in Perl, just assign to a $status variable directly: ($status) = (split)[8]; This enhances readability.
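
    A short sketch of a list slice, assuming the line is split on whitespace and that fields 6 and 8 hold the URL and the status, as the indices in the question suggest. The sample line is made up.

```perl
use strict;
use warnings;

# Made-up sample line in Common Log Format.
my $line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326';

# A list slice picks fields 6 and 8 and gives them names in one step.
my ($url, $status) = (split ' ', $line)[6, 8];

print "url=$url status=$status\n";
```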

    Anyway, I would use some well-known code or a CPAN module to parse your log lines. It would also help if you explained a bit more about your troubles. You clearly understand the basics of regexes. Where are you going wrong?
