PerlMonks  

String Matching

by stevbutt (Novice)
on Aug 13, 2012 at 23:41 UTC ( #987241=perlquestion )
stevbutt has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

Please help with some wise and efficient string matching.

Input :

May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] I=[6.5.14.4]:25 P=esmtp S=13333 id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain T="Half price offer"

I want to grab the IP address (22.5.10.4, without the square brackets) and the email address (me@ours.co.uk, which always follows <=).

So far I have the IP address, but with the square brackets, using:

my ($srvrip) = $remainder =~ m/H=.+?(\[.+?\])/;

How can I extract the email address?

I have a lot of lines in the log files, so I need this to be as efficient as possible, and I am also restricted to Perl 5.8.4.

Hope you can help

Re: String Matching
by GrandFather (Cardinal) on Aug 14, 2012 at 01:12 UTC

    What have you tried?

    As an aside, don't fall for the "efficient as possible" tripe. Getting wrong answers fast is not generally considered a good solution. Work on getting the correct answers first, then (and only if the solution takes too long to run) consider how you can make it faster.

    True laziness is hard work

      This is just so true.
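A rough sketch of that "correct first, then measure" order, using the core Benchmark module. The two candidates are only illustrations: `by_regex` uses the pattern from a later reply in this thread, `by_split` assumes the field positions of the sample line, and both sub names are made up here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample line copied from the question (two spaces after "May").
my $line = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
         . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
         . 'I=[6.5.14.4]:25 P=esmtp S=13333 '
         . 'id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain '
         . 'T="Half price offer"';

# Candidate 1 (hypothetical name): one regex over the whole line.
sub by_regex {
    my ($text) = @_;
    return $text =~ m/<=\s*(\S+)[^[]+\[([^\]]+)/;
}

# Candidate 2 (hypothetical name): whitespace split, then fixed field positions.
sub by_split {
    my ($text) = @_;
    my @f = split ' ', $text;
    return unless @f > 11;
    my ($email, $ip) = @f[9, 11];
    $ip =~ tr/[]//d;    # delete the literal square brackets
    return ($email, $ip);
}

# Step 1: correctness. Every candidate must give the known answer.
my @want = ('me@ours.co.uk', '22.5.10.4');
for my $candidate (\&by_regex, \&by_split) {
    my @got = $candidate->($line);
    die "wrong answer: @got\n" unless "@got" eq "@want";
}

# Step 2: only now compare speed (prints a comparison table).
cmpthese(100_000, {
    regex => sub { my @r = by_regex($line) },
    split => sub { my @r = by_split($line) },
});
```

The numbers will depend on the machine and the Perl build, so they are worth measuring on the real 5.8.4 box rather than guessing.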

Re: String Matching
by davido (Archbishop) on Aug 14, 2012 at 01:51 UTC

    m/<=\s*(\S+)[^[]+\[([^\]]+)/

    Here it is with nicer formatting and a basic explanation:

    m/
        <=\s*
        (\S+)        # Capture the email address following <=
        [^[]+\[      # Skip to the first subsequent square bracket.
        ([^\]]+)     # Capture until a closing bracket.
    /x

    You can tinker with it yourself here.

    The email address will be in $1 and the IP will be in $2, following a successful match.
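For anyone who wants to try it, here is a minimal sketch that runs the pattern against the sample line from the question. The helper name `parse_arrival` is made up for this example; a list-context match is used instead of `$1` and `$2` so that a failed match gives back an empty list rather than stale capture values.

```perl
use strict;
use warnings;

# Sample line copied from the question (two spaces after "May").
my $line = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
         . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
         . 'I=[6.5.14.4]:25 P=esmtp S=13333 '
         . 'id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain '
         . 'T="Half price offer"';

# parse_arrival() is a hypothetical helper name, not something from the thread.
# In list context it returns (email, ip), or an empty list when nothing matches.
sub parse_arrival {
    my ($text) = @_;
    return $text =~ m/<=\s*(\S+)[^[]+\[([^\]]+)/;
}

my ($email, $ip) = parse_arrival($line);
if (defined $email) {
    print "Email: $email\nIP: $ip\n";
}
else {
    warn "No match: $line\n";
}
```

In a real log loop the same call would sit inside `while (my $line = <$fh>) { ... }`.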

    Update: Silly me for trusting the OP's spec. Kenosis mentioned to me that the exim record could, in addition to <=, also contain any of ==, **, =>, *>, ->, and possibly some others. So the <= anchor is probably not ideal, but could be improved upon with (?:<=|==|\*\*|=>|\*>|->) (plus whatever others are legal).
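A sketch of how that broader anchor might look if the flag itself is also wanted. Assumptions: the flag list is just the one named in the update (it may be incomplete), the flag is surrounded by whitespace, the helper name `parse_flagged` is invented, and the second log line is made-up illustration data rather than real exim output.

```perl
use strict;
use warnings;

# Flags named in the update above; possibly incomplete for a given exim version.
my $flag = qr/(?:<=|==|\*\*|=>|\*>|->)/;

# parse_flagged() is a hypothetical helper: returns (flag, address, bracketed text),
# or an empty list when the line does not match.
sub parse_flagged {
    my ($text) = @_;
    return $text =~ m/\s($flag)\s+(\S+)[^[]+\[([^\]]+)/;
}

my @lines = (
    # Sample line from the question.
    'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
      . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
      . 'I=[6.5.14.4]:25 P=esmtp S=13333',
    # Made-up delivery line, for illustration only.
    '2012-07-03 07:06:16 1SPPtO-0004en-PS => you@example.com '
      . 'R=dnslookup T=remote_smtp H=mx.example.com [192.0.2.1]',
);

for my $line (@lines) {
    my ($type, $address, $ip) = parse_flagged($line);
    next unless defined $type;
    print "$type $address $ip\n";
}
```

Whether the first bracketed field after the address is always the host IP for every flag type is something to check against real log lines.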


    Dave

      Thanks Dave,

      The spec is correct. This is already in an if/elsif statement where we know whether we are dealing with ==, ** etc. So what you have shown me is just perfect,

      many thanks

      Steve

        Fantastic! My faith in humanity is restored. ;) ...and I'm glad it worked for you.


        Dave

Re: String Matching
by rpnoble419 (Pilgrim) on Aug 14, 2012 at 07:08 UTC
    If the layout is fixed (that is, if the data changes but the position of the data does not change), then try this:

    # Note the two spaces after "May", as in the original log line:
    # split(/ /) yields an empty field there, which the indexes below rely on.
    $_ = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] I=[6.5.14.4]:25 P=esmtp S=13333 id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain T="Half price offer"';
    my @data  = split(/ /);
    my $Email = $data[10];
    my $IP    = $data[12];
    $IP =~ s/\[//g;
    $IP =~ s/\]//g;
    print "Email: $Email\n";
    print "IP: $IP\n";

    As you are limited to Perl 5.8.4, regexes are not as fast as in 5.10 and up, so I would try to limit the data I perform a regex on. You never know what will change and cause your program to bomb (usually at 3:00am on a Sunday morning). I would split your data into its many parts and then run whatever regex you need on a smaller data chunk. For the email you don't even need a regex. The square brackets can be removed in any number of ways; I chose the lazy way in my example.
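One way to make the split approach less dependent on exact positions is to look for the `<=` field and work relative to it. This is only a sketch: the helper name `extract_by_fields` is invented, and it assumes the address is the field right after `<=` and the IP is the first later field fully wrapped in square brackets.

```perl
use strict;
use warnings;

# Sample line copied from the question (two spaces after "May").
my $line = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
         . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
         . 'I=[6.5.14.4]:25 P=esmtp S=13333 '
         . 'id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain '
         . 'T="Half price offer"';

# extract_by_fields() is a hypothetical helper, not code from the thread.
sub extract_by_fields {
    my ($text) = @_;
    my @f = split ' ', $text;    # awk-style split: runs of whitespace count once

    for my $i (0 .. $#f - 1) {
        next unless $f[$i] eq '<=';
        my $email = $f[$i + 1];

        # Take the first later field that is entirely wrapped in [ ].
        for my $j ($i + 2 .. $#f) {
            my $field = $f[$j];
            next unless length($field) > 2;
            next unless substr($field, 0, 1) eq '[' && substr($field, -1) eq ']';
            return ($email, substr($field, 1, -1));
        }
        return;    # arrow found, but no bracketed field after it
    }
    return;        # no "<=" field at all
}

my ($email, $ip) = extract_by_fields($line);
print "Email: $email\nIP: $ip\n" if defined $email;
```

Because `split ' '` collapses repeated whitespace, the padded day of month ("May  2" versus "May 12") no longer shifts the fields.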
Re: String Matching
by 2teez (Priest) on Aug 14, 2012 at 07:43 UTC
    Hi,

    If your logfile has its data with fixed "width", then the unpack function can really come in handy! And you really wouldn't have to bother about which Perl version you are using. See this:

    use warnings;
    use strict;

    # The offsets assume two spaces after "May", as in the original log line.
    my $str = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] I=[6.5.14.4]:25 P=esmtp S=13333 id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain T="Half price offer"';
    my ( $e_mail, $ip ) = unpack "x82A13x21A9", $str;
    print "EMAIL: ", $e_mail, "\nIP: ", $ip, $/;

    # OR
    while (<DATA>) {
        my ( $e_mail, $ip ) = unpack "x82A13x21A9", $_;
        print "EMAIL: ", $e_mail, "\nIP: ", $ip, $/;
    }
    __DATA__
    May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] I=[6.5.14.4]:25 P=esmtp S=13333 id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain T="Half price offer"
    OUTPUT
    EMAIL: me@ours.co.uk
    IP: 22.5.10.4

    Check perldoc perlpacktut for more info.

    UPDATE: Oops, my bad! As Kenosis pointed out, if the length of the fields to be extracted varies, then the unpack function will NOT work.
    However, I had mentioned previously that the logfile's data MUST have a FIXED WIDTH.
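To see what happens when the width assumption breaks, here is a small sketch that wraps the same template in a sanity check. The helper name `unpack_fixed` and the two validation patterns are assumptions made for this example; they are deliberately loose and will not catch every shifted line.

```perl
use strict;
use warnings;

# Sample line copied from the question (two spaces after "May").
my $line = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
         . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
         . 'I=[6.5.14.4]:25 P=esmtp S=13333 '
         . 'id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain '
         . 'T="Half price offer"';

# unpack_fixed() is a hypothetical helper. It returns (email, ip) only when
# the fixed-width slices look plausible, and an empty list otherwise.
sub unpack_fixed {
    my ($text) = @_;

    # The template consumes 82 + 13 + 21 + 9 = 125 characters; unpack dies
    # if an "x" skip runs past the end of the string, so check first.
    return if length($text) < 125;

    my ($email, $ip) = unpack "x82A13x21A9", $text;

    # Loose plausibility checks (assumptions, not a full validation).
    return unless $email =~ /^\S+\@\S+$/;
    return unless $ip =~ /^\d+(?:\.\d+){3}$/;
    return ($email, $ip);
}

# Same line with a longer sender address: every later offset shifts.
(my $shifted = $line) =~ s/me\@ours/someone\@ours/;

for my $text ($line, $shifted) {
    my ($email, $ip) = unpack_fixed($text);
    print defined $email ? "EMAIL: $email\nIP: $ip\n" : "layout changed, skipping\n";
}
```

With the longer address the slices no longer line up, so the check rejects the line instead of returning a truncated address.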

Re: String Matching
by linuxkid (Sexton) on Aug 14, 2012 at 15:15 UTC

    remember that perl doesn't do greedy matching, so, try:

    /.*<=(.*)\s*\[(\d?\d\d\.\d?\d\d\.\d?\d\d\.\d?\d\d).*/

    $1 will be the email, and $2 will be the ip.
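A quick check against the sample line from the question: each octet is written as `\d?\d\d`, which needs at least two digits, so an address such as 22.5.10.4 cannot match. The sketch below runs the pattern as posted next to one possible adjustment; the adjusted pattern is only an illustration (`\S+` for the address, `\d{1,3}` per octet) and not something proposed in the thread.

```perl
use strict;
use warnings;

# Sample line copied from the question (two spaces after "May").
my $line = 'May  2 04:06:15 lon.mail.net exim[17905]: 2012-07-03 07:06:15 '
         . '1SPPtO-0004en-PS <= me@ours.co.uk H=smtpout.mail.com [22.5.10.4] '
         . 'I=[6.5.14.4]:25 P=esmtp S=13333 '
         . 'id=6aeca3b79b8892d6105dab131c76f066@localhost.localdomain '
         . 'T="Half price offer"';

# The pattern exactly as posted above.
my $posted = qr/.*<=(.*)\s*\[(\d?\d\d\.\d?\d\d\.\d?\d\d\.\d?\d\d).*/;

# A hypothetical adjustment: \S+ for the address, one to three digits per octet.
my $adjusted = qr/<=\s*(\S+).*?\[(\d{1,3}(?:\.\d{1,3}){3})\]/;

print "posted pattern:   ", ($line =~ $posted ? "match" : "no match"), "\n";

if (my ($email, $ip) = $line =~ $adjusted) {
    print "adjusted pattern: $email $ip\n";
}
```

The `\d{1,3}` form still accepts values above 255; that is usually fine for pulling an address out of a log line, less so for validating one.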

    --linuxkid


    imrunningoutofideas.co.cc
      Just for the record, the standard quantifiers in Perl regular expressions are, indeed, greedy.

        They are greedy, but not too clever

        my $str = "xaxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
        $str =~ /(a+)/;
        print "$1\n"; # prints "a" not "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
        The quantifiers are greedy, but they refuse to let go of something they found unless forced even if they could get more someplace later.
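        A short sketch of the same point with a smaller string: the leftmost starting position is settled first, and greedy versus non-greedy only decides how far the match runs from there. Variable names are made up for the example.

```perl
use strict;
use warnings;

my $str = 'xaxaaaa';

# Leftmost wins: the match starts at the first position where a+ can match at all.
my ($first) = $str =~ /(a+)/;

# Greedy vs. non-greedy only matters once the starting position is fixed.
my ($greedy) = 'aaaa' =~ /(a+)/;
my ($lazy)   = 'aaaa' =~ /(a+?)/;

# To get the longest run anywhere in the string, collect every run and compare lengths.
my @runs = $str =~ /(a+)/g;
my ($longest) = sort { length($b) <=> length($a) } @runs;

print "first=$first greedy=$greedy lazy=$lazy longest=$longest\n";
# prints: first=a greedy=aaaa lazy=a longest=aaaa
```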

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.
