PerlMonks  

Strange Behavior while Parsing Sendmail logs

by Russ (Deacon)
on Jul 19, 2000 at 01:03 UTC ( #23061=perlquestion )

Russ has asked for the wisdom of the Perl Monks concerning the following question:

Jul 17 10:26:30 host@our.domain.edu sendmail[8436]:
e2HERfF08329: to="a-marsh@ussr.net" <a-marsh@ussr.net>,
delay=00:00:49, xdelay=00:00:01, mailer=esmtp, pri=889271,
relay=mxpool01.netaddress.ussr.net. [204.68.24.19],
stat=250 2.0.0 Sent (Mail accepted (634ejqoAE1429M05))

This log line (and others like it; all of them have double quotes in the email address) does not behave as I think it should. $_ contains the log line when I execute the following statement:

($to_addr = $_) =~ s/.* to=([^,]+), .*/$1/;
However, $to_addr does not end up containing the data following the 'to=' delimiter; it ends up containing the whole log line. It would seem that somehow the whole log line is getting matched by the ([^,]+) part of the regex, because that's what's getting substituted back in.

What's even stranger is that if I simulate this situation in the debugger (by entering the log line into a variable and performing the same substitution), it works exactly as expected.

The log line has been altered for obvious reasons. It is of course on a single line in its original form.

And yes, I know you can't technically write a regex to match an email address, but this *should* be close enough.

Thanks.

Replies are listed 'Best First'.
(chromatic) Re: Strange Behavior while Parsing Sendmail logs
by chromatic (Archbishop) on Jul 19, 2000 at 01:49 UTC
    I like this construct better, though an if block is easier to read:

    s/.* to=([^,]+), .*/$1/ && do { $to_addr = $_ };

    If the substitution succeeds, $to_addr is set to the new value of $_.
      Out of curiosity, what does phrasing it in that way buy you?

      I thought maybe you were gaining an advantage from the short circuit of && so I wrote a benchmark, but the times came back so close together that the difference is easily attributed to background processes. So I ran the deparser and found that perl converts this to an if(){} structure, which is (as you said) easier to read.

      E:\Projects>perl -MO=Deparse -we "my $s; $_ && do { $s = $_ }"
      my $s;
      if ($_) {
          $s = $_;
      }
      -e syntax OK
        My guilty admission is that I just like the way it looks better.
Re: Strange Behavior while Parsing Sendmail logs
by tye (Sage) on Jul 19, 2000 at 03:44 UTC
    • It would seem that somehow the whole log line is getting matched by the ([^,]+) part of the regex because that's what's getting substituted back in.

    Just in case this isn't clear yet: it isn't that the mentioned part of the regex is matching the whole line. The regex is failing and no substitution is taking place, so $to_addr keeps the unmodified copy of $_.
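To make the point above concrete, here is a minimal sketch rather than the poster's actual script. The two sample lines and the `extract_to` helper are invented for illustration; only the regex comes from the question. When the pattern fails, the copy made by `(my $to_addr = $line)` is simply left as it was.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this sketch (not from the real log).
my $with_to    = 'Jul 17 10:26:30 host sendmail[1]: abc: to=<user@example.com>, delay=00:00:49, stat=Sent';
my $without_to = 'Jul 17 10:26:30 host sendmail[1]: abc: from=<user@example.com>, size=100';

# Hypothetical helper: copy the line, then run the substitution on the copy.
sub extract_to {
    my ($line) = @_;
    (my $to_addr = $line) =~ s/.* to=([^,]+), .*/$1/;
    return $to_addr;
}

my $matched   = extract_to($with_to);     # only the address survives
my $unmatched = extract_to($without_to);  # substitution failed, copy is intact

print "matched:   $matched\n";
print "unmatched: $unmatched\n";
print "unmatched copy equals the input line\n" if $unmatched eq $without_to;
```

So a result that looks like "the whole line was captured" is more likely a sign that the pattern never matched at all.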

    It does no good to have an unanchored .* at both ends.

    This looks like one of the regex optimizer bugs. You should post it to comp.lang.perl.moderated or submit it with perlbug.

    I'd write this:

    if( $_ =~ / to=([^,]+), / ) {
        $to_addr = $1;
    } else {
        warn "Unmatched to= in line: $_";
    }

      • It does no good to have an unanchored .* at both ends.

      But it does, because regexps are greedy (by default). If they weren't there, the substitution would only replace the " to=..., " portion with $1 where it found it, leaving the rest of the line in place. But the .* on each side cause the whole line to be matched, and thus *only* the $1 is put back in. That said, it is no doubt safer to anchor such a regex.

        Sorry, you are right.

        I think I was thinking ahead to the change I made to just match against the line and just assign $1 to the variable. Mea culpa.
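For anyone following this sub-thread, a small sketch of the difference the `.*` on each side makes. The sample line is made up for illustration; the two patterns are the ones under discussion, with and without the surrounding `.*`.

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for this sketch.
my $line = 'host sendmail[1]: abc: to=<user@example.com>, delay=00:00:01';

# Without .* on either side: only the " to=..., " portion is replaced by $1.
(my $without = $line) =~ s/ to=([^,]+), /$1/;

# With .* on both sides: the match spans the whole line, so only $1 remains.
(my $with = $line) =~ s/.* to=([^,]+), .*/$1/;

print "without .*: $without\n";
print "with .*:    $with\n";
```

In this sketch the first form leaves the rest of the line in place around the address, while the second reduces the line to the address alone.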

Re: Strange Behavior while Parsing Sendmail logs
by le (Friar) on Jul 19, 2000 at 11:51 UTC
    hmmm, what about this one:
    $to_addr = $1 if /.+? to=([^,]+), .*/;
RE: Strange Behavior while Parsing Sendmail logs
by DrManhattan (Chaplain) on Jul 19, 2000 at 19:10 UTC

    I ran your regex over my sendmail logs and it behaved pretty much as expected. The only lines it missed were emails directed at multiple recipients, like so:

    Jul 19 02:43:29 zoom1 sendmail[26193]: CAA26174: to=<XXX@aol.com>,<YYY@aol.com>,<ZZZ@aol.com>, ctladdr=<XXX@TelePath.Com> (13408/40), delay=00:01:14, xdelay=00:00:01, mailer=esmtp, relay=zd.mx.aol.com. [152.163.224.101], stat=Sent (OK)

    Modifying the regex a bit cleared that up and I didn't get any more anomalous behavior. Here's the test code I used:

    #!/usr/bin/perl
    while (<STDIN>) {
        # Only match lines that have a " to=" in them.
        # The leading space is important because many
        # lines have a "proto="
        if (/ to=/) {
            #($to_addr = $_) =~ s/.* to=([^,]+), .*/$1/;
            ($to_addr = $_) =~ s/.* to=(.+?), .*/$1/;
            print "$to_addr";
        }
    }

    -Matt
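A short sketch of why the multi-recipient line slips past the original pattern. The log line is a shortened copy of the one quoted above; splitting the captured list on commas is an extra step assumed here for illustration, not something the test script does.

```perl
use strict;
use warnings;

# Shortened copy of the multi-recipient line quoted above.
my $multi = 'Jul 19 02:43:29 zoom1 sendmail[26193]: CAA26174: '
          . 'to=<XXX@aol.com>,<YYY@aol.com>,<ZZZ@aol.com>, '
          . 'ctladdr=<XXX@TelePath.Com> (13408/40), delay=00:01:14';

# [^,]+ stops at the first comma, which is not followed by a space here,
# so the pattern as a whole cannot match.
my ($strict) = $multi =~ / to=([^,]+), /;

# .+? is allowed to cross commas and stops at the first ", " it reaches.
my ($lazy) = $multi =~ / to=(.+?), /;

# Assumed extra step: break the captured list into individual addresses.
my @recipients = defined $lazy ? split(/,/, $lazy) : ();

print "strict: ", (defined $strict ? $strict : '(no match)'), "\n";
print "lazy:   $lazy\n";
print "count:  ", scalar(@recipients), "\n";
```

The lazy form relies on a comma followed by a space marking the end of the recipient list, which holds for the lines shown in this thread but is an assumption about sendmail's format in general.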

      What version of perl are you running? We are a little
      behind the curve (5.004) and I suspect that may be the
      reason.

      -Mark

        5.005_02

        -Matt

Re: Strange Behavior while Parsing Sendmail logs
by Odud (Pilgrim) on Jul 19, 2000 at 12:01 UTC
    Is there something unusual about how you are calling the script when it doesn't work? What version of perl are you using? I've pasted your examples and it works fine when I run it under NT at version 5.6.0 and under HP-UX 11 at version 5.004_04 - albeit with the data embedded at the end of the script. I know you said that the data is all one line but I once had trouble with output from mail handlers that were adding newlines every 72? characters - when I looked at this on the terminal it was hard to spot because they coincided with where the wrap would naturally occur.
Re: Strange Behavior while Parsing Sendmail logs
by lhoward (Vicar) on Jul 19, 2000 at 01:15 UTC
    Try.
    ($to_addr)=$_=~/......
    instead of what you have. I don't have time to go into the details of why right now. I will amend this note later with full details.
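Until those details arrive, a hedged sketch of what the list-assignment form does. The reply elides the pattern, so the one used here is borrowed from the original question as an assumption, and the sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this sketch.
my $line  = 'host sendmail[1]: abc: to=<user@example.com>, delay=00:00:01';
my $no_to = 'host sendmail[1]: abc: from=<user@example.com>, size=100';

# List context: the match returns its captures, so $to_addr receives $1.
my ($to_addr) = $line =~ / to=([^,]+), /;

# No match: the returned list is empty and the variable stays undefined.
my ($missing) = $no_to =~ / to=([^,]+), /;

# Scalar context, for contrast: only a true/false result comes back.
my $matched = $line =~ / to=([^,]+), /;

print "list context:   $to_addr\n";
print "scalar context: $matched\n";
print "no match gives undef\n" unless defined $missing;
```

One practical upside of this form is that a failed match leaves an undefined value to test for, instead of a variable that silently holds the whole line.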
RE: Strange Behavior while Parsing Sendmail logs
by flyfishin (Monk) on Jul 19, 2000 at 18:19 UTC
    Since we don't have the entire script I am taking a guess here. I assume you are opening the log and reading through each line like so:
    open FH, "mylog" etc....
    while (<FH>) { do these steps }

    If you have the regex you listed in the "do these steps" section, then as it is written $to_addr will get assigned the entire line, because you have parenthesized that part and set the precedence for the operations. You then run the regex on the line, but it never gets assigned to anything again. Try moving the parens so it looks like this:
    $to_addr = ($_ =~ rest of stuff);


    UPDATE
    Oops. Second code section is wrong as DrManhattan pointed out.
      $to_addr = ($_ =~ rest of stuff);

      That will set $to_addr to either 1 or 0 depending on whether the substitution succeeded or not.

      -Matt
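A quick sketch of that return value, using invented sample strings and the regex from the question: the parenthesized `s///` hands back a success indicator, while the extracted text ends up in the variable the substitution was applied to.

```perl
use strict;
use warnings;

# Hypothetical sample strings, invented for this sketch.
my $copy  = 'host sendmail[1]: abc: to=<user@example.com>, delay=00:00:01';
my $other = 'a line with no recipient field';

# s/// returns the number of substitutions made (1 here, without /g).
my $status = ($copy =~ s/.* to=([^,]+), .*/$1/);

# On failure s/// returns a false value and leaves the string alone.
my $failed = ($other =~ s/.* to=([^,]+), .*/$1/);

print "status: $status, copy is now: $copy\n";
print "second substitution did not match\n" unless $failed;
```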

Re: strange behavior while parsing sendmail logs
by Anonymous Monk on Jul 19, 2000 at 04:22 UTC
