finding emails

by bdawg613 (Initiate)
on Nov 11, 2002 at 22:25 UTC (#212124, perlquestion)
bdawg613 has asked for the wisdom of the Perl Monks concerning the following question:

I don't know if anyone can help, but I've posted all my code in case it's useful. I'm trying to find emails in a file. It works for all the formats I've specified, BUT only if each email is on a separate line. As soon as there are multiple emails on the same line in the text file, it "might" find the first occurrence (it usually does), but the rest of the emails on that line are ignored. This is my code; I hope someone can help:

    #open the 2 files for input and the second for output
    open (FIN, "@ARGV[0]") || (die "No input file");
    while(<FIN>){
        if(s#((\b\w*\b\s)(\b\w*\b\s)?(\b\w*\b\s)(\<)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\>))# $1 .($var =$2.$3.$4."\t".$6) #eg){}
        elsif(s#((\b\w*\b)(\,\s)(\b\w*\b\s)(\b\w*\b\s)?((\<)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\>)))# $1.($var = $4.$5.$2."\t".$8) #eg){}
        elsif(s#((\")(\b\w*\b\s)(\b\w*\b\s)?(\b\w*\b)(\")(\s)(\<)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\>))# $1.($var =$3.$4.$5."\t".$9) #eg){}
        elsif(s#((\")(\b\w*\b)(\,\s)(\b\w*\b)(\s\b\w*\b)?(\")(\s)((\<)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\>)))# $1.($var = $5." ".$6.$3."\t".$11) #eg){}
        elsif(s#((\")(\b\w*\b\s)(\b\w*\b\s)?(\b\w*\b)(\")(\s)(\[)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\]))# $1.($var =$3.$4.$5."\t".$9) #eg){}
        elsif(s#((\")(\b\w*\b)(\,\s)(\b\w*\b)(\s\b\w*\b)?(\")(\s)((\[)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\])))# $1.($var = $5." ".$6.$3."\t".$11) #eg){}
        elsif(s#((\")(\b\w*\b\s)(\b\w*\b\s)?(\b\w*\b\s)(\[)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\])(\"))# $1.($var =$3.$4.$5."\t".$7) #eg){}
        elsif(s#((\")(\b\w*\b)(\,\s)(\b\w*\b)(\s\b\w*\b)?(\s)((\[)(\b\w*((\.)?(\w*)?)*[@]\w*\.\w*((\.)?(\w*)?)*\b)(\])(\")))# $1.($var = $5." ".$6.$3."\t".$10) #eg){}
        #hash my emails
        $hashed{$var}++;
    }
    #print out all the emails
    for $var (sort keys %hashed){
        print"$var\n";
    }

Replies are listed 'Best First'.
Re: finding emails
by tachyon (Chancellor) on Nov 11, 2002 at 23:08 UTC

    Incomprehensible. Perhaps start here...

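        while (<DATA>) {
            # In list context, m//g returns every capture on the line.
            my @emails = $_ =~ m/([^\s<>]+\@[^\s<>]+)/g;
            # Count each address; the hash keys end up unique.
            $emails{$_}++ for @emails;
        }
        use Data::Dumper;
        print Dumper \%emails;
        __DATA__
        <> bar@ppp.ipsec jfreeman@[nospam]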




Re: finding emails
by IlyaM (Parson) on Nov 11, 2002 at 23:23 UTC
Re: finding emails
by Enlil (Parson) on Nov 11, 2002 at 22:57 UTC
    The if/elsif chain stops at the first regular expression that matches: that branch's block runs and the remaining elsif branches are skipped. Any other addresses on the line are lost as well, because $hashed{$var}++ runs only once per line, however many matches the /g substitution makes. You might want to put each regular expression in its own while loop (I am sure there are more efficient ways of doing this) and store the resulting values in a hash as you go. For example:

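        while ( <FIN> ) {
            # No /g here, so each pass handles one match. The replacement
            # must not match first_expression again, or this loop never ends.
            while ( s#first_expression#do_something#e ) {
                $hashed{$var}++;
            }
            #more loops here
        }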

    This way the loop will continue while there exists anything else in the text that matches the regular expression.
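
    One way to sketch that loop without substituting anything, assuming a much simpler pattern than the ones in the question: a match in scalar context with /g resumes where the previous match ended, so the body runs once per address and the loop cannot re-match its own replacement. The sub name find_addresses, the pattern, and the sample line are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "name\taddress" strings for every
# 'First Last <user@host>' pair found on one line.
sub find_addresses {
    my ($line) = @_;
    my @found;
    # In scalar context, m//g picks up where the previous match stopped,
    # so the loop body runs once per address on the line.
    while ( $line =~ /(\w+(?:\s\w+)*)\s<([^\s<>]+\@[^\s<>]+)>/g ) {
        push @found, "$1\t$2";
    }
    return @found;
}

my %hashed;
my $sample = 'John Smith <john@example.com>, Jane Q Doe <jane.doe@example.org>';
$hashed{$_}++ for find_addresses($sample);
print "$_\n" for sort keys %hashed;
```

    The name pattern here is greedy about words, so free text directly in front of a name would be swallowed into it; the real formats would need tighter patterns.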

    Update: Another thing I was thinking of is that you could split each line into "words" and then run the expressions against each "word". I am assuming that your emails are separated by something or other (be it a space or something else), so you won't have the problem of the second e-mail not matching, as each "word" will have at most one e-mail in it.
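
    A rough sketch of the split-into-words idea, assuming the addresses are separated by whitespace and wrapped in at most <>, [] or quotes. The sub name addresses_by_word and the address test are hypothetical, and the test is far looser than a real address grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into whitespace-separated "words"
# and keep the ones that look like an address.
sub addresses_by_word {
    my ($line) = @_;
    my @found;
    for my $word ( split ' ', $line ) {
        # Strip leading wrappers ( < [ " ) and trailing wrappers
        # or punctuation ( > ] " , ; ).
        $word =~ s/^[<\["]+//;
        $word =~ s/[>\]",;]+$//;
        # Loose test: something@something.something, with no second @.
        push @found, $word if $word =~ /^[^\s\@]+\@[^\s\@]+\.[^\s\@]+$/;
    }
    return @found;
}

my %hashed;
$hashed{$_}++
    for addresses_by_word('"Smith, John" <john@example.com> Jane [jane@example.org];');
print "$_\n" for sort keys %hashed;
```

    Note that this version keeps only the addresses: the display names contain spaces, so they do not survive the split and would have to be recovered some other way.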


Re: finding emails
by rbc (Curate) on Nov 11, 2002 at 23:26 UTC
    If you are like me and cannot come up with a clever single regex pattern
    that satisfies all the possible patterns, you can resort to
    something like what I did for finding various dates in a text file.

        #!/usr/bin/perl -w
        use strict;

        my $lineN = 1;
        while(<DATA>){
            for my $dateFormat ( qw (
                \D\(\d\d\/\d\d\/\d\d\)\D
                \D\(\d\d\/\d\d\/\d\d\d\d\)\D
                \D\(\d\d\d\d/\d\d\/\d\d\)\D
                \D\(\d\d\d\d-\d\d-\d\d\)\D
                \D\(\d\d-\d\d-\d\d\)\D
                \D\(\d\d-\d\d-\d\d\d\d\)\D
                ^\(\d\d\/\d\d\/\d\d\)\D
                ^\(\d\d\/\d\d\/\d\d\d\d\)\D
                ^\(\d\d\d\d/\d\d\/\d\d\)\D
                ^\(\d\d\d\d-\d\d-\d\d\)\D
                ^\(\d\d-\d\d-\d\d\)\D
                ^\(\d\d-\d\d-\d\d\d\d\)\D
                \D\(\d\d\/\d\d\/\d\d\)$
                \D\(\d\d\/\d\d\/\d\d\d\d\)$
                \D\(\d\d\d\d/\d\d\/\d\d\)$
                \D\(\d\d\d\d-\d\d-\d\d\)$
                \D\(\d\d-\d\d-\d\d\)$
                \D\(\d\d-\d\d-\d\d\d\d\)$
            ) ) {
                my @dates = ( /$dateFormat/g);
                for ( my $i=0; $i<=$#dates; $i++ ) {
                    my $date = $dates[$i];
                    print "Found $date on line $lineN\n";
                }
            }
            $lineN++;
        }
        __DATA__
        On 11/11/02 I sent a email and it didn't get there until 11/15/2002
        Sometime there is no date.
        But then again there one date like 12/31/99 on the line
        and this 123/12/2002 is not a date. maybe
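
    The same list-of-patterns approach might carry over to the e-mail question along these lines. This is only a sketch covering two of the formats (First Last <addr> and "Last, First" <addr>); the names $addr, @formats and extract_all, and the patterns themselves, are assumptions made for illustration, not a full address parser.

```perl
use strict;
use warnings;

# One deliberately loose address pattern shared by every format.
my $addr = qr/[^\s<>\[\]"]+\@[^\s<>\[\]"]+/;

# Hypothetical format list: each entry pairs a pattern with a sub that
# turns its three captures into "name\taddress".
my @formats = (
    # "Last, First" <user@host>
    [ qr/"(\w+),\s(\w+)"\s<($addr)>/, sub { "$_[1] $_[0]\t$_[2]" } ],
    # First Last <user@host>
    [ qr/(\w+)\s(\w+)\s<($addr)>/,    sub { "$_[0] $_[1]\t$_[2]" } ],
);

sub extract_all {
    my ($line) = @_;
    my @found;
    for my $format (@formats) {
        my ( $pattern, $build ) = @$format;
        # Scalar-context /g visits every match of this format on the line.
        while ( $line =~ /$pattern/g ) {
            push @found, $build->( $1, $2, $3 );
        }
    }
    return @found;
}

print "$_\n"
    for extract_all('"Smith, John" <john@example.com> Jane Doe <jane@example.org>');
```

    Results come back grouped by format rather than in line order, which does not matter if they are going into a hash anyway.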
