
Finding unique email addresses (was: Help!)

by vxp (Pilgrim)
on Jul 29, 2002 at 13:19 UTC ( [id://185968]=perlquestion )

vxp has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
use Email::Find;

$file = shift;
open (FILE, $file) or die "Couldn't open $file: $!";
while (<FILE>) {
    $text = $_;
    my $finder = Email::Find->new(sub {
        my($email, $orig_email) = @_;
        print "".$email->format."\n";
        return $orig_email;
    });
    $finder->find(\$text);
}
close FILE;

However, I need _unique_ addresses. In the bounce file that I am parsing with this, I could have some@address.com on line #478 and then the same email address one thousand lines further down the file. I want (or, rather, need) only one occurrence of each address. Help please :-)

edited: Mon Jul 29 13:39:54 2002 by jeffa - title change

Replies are listed 'Best First'.
Re: Help!
by flocto (Pilgrim) on Jul 29, 2002 at 13:33 UTC

    Since you told me in the Chatterbox that you have huge amounts of email addresses, I'd suggest you use some kind of database. DBI could be used to save your data to a "real" database. If this is overkill, you might as well use a flat-file database, AnyDBM_File for example. As others have suggested, you can use a hash as well, but then you might run out of memory.

    Regards,
    -octo

      Great idea ... and you could even do something clever: split the address at the '@' and, if you've seen that domain before, add the user to the record the database already stores for it, like this:
      • domain -> foo.com
      • user -> james.barr:mark.bazz
      It all depends on how much RAM you have, how much DB space, and so forth.
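The domain/user split described above could be sketched roughly as follows. This is only an illustration: an in-memory hash of hashes stands in for the database, and the names `%users_by_domain` and `add_address`, as well as the sample addresses, are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the database:
# domain => { user => count, ... }
my %users_by_domain;

# Split an address at the last '@' and record the user under its domain.
# Returns true only the first time this user/domain pair is seen.
sub add_address {
    my ($address) = @_;
    my $at = rindex($address, '@');
    return 0 if $at < 1;                          # no '@', or nothing before it
    my $user   = substr($address, 0, $at);
    my $domain = lc(substr($address, $at + 1));   # treat domains case-insensitively
    return 0 if $domain eq '';
    return 0 if $users_by_domain{$domain}{$user}++;   # seen before
    return 1;
}

add_address($_) for qw(james.barr@foo.com mark.bazz@foo.com james.barr@foo.com);

# One record per domain, users joined with ':' as in the list above.
for my $domain (sort keys %users_by_domain) {
    my $users = join ':', sort keys %{ $users_by_domain{$domain} };
    print "domain -> $domain\n";
    print "user -> $users\n";
}
```

Storing each domain string once is where the saving would come from; whether it is worth the extra bookkeeping depends on how many addresses share a domain.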

      --t. alex

      "Mud, mud, glorious mud. Nothing quite like it for cooling the blood!"
      --Michael Flanders and Donald Swann

      Update: It's debatable whether using a database is overkill. The data file is 300M daily, which is getting large. You could use a temporary database table and have MySQL worry about optimization and memory management rather than rely on one of the DBM modules.

        I think using a database is overkill. Why not simply use a hash tied to a DB file?
        It's much easier than installing a database and the DBI modules to suit.
        It'll most likely be faster, too, due to the simple nature of the data structure.

        --

        Brother Frankus.

        ¤
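A hash tied to a DB file, as suggested above, might look something like this minimal sketch. It assumes SDBM_File (which ships with Perl) as the DBM backend; AnyDBM_File or DB_File could be substituted. The temporary path and the hard-coded address list are hypothetical stand-ins for a real file location and for the addresses Email::Find would pass to its callback.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;                   # DBM backend bundled with Perl
use File::Temp qw(tempdir);

# Hypothetical location for the DBM files; a real run would use a fixed path.
my $dir     = tempdir(CLEANUP => 1);
my $db_path = "$dir/seen_emails";

# Tie the hash to disk so the set of seen addresses is not held in RAM.
tie my %seen, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0644
    or die "Couldn't tie $db_path: $!";

# Stand-in for the addresses found in the bounce file.
my @found = qw(some@address.com other@address.com some@address.com);

my @unique;
for my $email (@found) {
    next if exists $seen{$email};    # already stored on disk
    $seen{$email} = 1;
    push @unique, $email;
}
print "$_\n" for @unique;

untie %seen;
```

The loop body is the same "seen" test one would write with an ordinary hash; only the `tie` line changes where the data lives. Note that SDBM limits the size of each key/value pair to roughly 1 KB, which should be ample for email addresses.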

Re: Finding Unique E-Mail addresses
by talexb (Chancellor) on Jul 29, 2002 at 13:27 UTC
    If you are processing a list of E-Mail addresses, you can find the unique addresses by putting them into a hash as keys. At the end of the file, dump the keys out.

    If that's not what you are doing (hard to tell from this fragment), please explain.

    --t. alex

    "Mud, mud, glorious mud. Nothing quite like it for cooling the blood!"
    --Michael Flanders and Donald Swann

Re: Help!
by twerq (Deacon) on Jul 29, 2002 at 13:31 UTC
    A quick-and-dirty fix could be to keep each "seen" email address as a key in a hash (which could get very large, but hey--).

    This could get you going:
    #!/usr/bin/perl
    use Email::Find;
    
    my %seen_emails;  # this could get big!
    $file = shift;
    open (FILE, $file) or die "Couldn't open $file: $!";
    while (<FILE>)
    {
     $text = $_;
    
     my $finder = Email::Find->new(sub 
     {
         my($email, $orig_email) = @_;
         my $formatted_email = $email->format;
         if (!defined($seen_emails{$formatted_email})) {
              # remember we've seen this guy
              $seen_emails{$formatted_email} = 1;
              
              # and show this email addy
              print $formatted_email . "\n";
         }
         return $orig_email;
     });
    
     $finder->find(\$text);
    }
    close FILE;


    (this code is untested)

    --twerq
Re: Finding unique email addresses (was: Help!)
by vxp (Pilgrim) on Jul 29, 2002 at 14:17 UTC
    Arrrrrggggg!!!! Not all ISPs have a sensible bounce message, as it appears. AOL, for example, doesn't include a complete email address, such as luser2002@aol.com, but only the userid, luser2002 ... I completely forgot about that.

        Sorry response-newsletter@lists.katrillion.com. Your mail to the following recipients could not be delivered because they are not accepting mail with attachments or embedded images:
        micehatr

    The above is from AOL. Any suggestions on how to handle this? (BEFORE you suggest a bunch of regexes, take a look at the parsembox in the code section that I posted yesterday. I am trying to make this, parsembox2, BETTER than the first one.) Thanks for any input!
