Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Sorting a list by frequency of items

by yojimbo (Monk)
on Mar 22, 2001 at 15:38 UTC ( [id://66302]=perlquestion: print w/replies, xml ) Need Help??

yojimbo has asked for the wisdom of the Perl Monks concerning the following question:

I've been working (for far too long) on a mail-script to deliver to 50,000 users (more added every week). It's got to the point where it's grinding the mail-server, particularly if there are invalid addresses in the list.

It occurred to me that I could optimise this by delivering to the most frequent addresses first: should save on nslookups, and frequently-appearing hostnames are more likely to be valid than the "go@away.no.spam" variety. About 80% of the list should be processed before I run into the single-user hostnames. The most frequently occurring hostname is on the same host as the mailing program, so this 40% of the list should be delivered almost immediately. (BTW, the host doesn't use sendmail - I don't get to choose the MTA.)

Here's the outline of the code I produced, and I'd be grateful for any comments about how I could do this more efficiently. It's not bad atm, sorting the 50,000 addresses in ~ 3 secs, but I'm sure there's another and probably more efficient way to do it.

# unsorted addresses start off in the @addresses array. # sorted ones end up in the @recipients array. foreach my $address (@addresses) { my ($name, $host) = split(/\@/, $address); # various checks on the validity of name and host, snipped push @{$host}, "$name\@$host"; $frequency{$host}++; } foreach my $host (reverse sort { $frequency{$a} <=> $frequency{$b} } k +eys %frequency) { foreach my $address (@{$host}) { push @recipients, $address; } @{$host} = -1; }

So I'm naturally concerned about the wild proliferation of (often one-element) arrays, even though these are exterminated later. TIA.

Replies are listed 'Best First'.
Re: Sorting a list by frequency of items
by Trimbach (Curate) on Mar 22, 2001 at 18:06 UTC
    For legitimate bulkmail you might check into Mail::Bulkmail on CPAN. I haven't used it but I've looked into it before and it looks as if it does something like you're interested in, specifically optimizing the mail sending by batching recipients on the same domain (so, for example, it sends all the yahoo.com folks at once instead of individually.) It's suppossed to be very fast, and might be what you need.

    Gary Blackburn
    Trained Killer

      Yes indeed Mail::Bulkmail does exactally what you need. It's super fast and even more so when you use the use_envelope method. What use_envelope does is send one message to each server and a list of people on that server who are to receive said message. I also like the merge features but by using them you loose the ability to call use_envelope

      Also see Mail::CheckUser, which allows you to check the format of the email address for correctness, check that domain's DNS for a MX record, and them will connect to destination server and do a VRFY on the user to make sure it's valid, which is a good way to clean up your list.

      Jason Miller, Perl Geek
      CTO - Digital Campaigns, INC.

Some thoughts about this.
by yojimbo (Monk) on Mar 22, 2001 at 17:07 UTC
    I probably shouldn't reply to my own question, but I must confess I'm amazed at the number of people who have responded with "ha! you're an evil spammer" rather than helping with the (fairly minor) problem. The one helpful reply has since been deleted.

    I work for an Internet company which provides services to a number of large and medium-sized ISPs. These people have legitimate reasons to send email to their customers, and that's what this script does. These people are asked *when they sign up* whether or not they want the email, and no, the question isn't buried at the foot of the page where you can't see it. The reason I'm optimising the script is because these people *complain* when they don't get it on time!

    Previously, I worked for a political-lobbying organisation which also sent an email to 25,000 members each week. This wasn't spam: it was an essential part of the campaign these people were orchestrating. What I'm getting at here is this: there are legitimate reasons to send large quantities of emails to people, and the glib assumption that I'm sending spam is unfounded and, I must confess, very offensive.

    If I had been engaged in spamming, I would have disguised the question. I didn't, because I have nothing to hide. It didn't occur to me (OK, I'm too trusting) that anyone would think that I was doing anything anti-social. I didn't preface the question with "this isn't a spam script" because, even if I had, I'm sure you'd all be here now saying "Oh yes it is! Why else would you do this sort of thing?" Same discussion, different route.

    Does this mean there are some things which can't be discussed here? It seems so. I'll make sure I don't post anything about sending email in future, because that would be irrefutable proof that I'm spamming.

    The script works as-is. I had hoped for a better insight into this problem, and that the discussion might help others -- which is what this place is for, right?? I would be grateful if anyone with the appropriate special powers could delete the question, because I think I've just wasted a good deal of my time, and yours too.

      You haven't wasted your time. You got some help, eh? And the people here are kinda gun-shy since we've been winged a couple times before.

      Also, if you are actually offended at glib assumptions then you are far too thin-skinned to be walking around without your rose-colored glasses on. They were right not to help you if they didn't want and right to call you out on the spam question and you were right to call em rude for it. And there are indeed things you can't discuss here. =) Just mention getting a TCL script to talk to your Scheme code and then watch the fan get buried...

      ## tested now (had the sort backwards) my %domain; foreach my $address (@addresses) { my ($name, $host) = split(/\@/, $address); push @{$domain{$host}},$address } foreach my $host (sort { $#{$domain{$b}} <=> $#{$domain{$a}} } keys %d +omain) { mailout(@{$domain{$host}}); }

      --
      $you = new YOU;
      honk() if $you->love(perl)

      While I appreciate that you would have disguised the question had your motives been impure, you have to consider your audience. Our experience is that new people wander in here on a weekly -- and sometimes daily -- basis, boldly asking some pretty dubious questions, ranging in type from the barely disguised "please do my homework for me" to "how can I use Perl to h4x0r orbital satellites".

      Until you've established a reputation here (and you do seem off to a good start), it's not reasonable to expect people to assume that your motives are pure. Particularly with questions that suggest spamming.

      Think of Checkov in the "Save the Whales" Star Trek movie, stopping a policeman on the street to ask "Can you direct me to your nooklear vwesselz?"

        Thanks for taking the time to put me in the picture (and thanks also to "extreme"). I've spoken to a few people here about how the posting does press the wrong buttons - that "nospam" style email address is a bad example (although, believe it or not, some people do choose to enter that even though they're not required to sign up to the list).

        Anyway, next time I post something which looks a bit suspect I'll put in some background as well.

      Since no one else has thrown in on your side, I thought I'd toss in my 2 Yen. I personally think the response was rediculous, and maybe PM is a little trigger happy.

      I'll shoot from the hip myself and guess that any medium sized company could have a mailing list of 50,000 addresses...and be more common than Spamers. I work for a company with 20 people and even we have a couple thousand addresses just for our customers, not counting our potential clients.

      Even though I've never had any problems with Spam mail, I could certainly understand some caution...but /msg-ing in chatbox and then saying you still don't believe him?

      -Lexicon

(tye)Re: Sorting a list by frequency of items
by tye (Sage) on Mar 22, 2001 at 20:18 UTC

    First, I wouldn't use symbolic references like that. Use a hash. That is, instead of push @{$host} use push @{$names{$host}} where you declare my %names earlier. So please add use strict to you code so these things will be caught.

    Then you could replace the push line with something that doesn't resort to an array until needed:

    if( ! $names{$hash} ) { $names{$hash}= $name; } elsif( ! ref( $users{$hash} ) ) { $users{$hash}= [ $users{$hash}, $name ]; } else { push @{ $users{$hash} }, $name; }
    or you could just cram all of the names into one long string:     $users{$hash} .= $name.$;; or perhaps just removing the namespace pollution is enough of an improvement for you?

    After thinking of several other alternates, I'd probably go with "ease of programming" and use what I outlined in the first paragraph and not sweat the tons of anonymous, 1-element arrays.

            - tye (but my friends call me "Tye")
Re: Sorting a list by frequency of items
by chipmunk (Parson) on Mar 22, 2001 at 21:43 UTC
    Two quick suggestions. First, instead of sorting the list of keys and then reversing it, sort in the proper order to begin with. Second, foreach (LIST) { push ARRAY, $_ } can be rewritten as push ARRAY LIST.
    foreach my $host (sort { $frequency{$b} <=> $frequency{$a} } keys %fre +quency) { push @recipients, @$host; }
Re: Sorting a list by frequency of items
by davorg (Chancellor) on Mar 22, 2001 at 15:49 UTC

    I have a natural nervousness about helping out on problems that involve sending out 50,000 emails as I can't help wondering how many of those 50,000 recipients actually wanted to see that email.

    You might get a lot more help here if you can convince us that you're not sending spam.

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      I don't think that's a fair comment :-P These are *subscribers* to an ISP who do, indeed, ask users to sign up rather than just spam them.

      But how do I convince you I'm not sending spam? You'll just have to take my word on it, which is why I didn't mention it in the first place.

        Well, many people are going to make those kinds of assumptions if you don't tell them otherwise. Sad but true!

        You would probably have got a different reaction if you'd have said something like "I know this sounds like spam, but it's really an opt-in mailing list".

        --
        <http://www.dave.org.uk>

        "Perl makes the fun jobs fun
        and the boring jobs bearable" - me

Re: Sorting a list by frequency of items
by flocto (Pilgrim) on Mar 23, 2001 at 08:11 UTC
    Don't be too angry with people that get tonns of spam every single day and don't want every 13-year-old-highschool-scriptkiddy being able to send tonns of mail WITH their help. I (and nobody else) had doubt that there exist good reasons for sending huge amounts of mail.. So here's a solution that should be kinda efficent:
    my %hash_addresses; foreach my $address (@all_addresses) { $hash_addresses{$address}++; } foreach my $address (sort { $hash_addresses{$b} <=> $hash_addresses{$a +} } (keys %hash_addresses)) { send_mail ($address); }
    You will have to write the &send_mail routine by yourself :)
    Regards, octo

    --
    GED/CC d-- s:- a--- C++(+++) UL+++ P++++$ L++>++++ E--- W+++@ N o? K? w-- O- M-(+) V? !PS !PE !Y PGP+(++) t-- 5 X+ R+(+++) tv+(++) b++@ DI+() D+ G++ e->+++ h!++ r+(++) y+
Re: Sorting a list by frequency of items
by yojimbo-san (Initiate) on Mar 23, 2001 at 19:17 UTC

    This is a sort of non-perl answer, and it may be wholly inapplicable to you.

    However, the result your're looking for (minimising nslookups in the MTA) is pretty much only achievable within the MTA itself. I suggest exim http://www.exim.org - it already understands the concept of caching answers and multiple emails destined to the same recipient MTA within the mail queue (and this is smarter than the "domain name" choice, it's based on the MX for that domain)

    (p.s. nice name, yojimbo! )

    Quick wafting zephyrs vex bold Jim

      Offishelly, I don't get to choose the MTA, but talking to a couple of other people here, they also send out bulk mails like this to other ISPs so I might be able to set up a dedicated box for this: the server's mail queue is still backed up even after all the work optimising the script :-( If I can sell this idea to management and a few other techs, your recommendation might be very useful. (PS: kewl name yourself :-)
        My vote is for postfix. I've used postfix on several machines. In my experience throughput of a well tuned postfix can be more than 8x what I could get with sendmail on the same hardware. Postfix is also much easier than sendmail to configure (not to mention much more secure). Postfx doesn't have all the features that sendmail does, but it has plenty. I've never needed a feature that postfix didn't have.

        Either way you go I recomend that you throttle your mail generator so it does not overload the mailserver's incoing queue. In my experience MTA performance (postfix, sendmail, or qmail) starts to suffer when the incoming queue gets too large.

        My choice : Sendmail most used, well known, stable ,very portable, and powerful..
        And even if my opinion is biased (Im a sendmail administrator)
        I think sendmail isn't so complicated to configure/manage through the m4 macros.

        Alternative choice : Postfix
        A little bit simpler and said to be more secure and faster
        (even if I'm not aware of any Senmail vulnerability in the current version...)

        In fact it depends on your objective :
        If you want something fast and easy to install to do simple things I think you should use postfix.
        Now if you want to hook some stuffs in your MTA (lots of already developped add-on exists),
        Or if you know you'll do complex things with your MTA, try Sendmail.

        "Trying to be a SMART lamer" (thanx to Merlyn ;-)
Re: Sorting a list by frequency of items
by jeroenes (Priest) on Mar 22, 2001 at 16:01 UTC
    (node deleted, at least for now. Too bad yojimbo can't divulge more information. He probably is not a spammer after all, but how would you know for sure? I think I would have felt offended in his place, by these kind of reactions. But than, I don't have such mailing lists. )
      So you /msg me to ask for more information, then when I do post some more you say you can't believe me! Help like this I don't need. Or maybe you'd just like me to post my SSH key here so you and everyone else can check out the script and it's environment in person??

      Hint: I mention in the original message that 40% of the users are on the same box as the system. If I was spamming addresses harvested from usenet or web-pages, this wouldn't be true, would it?


      Good old Monty Python...


      CROWD: A witch! A witch! A witch! We've got a witch! A witch!
      VILLAGER #1: We have found a witch, might we burn her?
      CROWD: Burn her! Burn!
      BEDEMIR: How do you know she is a witch?
      VILLAGER #2: She looks like one.
      BEDEMIR: Bring her forward.
      WITCH: I'm not a witch. I'm not a witch.
      BEDEMIR: But you are dressed as one.
      WITCH: They dressed me up like this.
      CROWD: No, we didn't... no.
      WITCH: And this isn't my nose, it's a false one.
      BEDEMIR: Well?
      VILLAGER #1: Well, we did do the nose.
      BEDEMIR: The nose?
      VILLAGER #1: And the hat -- but she is a witch!
      CROWD: Burn her! Witch! Witch! Burn her!

Re: Sorting a list by frequency of items
by willdooUK (Beadle) on Mar 22, 2001 at 15:52 UTC
    (node changed).
    Well, my apologies.
    I guess I had just gotten too many "make $$$$ now" mails today, and jumped the gun a bit.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://66302]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (5)
As of 2024-04-19 07:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found