yojimbo has asked for the wisdom of the Perl Monks concerning the following question:
I've been working (for far too long) on a mail-script to
deliver to 50,000 users (more added every week). It's got to
the point where it's grinding the mail-server, particularly
if there are invalid addresses in the list.
It occurred to me that I could optimise this by delivering
to the most frequent addresses first: should save on nslookups,
and frequently-appearing hostnames are more likely to be valid
than the "go@away.no.spam" variety. About 80% of the list
should be processed before I run into the single-user hostnames.
The most frequently occurring hostname is on the same host as
the mailing program, so this 40% of the list should be delivered almost
immediately.
(BTW, the host doesn't use sendmail - I don't get to
choose the MTA.)
Here's the outline of the code I produced, and I'd be grateful
for any comments about how I could do this more efficiently.
It's not bad atm, sorting the 50,000 addresses in ~ 3 secs,
but I'm sure there's another and probably more efficient way
to do it.
# unsorted addresses start off in the @addresses array.
# sorted ones end up in the @recipients array.
foreach my $address (@addresses) {
my ($name, $host) = split(/\@/, $address);
# various checks on the validity of name and host, snipped
push @{$host}, "$name\@$host";
$frequency{$host}++;
}
foreach my $host (reverse sort { $frequency{$a} <=> $frequency{$b} } k
+eys %frequency) {
foreach my $address (@{$host}) {
push @recipients, $address;
}
@{$host} = -1;
}
So I'm naturally concerned about the wild proliferation of
(often one-element) arrays, even though these are exterminated
later. TIA.
Re: Sorting a list by frequency of items
by Trimbach (Curate) on Mar 22, 2001 at 18:06 UTC
|
| [reply] |
|
Yes indeed Mail::Bulkmail does exactally what you need.
It's super fast and even more so when you use the
use_envelope method. What use_envelope does is send
one message to each server and a list of people on that server
who are to receive said message. I also like the merge features
but by using them you loose the ability to call use_envelope
Also see Mail::CheckUser, which allows you to check the format of the email address for correctness, check that domain's DNS for a MX record, and them will connect to destination server and do a VRFY on the user to make sure it's valid, which is a good way to clean up your list.
Jason Miller, Perl Geek
CTO - Digital Campaigns, INC.
| [reply] |
Some thoughts about this.
by yojimbo (Monk) on Mar 22, 2001 at 17:07 UTC
|
I probably shouldn't reply to my own question, but I must confess I'm amazed at
the number of people who have responded with "ha! you're an evil spammer"
rather than helping with the (fairly minor) problem. The one helpful reply has
since been deleted.
I work for an Internet company which provides services to a number of large and
medium-sized ISPs. These people have legitimate reasons to send email to their
customers, and that's what this script does. These people are asked *when they
sign up* whether or not they want the email, and no, the question isn't buried
at the foot of the page where you can't see it. The reason I'm optimising the
script is because these people *complain* when they don't get it on time!
Previously, I worked for a political-lobbying organisation which also sent an
email to 25,000 members each week. This wasn't spam: it was an essential part
of the campaign these people were orchestrating. What I'm getting at here is
this: there are legitimate reasons to send large quantities of emails to
people, and the glib assumption that I'm sending spam is unfounded and, I must
confess, very offensive.
If I had been engaged in spamming, I would have disguised the question. I
didn't, because I have nothing to hide. It didn't occur to me (OK, I'm too
trusting) that anyone would think that I was doing anything anti-social. I
didn't preface the question with "this isn't a spam script" because, even if I
had, I'm sure you'd all be here now saying "Oh yes it is! Why else would you do
this sort of thing?" Same discussion, different route.
Does this mean there are some things which can't be discussed here? It seems so.
I'll make sure I don't post anything about sending email in future, because that
would be irrefutable proof that I'm spamming.
The script works as-is. I had hoped for a better insight into this problem, and
that the discussion might help others -- which is what this place is for,
right?? I would be grateful if anyone with the appropriate special powers could
delete the question, because I think I've just wasted a good deal of my time,
and yours too.
| [reply] |
|
You haven't wasted your time. You got some help, eh? And
the people here are kinda gun-shy since we've been winged
a couple times before.
Also, if you are actually offended at glib assumptions
then you are far too thin-skinned to be walking around
without your rose-colored glasses on. They were right not
to help you if they didn't want and right to call you out
on the spam question and you were right to call em rude for
it. And there are indeed things you can't discuss here.
=) Just mention getting a TCL script to talk to your Scheme
code and then watch the fan get buried...
## tested now (had the sort backwards)
my %domain;
foreach my $address (@addresses) {
my ($name, $host) = split(/\@/, $address);
push @{$domain{$host}},$address
}
foreach my $host (sort { $#{$domain{$b}} <=> $#{$domain{$a}} } keys %d
+omain) {
mailout(@{$domain{$host}});
}
--
$you = new YOU;
honk() if $you->love(perl) | [reply] [d/l] |
|
While I appreciate that you would have disguised the question had your motives been impure, you have to consider your audience. Our experience is that new people wander in here on a weekly -- and sometimes daily -- basis, boldly asking some pretty dubious questions, ranging in type from the barely disguised "please do my homework for me" to "how can I use Perl to h4x0r orbital satellites".
Until you've established a reputation here (and you do seem off to a good start), it's not reasonable to expect people to assume that your motives are pure. Particularly with questions that suggest spamming.
Think of Checkov in the "Save the Whales" Star Trek movie, stopping a policeman on the street to ask "Can you direct me to your nooklear vwesselz?"
| [reply] |
|
| [reply] |
|
Since no one else has thrown in on your side, I thought I'd toss in my 2 Yen. I personally think the response was rediculous, and maybe PM is a little trigger happy.
I'll shoot from the hip myself and guess that any medium sized company could have a mailing list of 50,000 addresses...and be more common than Spamers. I work for a company with 20 people and even we have a couple thousand addresses just for our customers, not counting our potential clients.
Even though I've never had any problems with Spam mail, I could certainly understand some caution...but /msg-ing in chatbox and then saying you still don't believe him?
-Lexicon
| [reply] |
(tye)Re: Sorting a list by frequency of items
by tye (Sage) on Mar 22, 2001 at 20:18 UTC
|
First, I wouldn't use symbolic references like that. Use a hash. That is, instead of push @{$host} use push @{$names{$host}} where you declare my %names earlier. So please add use strict to you code so these things will be caught.
Then you could replace the push line with something that doesn't resort to an array until needed:
if( ! $names{$hash} ) {
$names{$hash}= $name;
} elsif( ! ref( $users{$hash} ) ) {
$users{$hash}= [ $users{$hash}, $name ];
} else {
push @{ $users{$hash} }, $name;
}
or you could just cram all of the names into one long string:
$users{$hash} .= $name.$;;
or perhaps just removing the namespace pollution is enough of an improvement for you?
After thinking of several other alternates, I'd probably go with "ease of programming" and use what I outlined in the first paragraph and not sweat the tons of anonymous, 1-element arrays.
-
tye
(but my friends call me "Tye") | [reply] [d/l] [select] |
Re: Sorting a list by frequency of items
by chipmunk (Parson) on Mar 22, 2001 at 21:43 UTC
|
Two quick suggestions. First, instead of sorting the list of keys and then reversing it, sort in the proper order to begin with. Second, foreach (LIST) { push ARRAY, $_ } can be rewritten as push ARRAY LIST.
foreach my $host (sort { $frequency{$b} <=> $frequency{$a} } keys %fre
+quency) {
push @recipients, @$host;
}
| [reply] [d/l] |
Re: Sorting a list by frequency of items
by davorg (Chancellor) on Mar 22, 2001 at 15:49 UTC
|
I have a natural nervousness about helping out on
problems that involve sending out 50,000 emails as I can't
help wondering how many of those 50,000 recipients actually
wanted to see that email.
You might get a lot more help here if you can convince
us that you're not sending spam.
--
<http://www.dave.org.uk>
"Perl makes the fun jobs fun
and the boring jobs bearable" - me
| [reply] |
|
| [reply] |
|
Well, many people are going to make those kinds of
assumptions if you don't tell them otherwise. Sad but
true!
You would probably have got a different reaction if you'd
have said something like "I know this sounds like spam, but
it's really an opt-in mailing list".
--
<http://www.dave.org.uk>
"Perl makes the fun jobs fun
and the boring jobs bearable" - me
| [reply] |
|
Re: Sorting a list by frequency of items
by flocto (Pilgrim) on Mar 23, 2001 at 08:11 UTC
|
Don't be too angry with people that get tonns of spam every single day and don't want every 13-year-old-highschool-scriptkiddy being able to send tonns of mail WITH their help. I (and nobody else) had doubt that there exist good reasons for sending huge amounts of mail.. So here's a solution that should be kinda efficent:
my %hash_addresses;
foreach my $address (@all_addresses)
{
$hash_addresses{$address}++;
}
foreach my $address (sort { $hash_addresses{$b} <=> $hash_addresses{$a
+} } (keys %hash_addresses))
{
send_mail ($address);
}
You will have to write the &send_mail routine by yourself :)
Regards, octo
--
GED/CC d-- s:- a--- C++(+++) UL+++ P++++$ L++>++++ E--- W+++@ N o? K? w-- O- M-(+) V? !PS !PE !Y PGP+(++) t-- 5 X+ R+(+++) tv+(++) b++@ DI+() D+ G++ e->+++ h!++ r+(++) y+ | [reply] [d/l] [select] |
Re: Sorting a list by frequency of items
by yojimbo-san (Initiate) on Mar 23, 2001 at 19:17 UTC
|
This is a sort of non-perl answer, and it may be wholly inapplicable to you.
However, the result your're looking for (minimising nslookups in the MTA) is pretty much only achievable within the MTA itself. I suggest exim http://www.exim.org - it already understands the concept of caching answers and multiple emails destined to the same recipient MTA within the mail queue (and this is smarter than the "domain name" choice, it's based on the MX for that domain)
(p.s. nice name, yojimbo! )
Quick wafting zephyrs vex bold Jim
| [reply] |
|
Offishelly, I don't get to choose the MTA, but talking to a
couple of other people here, they also send out bulk mails like
this to other ISPs so I might be able to set up a dedicated
box for this: the server's mail queue is still backed up even
after all the work optimising the script :-( If I can sell this
idea to management and a few other techs, your recommendation
might be very useful.
(PS: kewl name yourself :-)
| [reply] |
|
My vote is for postfix.
I've used postfix on several machines. In my
experience throughput of a
well tuned postfix can be more than 8x what I could
get with sendmail on the same hardware. Postfix is also much easier than sendmail to configure (not to mention much more secure).
Postfx doesn't have all the features that sendmail does, but it has plenty.
I've never needed a feature that postfix didn't have.
Either way you go I recomend that you throttle your mail generator so it does
not overload the mailserver's incoing queue. In my experience MTA performance
(postfix, sendmail, or qmail) starts to suffer when the incoming
queue gets too large.
| [reply] |
|
|
My choice : Sendmail most used, well known, stable ,very portable, and powerful..
And even if my opinion is biased
(Im a sendmail administrator) I think sendmail isn't so complicated to configure/manage through the m4 macros.
Alternative choice : Postfix
A little bit simpler and said to be more secure and faster
(even if I'm not aware of any Senmail vulnerability in the current version...)
In fact it depends on your objective :
If you want something fast and easy to install to do simple things I think you should use postfix.
Now if you want to hook some stuffs in your MTA (lots of already developped add-on exists),
Or if you know you'll do complex things with your MTA, try Sendmail.
"Trying to be a SMART lamer" (thanx to Merlyn ;-)
| [reply] |
Re: Sorting a list by frequency of items
by jeroenes (Priest) on Mar 22, 2001 at 16:01 UTC
|
(node deleted, at least for now. Too bad
yojimbo can't divulge more information. He
probably is not a spammer after all, but how
would you know for sure? I think I would
have felt offended in his place, by these kind
of reactions. But than, I don't have such mailing lists.
)
| [reply] |
|
So you /msg me to ask for more information, then when I do
post some more you say you can't believe me! Help like this
I don't need. Or maybe you'd just like me to post my SSH
key here so you and everyone else can check out the script
and it's environment in person??
Hint: I mention in the original message that 40% of the
users are on the same box as the system. If I was spamming
addresses harvested from usenet or web-pages, this wouldn't be
true, would it?
Good old Monty Python...
CROWD: A witch! A witch! A witch! We've got a witch! A witch!
VILLAGER #1: We have found a witch, might we burn her?
CROWD: Burn her! Burn!
BEDEMIR: How do you know she is a witch?
VILLAGER #2: She looks like one.
BEDEMIR: Bring her forward.
WITCH: I'm not a witch. I'm not a witch.
BEDEMIR: But you are dressed as one.
WITCH: They dressed me up like this.
CROWD: No, we didn't... no.
WITCH: And this isn't my nose, it's a false one.
BEDEMIR: Well?
VILLAGER #1: Well, we did do the nose.
BEDEMIR: The nose?
VILLAGER #1: And the hat -- but she is a witch!
CROWD: Burn her! Witch! Witch! Burn her!
| [reply] |
Re: Sorting a list by frequency of items
by willdooUK (Beadle) on Mar 22, 2001 at 15:52 UTC
|
(node changed). Well, my apologies. I guess I had just gotten too many "make $$$$ now" mails today, and jumped the gun a bit.
| [reply] |
|
|