Re: Perl Programs that can retrieve email addresses from web pages
by davorg (Chancellor) on Jan 03, 2001 at 20:21 UTC
I'm trying to think of a use for such a program that
wouldn't end up with many people receiving emails that
they don't want.
If I think of one, then I'll let you know how to do
it!
--
<http://www.dave.org.uk>
"Perl makes the fun jobs fun
and the boring jobs bearable" - me
(kudra: and you want this for what purpose?) Re: Perl Programs that can retrieve email addresses from web pages
by kudra (Vicar) on Jan 03, 2001 at 20:21 UTC
I'm curious about what a program like this would
be used for, because the only use I can think of is
spam. I don't like the idea of helping anyone do that.
But it's possible you could want this for some other
reason, so
you may as well look at HTTP and FTP clients,
where you should find some answers about getting contents
of web pages and parsing the results. From there you
could either look for 'mailto' or use some inexact
regex to find things which look like email addresses.
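A minimal sketch of that approach (the fallback regex is deliberately inexact, and the sample HTML is invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull candidate addresses from a chunk of HTML: explicit mailto:
# links first, then an inexact "looks like an address" fallback.
sub extract_addresses {
    my ($html) = @_;
    my @found = $html =~ /mailto:([^"'\s>?]+)/gi;
    push @found, $html =~ /\b([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)\b/g;
    my %seen;                          # de-duplicate, keep order
    return grep { !$seen{$_}++ } @found;
}

# Fetching the page itself is a one-liner with LWP::Simple:
#   use LWP::Simple qw(get);
#   my @addrs = extract_addresses(get($url));
my $sample = '<a href="mailto:jane@example.com">Jane</a> or bob@example.org';
print "$_\n" for extract_addresses($sample);
```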
Re: Perl Programs that can retrieve email addresses from web pages
by strredwolf (Chaplain) on Jan 03, 2001 at 20:44 UTC
For the life of your job and your company, you DON'T
want to do this.
Gathering e-mails off of a page is tantamount to creating
"spamware", and there's a ton of that already going around,
causing ISPs to kick the creators off the net. It's just
not tolerated. It'll get you blackholed and blackballed
for life.
Instead, go to Mail Abuse Prevention Services and
also Abuse.net. Also drop by news.admin.net-abuse.email
on Usenet and ask for pointers on generating a confirmed opt-in
mailing list.
Update: If you want to see how badly it affects the network,
then read about how two guys were
sentenced to two years in jail for causing near-havoc with spam.
AOL, AT&T, Mindspring, etc. were all affected.
--
$Stalag99{"URL"}="http://stalag99.keenspace.com";
Guess next time I ask a question I'll do it anonymously. This is what a client asked us to do, for people who ordered something from their web site, and only those people. Thanks for the, um, suggestions on what to do and not to do. I probably should've mentioned that this was for a client, but I was short on time. Thanks again. I'll let my company know the response I got, and we'll talk to the client about how to approach this and see if they still want to do it.
This still sounds like it might be a clunky solution to whatever the root problem is. What I would have liked to see in the original question is more specificity about the requirements. Because most of the information was actually in the title of the question, we were left to fill in the details of the requirements for ourselves.
If someone is ordering something via your client's website, you do not need to use any page scrapers or anything else. Simply add a field for e-mail address and an opt-out checkbox to a form during the order process. Salvador's link below looks to be a good way to validate the email addresses. Store the valid addresses in your customer records along with the other customer data.
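For example, a sketch of that decision in Perl (the field names and the validity check are hypothetical; under CGI.pm you would pass in the submitted values of the email field and the checkbox):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether an order-form submission joins the mailing list:
# only if the (hypothetical) "mailing_list" checkbox was ticked AND
# the address looks superficially valid.  Under CGI.pm you would
# pass in $q->param('email') and $q->param('mailing_list').
sub list_address {
    my ($email, $opted_in) = @_;
    return undef unless $opted_in;
    return undef unless defined $email && $email =~ /^\S+\@\S+\.\S+$/;
    return $email;
}

print list_address('jane@example.com', 1) // 'not subscribed', "\n";
print list_address('jane@example.com', 0) // 'not subscribed', "\n";
```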
Ahh, to the meat of the problem. Definitely, scanning is bad.
However, if you're handling the ordering for the client, and
they want an e-mail list, you may want to put in code to
confirm those e-mails for the list itself. I have some code
which I can post (it's in Perl) which can do that. I just
have to remove some proprietary items and make it generic.
Also, always include the option for signing up for the list,
and default it to "No, I don't want to sign up". You'll get a
lot fewer complaints that way.
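One way to sketch the confirmation step (this is not strredwolf's actual code; the shared secret, the URL, and the mailing plumbing are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Confirmed opt-in sketch: mail the signup a link carrying a token,
# and only add the address to the list when that link is visited.
# $secret and the confirm.cgi URL are placeholders.
my $secret = 'change-me-before-use';

sub token_for {
    my ($email) = @_;
    return md5_hex($secret . lc $email);
}

sub confirmed {
    my ($email, $token) = @_;
    return $token eq token_for($email);    # true => really opted in
}

my $addr = 'jane@example.com';
print "Mail them: http://example.com/confirm.cgi?email=$addr;t=",
      token_for($addr), "\n";
print confirmed($addr, token_for($addr)) ? "confirmed\n" : "rejected\n";
```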
MAPS pushes these ideas, but they've been formed out of consensus
by many sysadmins while being flooded with mail during the CyberPromo
"era."
--
$Stalag99{"URL"}="http://stalag99.keenspace.com";
(Corion) Getting active (Re: Perl Programs that can retrieve email addresses from web pages)
by Corion (Patriarch) on Jan 03, 2001 at 23:15 UTC
Of course, for all those who run a website with CGI programs available, there is Sugarplum, a spam poisoner
that creates fake webs full of generated (and false)
email links. This doesn't really combat spam, of course, but
at least it ties up the resources of the would-be spammer
with worthless email addresses.
Sugarplum is written in Perl, of course :-)
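A toy version of the idea (this is not Sugarplum's actual code; the user and domain lists are invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Emit a page of plausible-looking but bogus mailto: links for a
# harvester to choke on.  The user and domain lists are invented.
sub fake_addr {
    my @users   = qw(info sales bob alice postmaster);
    my @domains = qw(example.com example.net example.org);
    return $users[ rand @users ] . int(rand 1000)
         . '@' . $domains[ rand @domains ];
}

print "<html><body>\n";
for (1 .. 10) {
    my $addr = fake_addr();
    print qq{<a href="mailto:$addr">$addr</a><br>\n};
}
print "</body></html>\n";
```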
Re: Perl Programs that can retrieve email addresses from web pages
by ichimunki (Priest) on Jan 03, 2001 at 20:24 UTC
You might start with HTML::Parser or HTML::TokeParser (the latter is a "simple" version of the former).
Grabbing all <a href="mailto:*"> type tags is the only "ethical" way to grab emails from the web-- and even then, email addresses often show up in such tags without the explicit consent of the owner of the email address. Make sure to validate them, since many of us hate getting email at autostripped addresses and butcher our tags accordingly. My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.
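A sketch of the TokeParser approach (the sample HTML is invented, and the validation regex is only a rough plausibility check, not a full RFC 822 match):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Walk the <a> tags and keep only explicit mailto: links whose
# target at least looks like an address; everything else is skipped.
my $html = '<a href="mailto:jane@example.com">Jane</a>
            <a href="http://www.example.com/">home</a>';

my $p = HTML::TokeParser->new(\$html);
my @mails;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href} or next;
    push @mails, $1
        if $href =~ /^mailto:([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)$/i;
}
print "$_\n" for @mails;
```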
Update: I want to second davorg's sentiment below. Only once, a very long time ago, did I ever receive an email from a web bot that was acceptable-- it told me about some broken links on my page... and then, of course, tried to sell me something. Which got the sender into the killfile pretty quickly.
My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.
My suggestion would be not to harvest email addresses
from web pages at all as people should be entitled to put
email addresses on a web page without worrying about
being attacked by spammers.
If you want to get email addresses then ask visitors to
your website to register - but allow them to opt out of
spamming.
--
<http://www.dave.org.uk>
"Perl makes the fun jobs fun
and the boring jobs bearable" - me
I have to disagree with the implication that if an email
address isn't mauled that it is ethical to
grab it from a mailto. In practice, giving your
email address may be akin to begging for junk mail, but
in theory (and ethics), I think that's different from
requesting junk email.
Update: strredwolf, exactly what I meant.
I have my address on my site because I want people to
be able to use it, not because I want junk mail.
Update: ichimunki, I think we do agree.
I was writing my comment at the same time as Dave wrote
his, and his sums up my point well enough.
I don't really want to get into an ethics debate, and I thought I was pretty clear about what I thought of using harvested emails for the purpose of spam. Soliciting is soliciting, whether electronic, by phone, snail mail, or door-to-door. I am not interested in using the "ethics" club to bludgeon free speech, whether it's opinions I don't like or offers to buy more crap. I should be able to request in any medium that I not be contacted again, once that initial solicitation has been made.
I also think sending anonymous spam should be a felony-- I put it on the same level as cracking passwords without permission-- attempts to subvert systems for unauthorized use. Other than mauling my email address to inhibit simple (or even the new-and-improved) harvesting, I cannot think of a single way to post information in public, and not expect the public to use that information if they want. Does robots.txt have an email solicitation "opt-in" flag?
Re: Perl Programs that can retrieve email addresses from web pages
by clemburg (Curate) on Jan 03, 2001 at 20:51 UTC
You might want to take a look at
Mastering Regular Expressions,
page 316, for a regex that matches a valid email address (grin).
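For comparison, a far looser stand-in (the book's RFC 822 pattern runs to several kilobytes; this one only checks the rough shape of an address):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The book's RFC 822 pattern runs to several kilobytes; this far
# looser stand-in only checks the rough shape of an address.
my $addr_re = qr/^[\w.+-]+\@[\w-]+(?:\.[\w-]+)+$/;

for my $addr ('jane@example.com', 'not-an-address') {
    printf "%-20s %s\n", $addr,
        $addr =~ $addr_re ? 'looks plausible' : 'rejected';
}
```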
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: Perl Programs that can retrieve email addresses from web pages
by boo_radley (Parson) on Jan 03, 2001 at 20:21 UTC
Update
*deletia*
yeah, fine, so it's probably for spam... sue me for being interested in a problem.