PerlMonks  

Reaped: Perl Programs that can retrieve email addresses from web pages

by NodeReaper (Curate)
on Jan 03, 2001 at 20:05 UTC

NodeReaper has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: Perl Programs that can retrieve email addresses from web pages
by davorg (Chancellor) on Jan 03, 2001 at 20:21 UTC

    I'm trying to think of a use for such a program that wouldn't end up with many people receiving emails that they don't want.

    If I think of one, then I'll let you know how to do it!

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

(kudra: and you want this for what purpose?) Re: Perl Programs that can retrieve email addresses from web pages
by kudra (Vicar) on Jan 03, 2001 at 20:21 UTC
    I'm curious about what a program like this would be used for, because the only use I can think of is spam. I don't like the idea of helping anyone do that.

    But it's possible you could want this for some other reason, so you may as well look at HTTP and FTP clients, where you should find some answers about getting contents of web pages and parsing the results. From there you could either look for 'mailto' or use some inexact regex to find things which look like email addresses.

Re: Perl Programs that can retrieve email addresses from web pages
by strredwolf (Chaplain) on Jan 03, 2001 at 20:44 UTC
    For the life of your job and your company, you DON'T want to do this.

    Gathering e-mails off of a page is tantamount to creating "spamware", and there's a ton of that already going around, causing ISPs to kick the creators off the net. It's just not tolerated. It'll get you blackholed and blackballed for life.

    Instead, go to Mail Abuse Prevention Services and also Abuse.net. Also drop by news.admin.net-abuse.email on Usenet and ask for pointers on generating a confirmed opt-in mailing list.

    Update: If you want to see how badly it affects the network, then read about how two guys were sentenced to two years in jail for causing near-havoc with spam. AOL, AT&T, Mindspring, etc. were all affected.

    --
    $Stalag99{"URL"}="http://stalag99.keenspace.com";

      Guess next time I ask a question I'll do it anonymously. This is what a client asked us to do for people who ordered something from their web site, and only those people. Thanks for the, um, suggestions on what to do and not to do. I probably should've mentioned before that this was for a client, but I was short on time. Thanks again. I'll let my company know the response I got, and we'll talk to the client about how to approach this and see if they still want to do it.
        This still sounds like it might be a clunky solution to whatever the root problem is. What I would have liked to see in the original question is more specificity about the requirements. Because most of the information was actually in the title of the question, we were left to assume the details of the requirements for ourselves.

        If someone is ordering something via your client's website, you do not need to use any page scrapers or anything else. Simply add a field for e-mail address and an opt-out checkbox to a form during the order process. Salvadors' link below looks to be a good way to validate the email addresses. Store the valid addresses in your customer records along with the other customer data.
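
A minimal sketch of that order-form step, assuming form fields named `email` and `opt_out` (both names are hypothetical) and a deliberately loose syntax check standing in for real validation. It is an illustration of the idea, not the client's actual code.

```perl
use strict;
use warnings;

# Loose plausibility check: something@something.something, no spaces.
# This is an assumption for the sketch, not full RFC 822 validation.
sub looks_like_email {
    my ($addr) = @_;
    return $addr =~ /\A[^\s\@]+\@[^\s\@]+\.[^\s\@]+\z/ ? 1 : 0;
}

# Turn submitted form fields into a customer record, or return nothing
# if the address is not usable.
sub customer_record {
    my (%form) = @_;
    my $email = $form{email} // '';
    $email =~ s/^\s+|\s+$//g;              # trim stray whitespace
    return unless looks_like_email($email);
    return {
        email        => $email,
        # Hypothetical opt_out checkbox: ticked means "do not mail me".
        mailing_list => $form{opt_out} ? 0 : 1,
    };
}

my $record = customer_record(email => ' buyer@example.com ', opt_out => 1);
print "stored $record->{email}, mailing list: $record->{mailing_list}\n"
    if $record;
```

The record would be stored next to the other customer data; only the customer's own submitted address is ever touched.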

        Ahh, to the meat of the problem. Definitely, scanning is bad.

        However, if you're handling the ordering for the client, and they want an e-mail list, you may want to put in code to confirm those e-mails for the list itself. I have some code which I can post (it's in Perl) which can do that. I just have to remove some proprietary items and make it generic.

        Also, always include the option for signing up for the list, and default it to "No, I don't want to sign up". You'll get far fewer complaints that way.

        MAPS pushes these ideas, but they've been formed out of consensus by many sysadmins while being flooded with mail during the CyberPromo "era."

        --
        $Stalag99{"URL"}="http://stalag99.keenspace.com";
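
The confirmation step described above could be sketched roughly as follows. This is not the poster's code (which was never shown); it assumes a keyed hash from the core Digest::SHA module as the confirmation token, and the secret, the function names, and the `%subscribers` hash are all hypothetical.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical secret; in practice load it from configuration, not source.
my $SECRET = 'replace-with-a-long-random-secret';

# Token that goes into the confirmation link mailed to the address.
sub confirmation_token {
    my ($email) = @_;
    return hmac_sha256_hex(lc $email, $SECRET);
}

# The address joins the list only if the token from the clicked link matches.
sub confirm_subscription {
    my ($email, $token) = @_;
    return 0 unless defined $email && defined $token;
    return $token eq confirmation_token($email) ? 1 : 0;
}

my %subscribers;                            # stand-in for real storage
my $email = 'customer@example.com';
my $token = confirmation_token($email);     # this is what gets mailed out

# Later, when the customer follows the link in the message:
$subscribers{lc $email} = 1 if confirm_subscription($email, $token);
```

Nobody is added until they act on a message sent to their own address, so the list defaults to "no" for everyone who does nothing.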

(Corion) Getting active (Re: Perl Programs that can retrieve email addresses from web pages)
by Corion (Patriarch) on Jan 03, 2001 at 23:15 UTC

    Of course, for all those who run a website with CGI programs available, there is Sugarplum, a spam poisoner that creates fake webs full of generated (and false) email links. This doesn't, of course, really combat spam, but at least it fills up the resources of the would-be spammer with worthless email addresses.

    Sugarplum is written in Perl, of course :-)

Re: Perl Programs that can retrieve email addresses from web pages
by ichimunki (Priest) on Jan 03, 2001 at 20:24 UTC
    You might start with HTML::Parser or HTML::TokeParser (the latter is a "simple" version of the former).

    Grabbing all <a href="mailto:*"> type flags is the only "ethical" way to grab emails from the web-- and even then, email addresses often show up in such tags without the explicit consent of the owner of the email address. Make sure to validate them, since many of us hate getting email at autostripped addresses and butcher our tags accordingly. My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.

    Update: I want to second davorg's sentiment below. Only once, a very long time ago, did I ever receive an email from a web bot that was acceptable-- it told me about some broken links on my page... and then, of course, tried to sell me something. Which got the sender into the killfile pretty quickly.
      My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.

      My suggestion would be not to harvest email addresses from web pages at all as people should be entitled to put email addresses on a web page without worrying about being attacked by spammers.

      If you want to get email addresses then ask visitors to your website to register - but allow them to opt out of spamming.

      --
      <http://www.dave.org.uk>

      "Perl makes the fun jobs fun
      and the boring jobs bearable" - me

      I have to disagree with the implication that if an email address isn't mauled, it is ethical to grab it from a mailto. In practice, giving out your email address may be akin to begging for junk mail, but in theory (and ethics), I think that's different from requesting junk email.

      Update: strredwolf, exactly what I meant. I have my address on my site because I want people to be able to use it, not because I want junk mail.

      Update: ichimunki, I think we do agree. I was writing my comment at the same time as Dave wrote his, and his sums up my point well enough.

        Soliciting comments is one thing. Getting junk mail which burns up time, money, and bandwidth is another. I get too many junk mails in relation to the "Hey! Good artwork!" or "Have you tried this technique?" or "I want to commission you!" e-mails. Are they being drowned out?

        --
        $Stalag99{"URL"}="http://stalag99.keenspace.com";

        I don't really want to get into an ethics debate, and I thought I was pretty clear about what I thought of using harvested emails for the purpose of spam. Soliciting is soliciting, whether electronic, by phone, snail mail, or door-to-door. I am not interested in using the "ethics" club to bludgeon free speech, whether it's opinions I don't like or offers to buy more crap. I should be able to request in any medium that I not be contacted again, once that initial solicitation has been made.

        I also think sending anonymous spam should be a felony-- I put it on the same level as cracking passwords without permission-- attempts to subvert systems for unauthorized use. Other than mauling my email address to inhibit simple (or even the new-and-improved) harvesting, I cannot think of a single way to post information in public, and not expect the public to use that information if they want. Does robots.txt have an email solicitation "opt-in" flag?
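
For readers unfamiliar with "mauling", here is a small sketch of one common style, assuming the "user at example dot com" spelling. The thread does not say which scheme any poster actually uses, and the function name is hypothetical.

```perl
use strict;
use warnings;

# Rewrite an address so a human can still read it but a naive
# pattern-matching harvester will not see a valid address.
sub maul_address {
    my ($addr) = @_;
    (my $mauled = $addr) =~ s/\@/ at /;    # replace the single @
    $mauled =~ s/\./ dot /g;               # spell out every dot
    return $mauled;
}

print maul_address('me@example.com'), "\n";   # prints "me at example dot com"
```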
Re: Perl Programs that can retrieve email addresses from web pages
by clemburg (Curate) on Jan 03, 2001 at 20:51 UTC

    You might want to take a look at Mastering Regular Expressions, page 316, for a regex that matches a valid email address (grin).

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

      Thanks to Abigail and Damian we actually do have a way of validating an email address these days: RFC::RFC822::Address...

      Tony
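
A hedged sketch of how that module might be used, assuming its documented `valid` function. RFC::RFC822::Address is a CPAN module rather than part of core Perl, so the sketch loads it at run time and falls back to a much looser pattern (an assumption of this example, not RFC 822) when it is not installed. The `address_ok` wrapper is a hypothetical name.

```perl
use strict;
use warnings;

# Try to load the CPAN module; remember whether that worked.
my $have_rfc822 = eval { require RFC::RFC822::Address; 1 } ? 1 : 0;

sub address_ok {
    my ($addr) = @_;
    return 0 unless defined $addr;
    if ($have_rfc822) {
        # Full grammar-based check supplied by the module.
        return RFC::RFC822::Address::valid($addr) ? 1 : 0;
    }
    # Fallback: loose plausibility check only.
    return $addr =~ /\A[^\s\@]+\@[^\s\@]+\.[^\s\@]+\z/ ? 1 : 0;
}

for my $candidate ('abigail@example.com', 'not an address') {
    printf "%-22s %s\n", $candidate, address_ok($candidate) ? 'ok' : 'rejected';
}
```

Note that syntactic validity says nothing about whether the mailbox exists or whether its owner wants mail; it only filters out strings that cannot be addresses at all.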

Re: Perl Programs that can retrieve email addresses from web pages
by boo_radley (Parson) on Jan 03, 2001 at 20:21 UTC
    Update *deletia* yeah, fine, so it's probably for spam... sue me for being interested in a problem.
