PerlMonks  

Scraping PerlMonks

by Juerd (Abbot)
on Apr 28, 2003 at 20:45 UTC ( #253834=CUFP )

My signature had a line in <pre> that was very long and messing up the PM layout for some. Good excuse to use WWW::Mechanize again :)

#!/usr/bin/perl -wl
use strict;
use Carp;
use WWW::Mechanize;
use HTML::TreeBuilder;

{ my $i; sub x { shift() ? print ++$i : croak } }

my $sig = q[
Juerd
# { site => '<a href="http://juerd.nl/" target="_blank"><font color="#800000">juerd.nl</font></a>', plp_site => '<a href="http://plp.juerd.nl/" target="_blank"><font color="#800000">plp.juerd.nl</font></a>', do_not_use => '<a href="mailto:spamcollector_perlmonks@juerd.nl" target="_blank"><font color="#800000">spamtrap</font></a>' }
];

x my $agent = WWW::Mechanize->new;
x $agent->get('http://perlmonks.org/index.pl?node=login');
x $agent->submit_form(
    form_number => 2,
    fields      => {
        user   => 'Juerd',
        passwd => '',  # Guess =)
    },
    button      => 'sexisgood',
);
x $agent->get('http://perlmonks.org/index.pl?node_id=6364&user=Juerd');

my $tree = HTML::TreeBuilder->new_from_content($agent->content);
for ($tree->look_down(id => 'writeups')->look_down(_tag => 'a')) {
    x my $href = $_->attr('href');
    x my ($node_id) = $href =~ /node_id=(\d+)/;
    next if $node_id < 253475;  # 253475 was the first node with the new sig
    x $agent->get($href);
    x $agent->form_number(2);
    my $post = eval {
        $agent->current_form->value('note_doctext');
    } or next;
    $post =~ s[<pre>(.*?)</pre>][$sig]s;
    x $agent->submit_form(
        form_number => 2,
        fields      => { note_doctext => $post },
        button      => 'sexisgood',
    );
}

There were fewer than 50 nodes to be fixed, so I didn't have to fetch other pages. And this script only works for notes :)
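For readers who want to see the substitution step on its own, here is a minimal sketch using only core Perl. The `$sig` and `$post` values are made-up stand-ins, not the real signature or a real note body; only the `s[<pre>(.*?)</pre>][...]s` pattern comes from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical replacement signature and note body, for illustration only.
my $sig  = 'Juerd # short sig';
my $post = "Some reply text.\n<pre>a very\nlong signature line</pre>\nTrailing text.";

# Non-greedy match, so only the first <pre>...</pre> pair is consumed;
# the /s modifier lets "." match the newline inside the block.
(my $fixed = $post) =~ s[<pre>(.*?)</pre>][$sig]s;

print $fixed, "\n";
```

Since there is no /g modifier, only the first <pre> block in the text is replaced. A note that has a <pre> block of its own ahead of the signature would presumably lose that block instead, so it is worth checking a few nodes by hand.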

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Scraping PerlMonks
by AssFace (Pilgrim) on Apr 29, 2003 at 13:14 UTC
    Heh heh - I never bothered to look that closely at the source. "sexisgood" for the comment submit button and "sexisgreat" for the vote button.
    I was kind of expecting to see something else that then said "butheroinisbetter" or s/heroin/perl/

    Now I'm curious if there are other spots of humor in the source here... or just swearing, which depending on one's mood is arguably funny as well.

    Oh - and nice job on the sig update script as well :)


    -------------------------------------------------------------------
    There are some odd things afoot now, in the Villa Straylight.
Re: Scraping PerlMonks
by Mr_Person (Hermit) on Apr 30, 2003 at 17:25 UTC
    I don't think I understand the reason for x sub. Is it just to keep track of how many operations the script does? Also, do you get many emails to your spamtrap? I've always wondered how effective those were.

      I don't think I understand the reason for x sub. Is it just to keep track of how many operations the script does?

      The "x" sub has two purposes:

      1. I don't want to use "or die" with every command, so I want an easy-to-type command to do these assertions for me (have a look at Carp::Assert - the difference here is that my "x" calls shouldn't be left out).
      2. I want some output to see if the script does anything and I don't want a lot of print commands. The counter has no real purpose.
      It's just laziness :)
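The two purposes listed above can be tried out in isolation with a small sketch like the following. The `step_ok` and `step_fail` subs are hypothetical stand-ins for calls such as `$agent->get(...)`, and the explicit "\n" and croak message are additions here because this sketch does not run under -l.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Carp;

# Same shape as the helper in the script: a counter private to the sub,
# printed on success, with croak() when the checked value is false.
{
    my $i = 0;
    sub x { shift() ? print(++$i, "\n") : croak "step failed" }
}

# Hypothetical stand-ins for calls that return true or false.
sub step_ok   { return 1 }
sub step_fail { return 0 }

x step_ok();               # prints 1
x my $value = step_ok();   # prints 2; the assigned value is what gets tested

# A false value makes x croak; eval catches it so the sketch keeps running.
my $ok = eval { x step_fail(); 1 };
print "third step died: $@" unless $ok;
```

Note that x has to be defined before its first use for the parenthesis-free calls to parse as function calls, which is why the block sits at the top of the script.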

      Also, do you get many emails to your spamtrap? I've always wondered how effective those were.

      On spamcollector_perlmonks@juerd.nl, I get 5 or 6 messages per day. For some reason, PerlMonks spammers are smarter than others: over 15% of all messages to spamcollector_perlmonks@juerd.nl were not flagged by SpamAssassin, while other than that I hardly ever have false negatives!

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Node Type: CUFP [id://253834]
Approved by Corion