PerlMonks  

proof of concept how to run this code

by perl_lover_girl
on Aug 24, 2006 at 19:39 UTC (#569435, perlquestion)
perl_lover_girl has asked for the wisdom of the Perl Monks concerning the following question:

G'day jdtoronto, hello cowboy, and hello to my dear Perl folks,

Many, many thanks for the replies.

I fixed the code and made up my mind. Here is what I am trying to accomplish.

First of all: I am happy to hear from you! This is probably one of the best places to ask such questions, so I am asking now.

First off, I have to explain something: I have to grab some data out of a phpBB forum in order to do some field research. I need the data from a forum that is run by a user community, so that I can analyze the discussions.

To give an example, let us take the forum below. How can I grab all the data out of this forum, get it local, and afterwards put it into the local database of a phpBB forum? Is this possible?

http://www.nukeforums.com/forums/viewforum.php?f=17

Nothing harmful, nothing bad, nothing serious or dangerous. But the point is that I have to get the data. So what now?

I need the data in an almost full and complete format. So I need all the data, like:
username
forum
thread
topic
text of the posting, and so on.

How do I do that?

I need some kind of grabbing tool. Can I do it with that kind of tool? And how do I solve the issue of storing the data in the local MySQL database?

Well, you see, that is tricky work, and I am pretty sure that I will get help here. For any and all help I am very, very thankful.

Many, many thanks in advance.

I am testing some code; this is a proof of concept. Please bear with me, as this is just a Perl snippet. Can you help me? The question is: if I apply this to another forum, can I get detailed results? Thanks for any and all help.

Cheers

And here is a code example that was run against the forum.

#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.phpBBhacks.com/forums/viewforum.php?f=17";
my $ua = LWP::RobotUA->new;
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content)
            or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}

sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a','span')) {
        if (exists $tag->[1]{'class'}) {
            if ($tag->[0] eq 'span') {
                if ($tag->[1]{'class'} eq 'name') {
                    $name = $p->get_trimmed_text('/span');
                } elsif ($tag->[1]{'class'} eq 'postbody') {
                    my $post = $p->get_trimmed_text('/span');
                    push @thread, {'name'=>$name, 'post'=>$post};
                }
            } else {
                if ($tag->[1]{'class'} eq 'maintitle') {
                    $title = $p->get_trimmed_text('/a');
                }
            }
        }
    }
    return {'title'=>$title, 'thread'=>\@thread};
}

sub get_threads {
    my $page = shift;
    my $r = $ua->request(HTTP::Request->new(GET => $url), sub {$lp->parse($_[0])});
    # Expand URLs to absolute ones
    my $base = $r->base;
    return [map { $_ = url($_, $base)->abs; } @links];
}

sub wanted_links {
    my($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    push @links, values %attr;
}

Output:

$VAR1 = {
    'thread' => [
        {
            'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there',
            'name' => 'sail'
        },
        {
            'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
            'name' => 'sail'
        },
        {
            'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there thx',
            'name' => 'sail'
        },
        {
            'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
            'name' => 'sail'
        }
    ],
    'title' => 'Recent Forum Posts Module'
};
Hmm, I want to grab data out of a forum (for my studies):

http://www.nukeforums.com/forums/viewforum.php?f=17

This is really preliminary. It just grabs the basic text from the threads and doesn't handle the quoted text right yet. Hmm, would this be hard to fix? There are many parsing approaches that can be taken in Perl.

We obviously also have to set up a database to capture the information I want to store.
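A minimal sketch of the step just before the database insert, assuming the hash layout that Data::Dumper printed above (a title plus a list of name/post pairs). The helper name thread_to_rows, the sample data and the posts table named in the comments are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten one get_thread() result into rows of
# (forum_id, title, position, name, post), ready for a parameterised INSERT.
sub thread_to_rows {
    my ($forum_id, $thread) = @_;
    my @rows;
    my $position = 0;
    for my $entry (@{ $thread->{thread} }) {
        push @rows, [ $forum_id, $thread->{title}, ++$position,
                      $entry->{name}, $entry->{post} ];
    }
    return \@rows;
}

# Sample data in the same shape as the Dumper output (shortened).
my $sample = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'mopho', post => 'Thanks. i found what I was looking for.' },
        { name => 'sail',  post => 'hi there thx' },
    ],
};

my $rows = thread_to_rows(17, $sample);
printf "%d | %s | #%d | %s\n", @{$_}[0 .. 3] for @$rows;

# With DBI, each row could then be bound to placeholders, for example:
#   my $sth = $dbh->prepare(
#       'INSERT INTO posts (forum_id, title, position, name, post)
#        VALUES (?, ?, ?, ?, ?)');
#   $sth->execute(@$_) for @$rows;
```

Keeping the flattening separate from the insert means the row layout can be checked on its own before any database is involved.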

Additionally, this script only loops over the first index page. To cover the whole forum, a loop has to be set up to grab each of the index pages.
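One way such a loop over the index pages could look, as a sketch only. It assumes phpBB 2 style paging, where each index page is selected with a start offset, and it assumes 50 topics per page and 120 topics in total; both numbers and the helper name index_page_urls are made up for illustration and would have to be checked against the real forum.

```perl
use strict;
use warnings;

# Hypothetical helper: build the URL of every index page of one forum.
# Assumes each index page is addressed as viewforum.php?f=ID&start=OFFSET.
sub index_page_urls {
    my ($base, $forum_id, $topic_count, $per_page) = @_;
    my @urls;
    for (my $start = 0; $start < $topic_count; $start += $per_page) {
        push @urls, "$base/viewforum.php?f=$forum_id&start=$start";
    }
    return @urls;
}

# 120 topics at 50 per page gives the offsets 0, 50 and 100.
my @pages = index_page_urls('http://www.nukeforums.com/forums', 17, 120, 50);
print "$_\n" for @pages;
```

Each of these URLs could then be handed to get_threads in turn, instead of only the first index page.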

Well, jdtoronto and cowboy, I am a true Perl newbie and I need your help. What about the complete parsing and harvesting of both of these forums?
http://www.nukeforums.com/forums/viewforum.php?f=17
http://www.nukeforums.com/forums/viewforum.php?f=3


I look forward to hearing from you both, jdtoronto and cowboy.

Re: proof of concept how to run this code
by jdtoronto (Prior) on Aug 24, 2006 at 20:06 UTC
    Couple of problems:

    • You have repeated the code twice in your post; can you clean that up, please?
    • What are you trying to do? This looks like some sort of forum-ripping robot.
    • As submitted, the initialisation of LWP::RobotUA with the new method seems to go into la-la land and just get lost, but then the invocation of the method does not appear to be in accordance with the documentation.
    What testing of this have you done so far? Where were your problems?
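For reference, a sketch of a constructor call that follows the LWP::RobotUA documentation, which requires both an agent name and a contact address. The agent name and e-mail address below are placeholders (assumptions), and the half-minute delay is only an example value.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# The agent name and contact address are placeholders (assumptions);
# replace them with values that identify the real robot and its operator.
my $ua = LWP::RobotUA->new(
    agent => 'forum-study-bot/0.1',
    from  => 'researcher@example.com',
);

# delay() is given in minutes: wait at least 30 seconds between
# requests to the same server.
$ua->delay(0.5);

print 'agent: ', $ua->agent, "\n";
print 'delay: ', $ua->delay, " minute(s)\n";
```

Called without these arguments, as in the posted script, the constructor has nothing to identify the robot with and refuses to build the object.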

    jdtoronto

Re: proof of concept how to run this code
by cowboy (Friar) on Aug 24, 2006 at 20:48 UTC

    I'd suggest reading How (Not) To Ask A Question. There isn't much we can do to help unless you explain what it is you are trying to accomplish, rather than just throwing a pile of code at us.

      Hi cowboy, many, many thanks for your input.

      I appreciate your idea and any and all help.

      I have fixed my message and corrected all the mess.

      Now I look forward to your help.

      My question is: can I do the grabbing and harvesting of the whole forums here

      http://www.nukeforums.com/forums/viewforum.php?f=17
      http://www.nukeforums.com/forums/viewforum.php?f=3

      with the script that is shown above? Is this doable?

      Many, many thanks for all your help.

      Thanks,

      perl_lover_girl
Re: proof of concept how to run this code
by starbolin (Hermit) on Aug 24, 2006 at 22:53 UTC

    Do you have permission from the forum owner to do what you are attempting to do? Most forum owners would not appreciate the kind of load your script would put on their system. Many forum owners have filters in place to prevent robots like these. You would most likely get your account revoked and a nasty curse placed on your children.

    If you can convince the forum owners of a legitimate need they can probably give you direct access to their database. This puts a lesser load on their servers and offers you more options for searching/retrieving.

    I am sorry to say that your code is a mess. I don't know where to start. I think that you are biting off too much at once. I would start with some simpler tasks, like just grabbing one page and printing it. You should also write separate little programs to test your subroutines before using them, so that you know whether the parameters are passed correctly.
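A small stand-alone test program of that kind might look like the sketch below. It pulls the href check out of wanted_links into a function of its own and exercises it with Test::More; the function name is_topic_link and the sample links are assumptions made for illustration.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Hypothetical stand-alone version of the href check from wanted_links:
# returns true only for relative links to a topic page.
sub is_topic_link {
    my ($href) = @_;
    return 0 unless defined($href);
    return $href =~ /^viewtopic\.php\?t=/ ? 1 : 0;
}

ok(  is_topic_link('viewtopic.php?t=1234'), 'topic link is accepted' );
ok( !is_topic_link('viewforum.php?f=17'),   'forum index link is rejected' );
ok( !is_topic_link(undef),                  'missing href is rejected' );
```

Because the check needs no network access, it can be run and corrected long before the robot is pointed at a real forum.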

    The text you posted here contained double breaks which made the post display funny. Use paragraphs <p> instead.

    In spite of the rain I put on your parade we appreciate your participation in our site and hope you continue to visit.



    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
      Hello again starbolin, many, many thanks for the reply. Since I am new to (1) Perl and (2) this board, I will need some help and time to get involved with both. Here is the code, and my question: can I apply this code to the whole board, in order to get a "copy" of the board with category 17 and category 3?

      http://www.nukeforums.com/forums/viewforum.php?f=17
      http://www.nukeforums.com/forums/viewforum.php?f=3

      Thanks for the help and the reply!

      Regards, perl lover girl (the girl that loves Perl so much and so hard; it is so sweet to get into this ;-) you cannot imagine)
      #!/usr/bin/perl
      use strict;
      use warnings;

      use LWP::RobotUA;
      use HTML::LinkExtor;
      use HTML::TokeParser;
      use URI::URL;

      use Data::Dumper; # for show and troubleshooting

      my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
      my $ua = LWP::RobotUA->new;
      my $lp = HTML::LinkExtor->new(\&wanted_links);

      my @links;
      get_threads($url);

      foreach my $page (@links) { # this loops over each link collected from the index
          my $r = $ua->get($page);
          if ($r->is_success) {
              my $stream = HTML::TokeParser->new(\$r->content)
                  or die "Parse error in $page: $!";
              # just printing what was collected
              print Dumper get_thread($stream);
              # would instead have database insert statement at this point
          } else {
              warn $r->status_line;
          }
      }

      sub get_thread {
          my $p = shift;
          my ($title, $name, @thread);
          while (my $tag = $p->get_tag('a','span')) {
              if (exists $tag->[1]{'class'}) {
                  if ($tag->[0] eq 'span') {
                      if ($tag->[1]{'class'} eq 'name') {
                          $name = $p->get_trimmed_text('/span');
                      } elsif ($tag->[1]{'class'} eq 'postbody') {
                          my $post = $p->get_trimmed_text('/span');
                          push @thread, {'name'=>$name, 'post'=>$post};
                      }
                  } else {
                      if ($tag->[1]{'class'} eq 'maintitle') {
                          $title = $p->get_trimmed_text('/a');
                      }
                  }
              }
          }
          return {'title'=>$title, 'thread'=>\@thread};
      }

      sub get_threads {
          my $page = shift;
          my $r = $ua->request(HTTP::Request->new(GET => $url), sub {$lp->parse($_[0])});
          # Expand URLs to absolute ones
          my $base = $r->base;
          return [map { $_ = url($_, $base)->abs; } @links];
      }

      sub wanted_links {
          my($tag, %attr) = @_;
          return unless exists $attr{'href'};
          return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
          push @links, values %attr;
      }
      Again, my question is: can I apply the code to part of the board, in order to get a "copy" of the board with category 17 and category 3?

      http://www.nukeforums.com/forums/viewforum.php?f=17
      http://www.nukeforums.com/forums/viewforum.php?f=3

      I look forward to hearing from you.

      Regards
        The minimal change consists of changing

        my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
        my $ua = LWP::RobotUA->new;
        my $lp = HTML::LinkExtor->new(\&wanted_links);

        my @links;
        get_threads($url);

        foreach my $page (@links) {
            ...
        }

        to

        my $ua = LWP::RobotUA->new;
        my $lp = HTML::LinkExtor->new(\&wanted_links);

        my @links;

        foreach my $forum_id (17, 3) {
            # $url is now scoped to this loop, so get_threads must read
            # its argument ($page) instead of the outer $url
            my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";

            @links = (); # yuck!
            my $links = get_threads($url);

            foreach my $page (@$links) {
                ...
            }
        }

        As you can see, I don't like your use of the global variable @links. We're forced to provide and initialize a variable that should be local to get_threads. Here's the fix:

        #!/usr/bin/perl
        use strict;
        use warnings;

        use LWP::RobotUA;
        use HTML::LinkExtor;
        use HTML::TokeParser;
        use URI::URL;

        use Data::Dumper; # for show and troubleshooting

        # NOTE: per its documentation, LWP::RobotUA->new also needs an agent
        # name and a from address; they are left out here as in the original.
        my $ua = LWP::RobotUA->new();

        foreach my $forum_id (17, 3) {
            my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";

            my $links = get_threads($url);

            foreach my $page (@$links) {
                ...
            }
        }

        sub get_thread {
            ...
        }

        sub get_threads {
            my $page = shift;

            my @links;

            my $lp = HTML::LinkExtor->new(sub {
                my($tag, %attr) = @_;
                return unless exists $attr{'href'};
                return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
                # keep only the href, not every attribute value of the tag
                push @links, $attr{'href'};
            });

            # fetch the page passed in; $url is not in scope inside this sub
            my $request = HTTP::Request->new(GET => $page);
            my $response = $ua->request($request, sub {$lp->parse($_[0])});

            # Expand URLs to absolute ones
            my $base = $response->base;
            return [ map { url($_, $base)->abs } @links ];
        }

        Update: Added the minimal change.

        Edited by planetscape - Reparented from Reaped: LWP & HTML::LinkExtor running recursively against a bulletin board to Re^3: proof of concept how to run this code

Re: proof of concept how to run this code
by skx (Parson) on Aug 25, 2006 at 09:43 UTC

    You don't seem to have answered the questions asked by the other monks above, so I'm not sure whether you actually have permission to do this or not.

    If you have permission then it seems that your goal of inserting all the thread data into a database is pointless - it is already in a database on the server!

    So, assuming you have permission to copy the details, etc, then you should simply ask the host to export the database for you.

    If you don't have permission then spidering the site is probably your best option - if you do it slowly and carefully.

    However, down that path lies madness...

    Steve
    --
      Hi there. Since I have no permission, I have to go down the path to madness. I am counting on you, folks; I need your help, otherwise I really will go mad. Would you do the job with regular expressions or with the script above? Thanks for an answer. Regards, the perl lover girl (who loves Perl so hard)
        BTW, please take another forum. This forum drives me mad, it really does, and I have other things that make me mad as well. Yours, lovergirl
Re: proof of concept how to run this code
by mantra2006 (Hermit) on Aug 25, 2006 at 17:49 UTC
    Hello,

    I have gone through your question and your follow-ups, and it looks to me as though you are expecting somebody to write the Perl code for you.

    Programming doesn't work like that. This is where the software development life cycle comes into the picture.

    You have to put down requirements and analyse them, and once the requirements are confirmed you design the project based on them. Then comes the coding part, where whoever is developing decides whether to use Perl or something else, followed by testing, implementation and maintenance.

    Also, please specify at which point you get errors when you run this code, and what the errors are.

    It is very hard to read the code and answer "will it work if I run it?" types of questions.


    Sridhar

    Edited by planetscape - removed unnecessary br tags and replaced with p tags
