Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

collect data from web pages and insert into mysql

by SteinerKD (Acolyte)
on Jul 30, 2010 at 15:14 UTC ( #852103=perlquestion: print w/replies, xml ) Need Help??

SteinerKD has asked for the wisdom of the Perl Monks concerning the following question:

First, sorry for the long post and clueless nature.

I have set myself a task to create a script that can collect data from web pages and insert them into a MySQL database. I'm a complete noob at this though and not even sure what language I need (to learn), but think perl might be it. What I ask now is not for you to tell me whow to do it, only if it's feasible or if I'm barking up the wrong tree (pointers on where to find relevant information is wellcome though.

First step would be to export a list of pids to be processed, each paired with the last sid processed for the pid.
The script would read the list and set the first pid in list as current.
Next step would be for it to add current pid to a URL and load that page containing a list.
From this page a list of sids needs to be collected untill I hit the "last processed" one, these might be spread over several pages so it need to keep going either until it finds "last processed" or there's no further pages to load (a fail I guess).

Next is the new sid list created in the previous step, each one need to be processed and data collected some basic data is collected frrom each sid and then 2 possible (but not always excistant) lists.
The basic data collected for the sid cotains two values to be set as variables, these decides how many data blocks needs to be collected lower down on the page.
Go to first type block, collect the data I want and repeat as many times as variable says.
Go to second type block and repeat.

Store the data collected from previous in a textfile named after pid, it should contain 4 sections of data to be inserted into 4 databases.

First section update the pid with new last processed.
Second section add sids with info to DB.
Third section add the data from type 1 blocks on sid pages to DB.
Fourth section section add the data from type 2 blocks on sid pages to DB.

Close the file, load next pid from list and repeat the process until pid list is empty.

A guess a bonus at the end would be if it could also insert all the data collected into the db as well.

Is this something perl would be suitable for or is there a better choise?
My system is Win 7 64bit btw, running MySQL 5.1 and strawberry perl 5.12.

TIA
Thomas
  • Comment on collect data from web pages and insert into mysql

Replies are listed 'Best First'.
Re: collect data from web pages and insert into mysql
by Your Mother (Archbishop) on Jul 30, 2010 at 16:21 UTC

    This can be done by a noob (if the site doesn't use JavaScript to load/present data). I'd recommend WWW::Mechanize over LWP(::UserAgent) because it is LWP::UserAgent under the hood and has much better and more browser like controls.

    Hurdles include: DB design which is easy to do badly if you've never done it and that will make everything much harder. ETL is pretty easy but only after you've done it several times. For a noob, even a technically gifted one, this is a project that would fill up a couple full time weeks at least.

    You'll probably get good advice here if you ask at each stage after you've tried to work something out for yourself. E.g.: I wrote perl-xyz to do stage 1 of project; is this a good way to do it?

      Thanks!

      It's nice and refreshing to be greeted in such a friendly and helpful way. I think I have Mechanize now (did the "cpan WWW::Mechanize" from prompt and lots of stuff happened ;) ).

      I guess the first part as training will be reading the pid list from a file, setting it as variable and then create a file with pid as name and the repeat untill the list is done.
      (more or less the start and end of the entire project).

        Hmm, actually managed (with the help from AWP) to create a valid URL (inserting pid and page number) for a sortie list page and have the source printed to screen.

        Looking at the resulting code it should be simple (I think) to create a list of sids (sortie pages) to process as they are all listed in the source as sid=XXXXXX (inside a string).

        Must say my head is spinning a bit though, this is a lot to take in (I mainly copied something I found and adapted it, not like I could write it from scratch myself).

        I guess next step would be to store the page as a temp file and figure out how to grab and save those sids.

        Thanks for encouragement and help!

Re: collect data from web pages and insert into mysql
by dasgar (Priest) on Jul 30, 2010 at 15:41 UTC

    I'm not sure that I understand the "pid" and "sid" stuff, but the answer is yes, this kind of task (retrieve a web page, parse for data, store the retrieved data into a DB) can be done with Perl. For the web pages, you can use a module like LWP to grab the contents. From there, you can either use regular expressions to manually parse the data or you might find an HTML parser module that may help out. As for the DB connection, you'll probably want to use DBD::mysql, which will allow you to query your MySQL DB as well as do inserts and updates.

    Since you're new to Perl, you'll definitely want to keep http://search.cpan.org handy. That's where you'll be able to search for other Perl modules.

    Welcome to the wonderful world of Perl!

      The pid and sid I mention as just 6 digit numbers identifying personas and sorties. What I'm going to do is collect stats for a squad I'm part of for a online game for starts and reward purposes.

      Each member have 3 different personas (pids) and those are inserted into URLs to get a page with their lists of sorties. Each line contain a link to a sortie page (url+sid, again sig digit number). Those pages contains the main data, some basic data and two lists with either kills or captures (stored in table row blocks)

      Does that makes things clearer?
      Thanks for the suggestions, I'll look those up, but will probably be quite a while before I get even close to anything useful.

        Now that you've mentioned going to one web page and following a link to another (I must have missed that point when I first read your post), I'd agree with "Your Mother"'s suggestion of using WWW::Mechanize.

        Also, since you mentioned that this is to gather player stats from an online game, you might want to search for that game's name at CPAN. There's a chance that someone may have written a module to do some/all of the work you're wanting to do. My philosophy is: why reinvent the wheel if you don't have to?

Re: collect data from web pages and insert into mysql
by wfsp (Abbot) on Jul 31, 2010 at 07:10 UTC
    Sounds like an interesting project and, from what you've told us, would be entirely suitable for Perl. The other monks have given you some good pointers.

    If you could show us some example input (short but typical) and what the resulting url should look like and the code you have so far for doing that we may be able to give you some pointers. If you arrange your example code so that monks can download it and run it you improve your chances of getting help.

    The usual health warnings apply: use strict and warnings, make sure it compiles and try to use short but meaningful variable names.

    Good luck!

      This is what I have so far, it kinda works but lack finesses and is pretty seriously flawed, but hey, it's only one day into my Perl adventure yet so I think I've done OK so far.

      use strict; use warnings; use LWP::Simple qw(get); use File::Slurp ; # pid = Persona ID, one of a players 3 identities. # sid = Sortie ID, identifier for a mission taken by the persona. # We want to crawl all sortie list pages and collect new sid's we # have't seen before and then move on to next persona. my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sortie +s.jsp'; my $pcnt = 1; my $pidfile = 'c:/scr/pidlist.txt'; # Open list of pid's and set first one as current pid. open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!"; my $pid = <PIDLIST>; chomp $pid; print $pid; # Grab and store sortie list pages for persona. while (1) { my $page = get "$pbase?page=$pcnt&pid=$pid"; # Store grabbed webpage into the file append_file( "c:/scr/$pid.txt", $page ) ; # Update page number and grab next. $pcnt += 1; }; # Close files close PIDLIST or die $!; print '\nDone!\n';

      Flaws in this is that the server will quite happily keep giving you empty sortie list pages so just updating the page count and hoping for a fail to exit doesn't work (resulting in a huge file).
      I want the loop to exit under either of two conditions, either the string "No more sorties" are found on the page (end of list) OR a sid string equal to the stored variable for the last one processed is reached. (sids are six digit strings that I need to collect from the collected pages).

      This code is using LWP, but suggestions was for Mechanize so I need to rewrite to use that instead.
      Also need to redo the load pid bit so it actually works it's way through the list of pids, it will also have to fetch two variables in pairs eventually (in addition to the pid the last processed sid).
      Tried using Slurp to open and read the pidlist file, but that didn't work out as planned.
      For some reason $pid isn't printed out as supposed any more.

      When that's achieved comes the tricky part of collecting the actual sortie pages and extracting the data I need from them.

      Any suggestions on good coding practices and habits to pick up s appreciated, might as well learn to do it right from start.

        That looks as though you are off to a good start.

        I have a couple of observations and some more questions.

        my $pid = <PIDLIST>;
        will read the first line from the file. To look at each line in turn loop over the file handle with a while loop.

        The docs for LWP::Simple show how to check if the get is successful. Always best to do that.

        First question, what does a line in the pidlist.txt file actually look like? Is there any processing to be done to extract any data you need?

        Secondly, could you show a (cut down) sample of what the downloaded content looks like. It may be that you can extract the data you need on the fly and write the whole lot out in one go rather than save each page to disk and parse it later.

        Lastly, it looks as though the site needs a login and uses cookies. I'm curious to know how you managed to download any pages.

        Looking good though!

        This is some code illustrating the points above.

        #! /usr/bin/perl use strict; use warnings; use LWP::Simple qw(get); use File::Slurp ; # pid = Persona ID, one of a players 3 identities. # sid = Sortie ID, identifier for a mission taken by the persona. # We want to crawl all sortie list pages and collect new sid's we # have't seen before and then move on to next persona. my $pbase = q{http://csr.wwiionline.com/scripts/services/persona/sorti +es.jsp}; my $pcnt = 1; my $pidfile = q{c:/scr/pidlist.txt}; # Open list of pid's open my $pidlist, q{<}, $pidfile or die qq{Could not open $pidfile: $! +}; # loop over list of pids one at a time while (my $pid = <$pidlist>){ print $pid; chomp $pid; # what does a line from the pidlist look like? # do need to do any work on it? while (1) { my $url = qq{$pbase?page=$pcnt&pid=$pid}; my $content = get($url); die qq{get failed for $url: $!} unless $content; # parse $content # extract data we need # test if there is more work to do # Update page number and grab next. $pcnt += 1; }; } # Close files close $pidlist or die $!; print qq{\nDone!\n};

        Wohoo, think I sorted the loop!

        while (1) { my $page = get "$pbase?page=$pcnt&pid=$pid"; last if $page =~/No sorties/; # Store grabbed webpage into the file append_file( "c:/scr/$pid.txt", $page ) ; last if $page =~/"sid=$lproc"/; # Update page number and grab next. $pcnt++; };

        I'm sure the whole thing can be made prettier and more efficient, but now it seems to work as intended, grabbing the first set of data I need.

        Now I need to sort the bit where it does the same for all pids in the pidlist, preferably at the same time adding so it extracts both the pid and lproc data from the settings file.

      Wfsp asked for more info so I grabbed some screenies to better explain what I need to do

      First I need to export from a db a list of persona IDs (pid) to process, each paired with the the id of the last processed sortie (sid).

      Script will now read first pid and sid and load the first sortie list page for the persona and store the content and keep doing so untill it finds and empty page or one containing the last processed sortie.
      Image of sortie page.

      Now I need to process the stored code and extract all sids found that are new and make a list of that for processing.

      Next step, grab all the data I need from each sortie detail page in the list (URL constructed from the grabbed sids).
      Image of sortie details page with the data I need. (Link fixed)
      Store the data in a way that can be imported into a DB. When last sid is done we're also done with the persona so load next pid/lproc and start over (save the highest numbered sid to update the lastproc stat).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://852103]
Approved by salva
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (2)
As of 2021-10-23 12:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My first memorable Perl project was:







    Results (88 votes). Check out past polls.

    Notices?