Collect data from web pages and insert into MySQL
by SteinerKD (Acolyte) on Jul 30, 2010 at 15:14 UTC
SteinerKD has asked for the wisdom of the Perl Monks concerning the following question:
First, sorry for the long post and its clueless nature.
I have set myself the task of creating a script that can collect data from web pages and insert it into a MySQL database. I'm a complete noob at this, though, and not even sure what language I need (to learn), but I think Perl might be it. What I ask now is not for you to tell me how to do it, only whether it's feasible or whether I'm barking up the wrong tree (pointers on where to find relevant information are welcome, though).
First step would be to export a list of pids to be processed, each paired with the last sid processed for the pid.
The script would read the list and set the first pid in list as current.
Next step would be for it to add current pid to a URL and load that page containing a list.
From this page a list of sids needs to be collected until I hit the "last processed" one. These might be spread over several pages, so it needs to keep going either until it finds "last processed" or until there are no further pages to load (a fail, I guess).
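As a rough illustration of the paging loop just described, here is a minimal sketch in Python (chosen only because it is compact; the same shape should carry over to Perl). Everything named here is an assumption made for the example: `collect_new_sids`, the `fetch_page` callback, and the fake page data stand in for the real page loading and parsing, which depend on a site layout not given in this post.

```python
def collect_new_sids(pid, last_sid, fetch_page):
    """Collect sids for pid, page by page, until last_sid is reached.

    Returns (new_sids, found), where found is False if the pages ran out
    before last_sid was seen (the "fail" case).
    """
    new_sids = []
    page_no = 1
    while True:
        # Hypothetical callback: returns the sids on one page, or an empty
        # list when there is no such page.
        sids = fetch_page(pid, page_no)
        if not sids:
            return new_sids, False
        for sid in sids:
            if sid == last_sid:
                return new_sids, True
            new_sids.append(sid)
        page_no += 1


# Stand-in for real page loading: two made-up pages for pid 42.
FAKE_PAGES = {(42, 1): [905, 904, 903], (42, 2): [902, 901, 900]}


def fake_fetch(pid, page_no):
    return FAKE_PAGES.get((pid, page_no), [])


print(collect_new_sids(42, 902, fake_fetch))  # ([905, 904, 903], True)
```

Passing the page loader in as a function keeps the stopping logic separate from the HTTP and parsing code, so the loop can be tried out on fake data first.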
Next comes the new sid list created in the previous step. Each sid needs to be processed: some basic data is collected from each sid page, and then 2 possible (but not always existent) lists.
The basic data collected for the sid contains two values to be set as variables; these decide how many data blocks need to be collected lower down on the page.
Go to the first type of block, collect the data I want, and repeat as many times as the variable says.
Go to the second type of block and repeat.
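The "read two counts, then collect that many blocks of each type" idea could look roughly like the sketch below. The page format is entirely made up for the example (three `key=value` header lines followed by `T1` and `T2` lines), since the real pages are not shown here; `parse_sid_page` is likewise a hypothetical name.

```python
# Made-up stand-in for one sid page: a header with the two counts,
# then the type 1 blocks, then the type 2 blocks.
SAMPLE_PAGE = """\
sid=905
type1_count=2
type2_count=1
T1 alpha 10
T1 beta 20
T2 gamma 30
"""


def parse_sid_page(text):
    """Return (header, type1_blocks, type2_blocks) from the made-up format."""
    lines = iter(text.strip().splitlines())
    header = {}
    for _ in range(3):  # the three header lines
        key, value = next(lines).split("=", 1)
        header[key] = int(value)
    # Repeat as many times as each count says; a count of 0 gives an empty list.
    type1 = [next(lines).split()[1:] for _ in range(header["type1_count"])]
    type2 = [next(lines).split()[1:] for _ in range(header["type2_count"])]
    return header, type1, type2


header, type1, type2 = parse_sid_page(SAMPLE_PAGE)
print(header["sid"], type1, type2)
```

A count of zero simply yields an empty list, which covers the case where one of the two lists does not exist on a page.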
Store the data collected in the previous steps in a text file named after the pid; it should contain 4 sections of data to be inserted into 4 databases.
First section: update the pid with the new "last processed" sid.
Second section: add the sids with their info to the DB.
Third section: add the data from the type 1 blocks on the sid pages to the DB.
Fourth section: add the data from the type 2 blocks on the sid pages to the DB.
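One possible shape for such a four-section file is sketched below. The layout (a `[name]` line per section followed by tab-separated rows), the function name `write_pid_file`, and the sample rows are all assumptions for illustration; nothing in this post fixes the file format.

```python
import os
import tempfile


def write_pid_file(directory, pid, last_sid, sids, type1_rows, type2_rows):
    """Write one text file named after the pid, with four labeled sections."""
    path = os.path.join(directory, f"{pid}.txt")
    sections = [
        ("pid", [(pid, last_sid)]),  # section 1: new "last processed" sid
        ("sids", sids),              # section 2: sids with their info
        ("type1", type1_rows),       # section 3: type 1 block data
        ("type2", type2_rows),       # section 4: type 2 block data
    ]
    with open(path, "w", encoding="utf-8") as fh:
        for name, rows in sections:
            fh.write(f"[{name}]\n")
            for row in rows:
                fh.write("\t".join(str(value) for value in row) + "\n")
    return path


# Example run with made-up data, written to a temporary directory.
out_dir = tempfile.mkdtemp()
path = write_pid_file(
    out_dir, 42, 905,
    sids=[(905, "2010-07-30")],
    type1_rows=[(905, "alpha", 10)],
    type2_rows=[(905, "gamma", 30)],
)
with open(path, encoding="utf-8") as fh:
    content = fh.read()
print(content)
```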
Close the file, load next pid from list and repeat the process until pid list is empty.
I guess a bonus at the end would be if it could also insert all the collected data into the DB.
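For the direct-insert bonus, the sketch below shows the general pattern of parameterized inserts inside one transaction. It uses Python's built-in `sqlite3` with an in-memory database purely so that it runs without a server; it is not MySQL, and a MySQL driver would use its own connection call and placeholder style. The table and column names (`pids`, `sids`, `type1`, `type2`) and the `store` function are invented for the example.

```python
import sqlite3

# In-memory stand-in database with four hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pids (pid INTEGER PRIMARY KEY, last_sid INTEGER);
    CREATE TABLE sids (sid INTEGER PRIMARY KEY, pid INTEGER);
    CREATE TABLE type1 (sid INTEGER, name TEXT, value INTEGER);
    CREATE TABLE type2 (sid INTEGER, name TEXT, value INTEGER);
    INSERT INTO pids VALUES (42, 902);
""")


def store(conn, pid, new_sids, type1_rows, type2_rows):
    """Insert everything collected for one pid; assumes new_sids is newest first."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.executemany(
            "INSERT INTO sids (sid, pid) VALUES (?, ?)",
            [(sid, pid) for sid in new_sids],
        )
        conn.executemany("INSERT INTO type1 VALUES (?, ?, ?)", type1_rows)
        conn.executemany("INSERT INTO type2 VALUES (?, ?, ?)", type2_rows)
        if new_sids:
            conn.execute(
                "UPDATE pids SET last_sid = ? WHERE pid = ?",
                (new_sids[0], pid),
            )


store(conn, 42, [905, 904, 903], [(905, "alpha", 10)], [(905, "gamma", 30)])
print(conn.execute("SELECT last_sid FROM pids WHERE pid = 42").fetchone())  # (905,)
```

Doing all four writes for a pid in one transaction means a failure halfway through leaves the "last processed" value untouched, so the pid can simply be retried.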
Is this something Perl would be suitable for, or is there a better choice?
My system is Win 7 64-bit, btw, running MySQL 5.1 and Strawberry Perl 5.12.