PerlMonks  

Re: collect data from web pages and insert into mysql

by Your Mother (Archbishop)
on Jul 30, 2010 at 16:21 UTC ( [id://852109] )


in reply to collect data from web pages and insert into mysql

This can be done by a noob (if the site doesn't use JavaScript to load/present data). I'd recommend WWW::Mechanize over LWP(::UserAgent) because it is LWP::UserAgent under the hood and has much better, more browser-like controls.

Hurdles include DB design, which is easy to do badly if you've never done it, and a bad design will make everything else much harder. ETL is pretty easy, but only after you've done it several times. For a noob, even a technically gifted one, this is a project that would fill up a couple of full-time weeks at least.

You'll probably get good advice here if you ask at each stage after you've tried to work something out for yourself. E.g.: "I wrote perl-xyz to do stage 1 of the project; is this a good way to do it?"

Replies are listed 'Best First'.
Re^2: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 30, 2010 at 16:58 UTC

    Thanks!

    It's nice and refreshing to be greeted in such a friendly and helpful way. I think I have Mechanize now (did the "cpan WWW::Mechanize" from prompt and lots of stuff happened ;) ).

    I guess the first part, as training, will be reading the pid list from a file, setting each pid as a variable, creating a file with the pid as its name, and then repeating until the list is done.
    (more or less the start and end of the entire project).
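The training step described above could be sketched roughly like this. It is only a sketch under assumptions: the list file is taken to hold one pid per line, and the sub names `read_pids` and `create_pid_files`, the `.txt` extension, and the placeholder file content are all made up for illustration. Only core Perl is used.

```perl
use strict;
use warnings;

# Read one pid per line from a list file, skipping blank lines.
# (Assumes one pid per line; adjust if the real list looks different.)
sub read_pids {
    my ($list_file) = @_;
    open my $in, '<', $list_file or die "Cannot open $list_file: $!";
    my @pids;
    while ( my $line = <$in> ) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace, including the newline
        next unless length $line;   # ignore empty lines
        push @pids, $line;
    }
    close $in;
    return @pids;
}

# Create (or overwrite) one file per pid inside $out_dir and
# return the list of paths that were written.
sub create_pid_files {
    my ( $out_dir, @pids ) = @_;
    my @created;
    for my $pid (@pids) {
        my $path = "$out_dir/$pid.txt";    # hypothetical naming scheme
        open my $out, '>', $path or die "Cannot write $path: $!";
        print {$out} "pid: $pid\n";        # placeholder content only
        close $out or die "Cannot close $path: $!";
        push @created, $path;
    }
    return @created;
}

# Run only when called as: perl script.pl LIST_FILE OUTPUT_DIR
if ( @ARGV == 2 ) {
    my ( $list_file, $out_dir ) = @ARGV;
    my @pids = read_pids($list_file);
    print "$_\n" for create_pid_files( $out_dir, @pids );
}
```

Keeping the reading and the writing in separate subs means the loop over pids can later be pointed at the page-fetching code instead of the placeholder `print`.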

      Hmm, actually managed (with help from AWP) to create a valid URL (inserting pid and page number) for a sortie list page and have the source printed to screen.

      Looking at the resulting code it should be simple (I think) to create a list of sids (sortie pages) to process as they are all listed in the source as sid=XXXXXX (inside a string).
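Pulling the sids out of the page source might look something like the sketch below. It assumes the sids are purely numeric and always appear as `sid=` followed by digits; the thread only shows them as `sid=XXXXXX`, so the pattern may need adjusting. The sub name `extract_sids` and the sample HTML (including the `sortie.php` link format) are invented for illustration.

```perl
use strict;
use warnings;

# Return every distinct sid found in the page source, in order of
# first appearance. Assumes sids are digits following "sid=".
sub extract_sids {
    my ($html) = @_;
    my %seen;
    my @sids;
    while ( $html =~ /\bsid=(\d+)/g ) {
        push @sids, $1 unless $seen{$1}++;    # skip repeats
    }
    return @sids;
}

# Made-up sample standing in for the real sortie list page.
my $sample = <<'HTML';
<a href="sortie.php?sid=123456">Sortie 1</a>
<a href="sortie.php?sid=123457&amp;pid=42">Sortie 2</a>
<a href="sortie.php?sid=123456">Sortie 1 again</a>
HTML

print "$_\n" for extract_sids($sample);
# prints 123456 and 123457, one per line
```

A regex is enough when the value is this regular. If the links turn out to be messier, WWW::Mechanize can also hand back the page's links as objects, which avoids matching against raw HTML.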

      Must say my head is spinning a bit though, this is a lot to take in (I mainly copied something I found and adapted it, not like I could write it from scratch myself).

      I guess next step would be to store the page as a temp file and figure out how to grab and save those sids.

      Thanks for encouragement and help!
