
Re^2: fetching and storing data from web.

by nicolethomson (Initiate)
on Jan 27, 2012 at 07:50 UTC

in reply to Re: fetching and storing data from web.
in thread fetching and storing data from web.

Thanks Dear GrandFather

Yes I am going through the tutorials

I am going through HTML::TableExtract, as well

Initially I am using wget from a bash script to download those .htm files and store them in a local folder.

But how do I feed the data into MySQL or another SQL database?



Replies are listed 'Best First'.
Re^3: fetching and storing data from web.
by GrandFather (Sage) on Jan 27, 2012 at 08:15 UTC

    LWP::Simple does essentially the same job as wget but gets the data straight into Perl.

    DBI does the database work, but it needs a driver, so you need to pick the DBD module that matches the database you are using. Use DBD::mysql for MySQL, but if you have a choice I'd start with SQLite (DBD::SQLite) because it is standalone and requires no setup. MySQL can be hard to get going on some systems.
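A minimal sketch of the parse-and-store step, assuming the downloaded page contains a table with the columns "Name" and "Price". The column names, the `prices` table, and the `store_table` helper are hypothetical and only illustrate the shape of the code. The HTML string stands in for a file saved by wget or a page fetched with LWP::Simple's `get`.

```perl
use strict;
use warnings;
use DBI;
use HTML::TableExtract;

# Extract rows from the table headed "Name"/"Price" (hypothetical layout)
# and insert them into an SQLite table. Returns the number of rows stored.
sub store_table {
    my ($dbh, $html) = @_;

    $dbh->do('CREATE TABLE IF NOT EXISTS prices (name TEXT, price REAL)');
    my $sth = $dbh->prepare('INSERT INTO prices (name, price) VALUES (?, ?)');

    # With "headers" set, only matching tables are returned, the header
    # row is skipped, and columns come back in the order listed here.
    my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
    $te->parse($html);

    my $count = 0;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            $sth->execute(@$row);    # placeholders take care of quoting
            $count++;
        }
    }
    return $count;
}

# Stand-in for a page saved by wget or fetched with LWP::Simple.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>apple</td><td>1.50</td></tr>
  <tr><td>pear</td><td>2.25</td></tr>
</table>
HTML

# An in-memory database keeps the sketch self-contained; use a file name
# instead of :memory: to keep the data between runs.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
my $stored = store_table($dbh, $html);
print "stored $stored rows\n";
```

Moving to MySQL later should mostly be a matter of changing the connect string to a `dbi:mysql:` one and supplying a user name and password; the prepare/execute calls stay the same.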

    True laziness is hard work
Re^3: fetching and storing data from web.
by chessgui (Scribe) on Jan 27, 2012 at 08:18 UTC
    Wget is also available for Win32, so I have experience with it. I have to tell you that when the fetch is complicated (https + authentication + cookies + form posting + many redirections) it seems more stable and businesslike than the LWP module. For me it was much more difficult to get it right with LWP than with wget. In one particular case I could not get it right at all: the login succeeded (I know because the cookies generated were correct), but somewhere in the chain the redirection was lost and I never received the welcome page that follows the login. The same sequence worked with wget.

    When it comes to a simple 'GET', the LWP module proved to be perfect for me (it almost never freezes, and the timeout works all right).
      In cases like this where you essentially need to simulate a browser visiting the website, you can also use WWW::Mechanize which does exactly that. You can even control a real web browser through Perl modules to get exactly the interaction you would have with a website when you use it through your browser (WWW::Mechanize::Firefox).
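A rough sketch of a form login with WWW::Mechanize, which keeps a cookie jar and follows redirects on its own. The URL and the field names `username` and `password` are hypothetical, as is the `login_and_fetch` helper; a real site will use its own names.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Log in through an HTML form and return the page that follows.
# The field names 'username' and 'password' are assumptions; replace
# them with the names used by the real login form.
sub login_and_fetch {
    my ($url, $user, $pass) = @_;

    # autocheck makes a failed request die instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    $mech->get($url);

    # Picks the form that contains these fields, fills them in, submits it.
    $mech->submit_form(
        with_fields => { username => $user, password => $pass },
    );

    return $mech->content;
}

# Only touches the network when called with URL, user and password.
if (@ARGV == 3) {
    print login_and_fetch(@ARGV);
}
```

Run it as `perl login.pl URL USER PASSWORD` (the script name is arbitrary). If the welcome page still does not arrive, printing `$mech->uri` after each step shows where the redirect chain ends.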
        Finally, using the manual methods suggested, I was able to build WWW::Mechanize::Firefox on Win32 to the extent that 'use WWW::Mechanize::Firefox;' in and of itself does not cause an error.

        However, the object itself cannot be created:

        use WWW::Mechanize::Firefox;
        open STDERR,'>>out.txt';
        my $mech = WWW::Mechanize::Firefox->new();
        ##########################
        output:
        Failed to connect to , problem connecting to "localhost", port 4242: Nem hozható létre kapcsolat, mert a célszámítógép már visszautasította a kapcsolatot. at C:/strawberry/perl/site/lib/MozRepl/ line 144
        The language of my operating system is not English, so the error message roughly means: 'connection failed because the destination computer refused to establish the connection'. This message popped up many times during the build, by the way. Note that I had installed the necessary plugin for Firefox (mozrepl) and Firefox was running when I received this message (I switched off the firewall, but to no avail).

        Any thoughts on that?
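One way to narrow this down is to check whether anything is listening on localhost port 4242 at all, independent of WWW::Mechanize::Firefox. The sketch below uses only the core IO::Socket::INET module; the `port_is_open` helper is made up for illustration. One possible cause (an assumption, not something confirmed in this thread) is that the mozrepl extension is installed but its listener has not been started inside Firefox.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 1 if a TCP connection to $host:$port succeeds, 0 otherwise.
sub port_is_open {
    my ($host, $port) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 2,       # seconds to wait before giving up
    );
    return 0 unless $sock;

    close $sock;
    return 1;
}

# Port 4242 is the one named in the error message above.
my ($host, $port) = ('localhost', 4242);
print port_is_open($host, $port)
    ? "$host:$port accepts connections\n"
    : "$host:$port refused or timed out\n";
```

If the port is closed while Firefox is running, the problem is on the Firefox side rather than in the Perl modules.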
        This looks very interesting. I would be happy if I could use such high level modules for web browsing.

        However, on my Win32 machine this is the sad result of the build:
        CPAN: CPAN::SQLite loaded ok (v0.199)
        Running install for module 'WWW::Mechanize'
        Running make for J/JE/JESSE/WWW-Mechanize-1.71.tar.gz
        CPAN: Digest::SHA loaded ok (v5.61)
        CPAN: Compress::Zlib loaded ok (v2.034)
        Checksum for C:\strawberry\cpan\sources\authors\id\J\JE\JESSE\WWW-Mechanize-1.71.tar.gz ok
        CPAN: Archive::Tar loaded ok (v1.76)
        CPAN: File::Temp loaded ok (v0.22)
        CPAN: Parse::CPAN::Meta loaded ok (v1.4401)
        CPAN: CPAN::Meta loaded ok (v2.110930)
        CPAN: YAML loaded ok (v0.73)
        Going to build J/JE/JESSE/WWW-Mechanize-1.71.tar.gz
        WWW::Mechanize likes to have a lot of test modules for some of its tests.
        The following are modules that would be nice to have, but not required.
        Test::Memory::Cycle
        Test::Taint
        Checking if your kit is complete...
        Looks good
        Writing Makefile for WWW::Mechanize
        Could not read metadata file. Falling back to other methods to determine prerequisites
        CPAN: Module::CoreList loaded ok (v2.46)
        cp lib/WWW/Mechanize/Examples.pod blib\lib\WWW\Mechanize\Examples.pod
        cp lib/WWW/Mechanize/ blib\lib\WWW\Mechanize\
        cp lib/WWW/Mechanize/ blib\lib\WWW\Mechanize\
        cp lib/WWW/Mechanize/Cookbook.pod blib\lib\WWW\Mechanize\Cookbook.pod
        cp lib/WWW/Mechanize/FAQ.pod blib\lib\WWW\Mechanize\FAQ.pod
        cp lib/WWW/ blib\lib\WWW\
        C:\strawberry\perl\bin\perl.exe -MExtUtils::Command -e "cp" -- bin/mech-dump blib\script\mech-dump
        pl2bat.bat blib\script\mech-dump
        JESSE/WWW-Mechanize-1.71.tar.gz
        C:\strawberry\c\bin\dmake.EXE -- OK
        Running make test
        C:\strawberry\perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t\00-load.t t\add_header.t t\aliases.t t\area_link.t t\autocheck.t t\clone.t t\content.t t\cookies.t t\credentials-api.t t\credentials.t t\die.t t\field.t t\find_frame.t t\find_image.t t\find_inputs.t t\find_link-warnings.t t\find_link.t t\find_link_id.t t\form-parsing.t t\form_with_fields.t t\frames.t t\image-new.t t\image-parse.t t\link-base.t t\link-relative.t t\link.t t\new.t t\pod-coverage.t t\pod.t t\regex-error.t t\save_content.t t\select.t t\taint.t t\tick.t t\untaint.t t\upload.t t\warn.t t\warnings.t t\local\back.t t\local\click.t t\local\click_button.t t\local\content.t t\local\encoding.t t\local\failure.t t\local\follow.t t\local\form.t t\local\get.t t\local\nonascii.t t\local\overload.t t\local\page_stack.t t\local\referer.t t\local\reload.t t\local\submit.t t\mech-dump\mech-dump.t
        t\00-load.t .............. ok
        t\add_header.t ........... ok
        t\aliases.t .............. ok
        t\area_link.t ............ ok
        t\autocheck.t ............ ok
        t\clone.t ................ ok
        t\content.t .............. ok
        t\cookies.t .............. skipped: HTTP::Server::Simple does not support Windows yet.
        t\credentials-api.t ...... ok
        t\credentials.t .......... ok
        t\die.t .................. ok
        t\field.t ................ ok
        t\find_frame.t ........... ok
        t\find_image.t ........... ok
        t\find_inputs.t .......... ok
        t\find_link-warnings.t ... ok
        t\find_link.t ............ ok
        t\find_link_id.t ......... ok
        t\form-parsing.t ......... ok
        t\form_with_fields.t ..... Dubious, test returned 1 (wstat 256, 0x100)
        All 8 subtests passed
        JESSE/WWW-Mechanize-1.71.tar.gz
        C:\strawberry\c\bin\dmake.EXE test -- NOT OK
        //hint// to see the cpan-testers results for installing this module, try:
        reports JESSE/WWW-Mechanize-1.71.tar.gz

        This is why I want to be able to achieve my goals with the simplest possible means without relying on high level modules.
