PerlMonks  

Re^5: getting LWP and HTML::TokeParser to run

by Marshall (Canon)
on Oct 10, 2010 at 18:51 UTC


in reply to Re^4: getting LWP and HTML::TokeParser to run
in thread getting started with LWP and HTML::TokeParser

Well, as far as the usage policy goes, do check. When I run automated scripts, I do it at night during low-load times. And I often put in a sleep() after some number of requests to slow things down.
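
A minimal sketch of the sleep-after-N-requests idea, not a definitive implementation. The name `fetch_politely` and its parameters (`batch_size`, `pause_secs`, `fetch`, `pause`) are hypothetical; the fetcher is passed in as a code ref so the loop can be tried without touching the network.

```perl
use strict;
use warnings;

# Fetch every URL in turn, pausing after each full batch of requests.
# All names here are illustrative assumptions, not part of LWP.
sub fetch_politely {
    my (%args) = @_;
    my $urls       = $args{urls};
    my $fetch      = $args{fetch};                       # code ref: URL -> result
    my $batch_size = $args{batch_size} // 10;            # requests between pauses
    my $pause_secs = $args{pause_secs} // 5;             # length of each pause
    my $pause      = $args{pause} // sub { sleep $_[0] };

    my @results;
    my $count = 0;
    for my $url (@$urls) {
        push @results, $fetch->($url);
        $count++;
        # Pause after each full batch, but not after the very last URL.
        $pause->($pause_secs) if $count % $batch_size == 0 && $count < @$urls;
    }
    return @results;
}
```

With LWP this could be called as `fetch_politely(urls => \@urls, fetch => sub { $ua->get($_[0]) })`, where `$ua` is an `LWP::UserAgent` object; passing a short list of URLs first is an easy way to test on a small set of pages.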

One thing to investigate is whether this site provides the information that you need in an easier format than web pages. Many big sites do that. Some sites I use actually have a separate URL for automated requests and even provide tools for using their more efficient computer-to-computer methods.

On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages.

Re^6: getting LWP and HTML::TokeParser to run
by Perlbeginner1 (Scribe) on Oct 10, 2010 at 19:33 UTC
    Hello Marto, hello Marshall, good evening!

    My plan is to visit each of these five thousand pages - I do not intend to hammer the server ;-)

    I agree with Marshall: "On the other hand, this site has "bandwidth to burn". I don't think that they will notice 5,000 pages. But do testing with a small set of pages."

    This government site has a very, very big server!

    Well - if I get all the pages with "Mechanize", do I have to use HTML::TokeParser as well for the parsing, in order to get the information out of every single page? I have read on the CPAN site for Mechanize:

    $mech->find_all_inputs( ... criteria ... )

    find_all_inputs() returns an array of all the input controls in the current form whose properties match all of the regexes passed in. The controls returned are all descended from HTML::Form::Input.

    If no criteria are passed, all inputs will be returned.
    If there is no current page, there is no form on the current page, or there are no submit controls in the current form then the return will be an empty array.

    You may use a regex or a literal string:
        # get all textarea controls whose names begin with "customer"
        my @customer_text_inputs = $mech->find_all_inputs(
            type       => 'textarea',
            name_regex => qr/^customer/,
        );

        # get all text or textarea controls called "customer"
        my @customer_text_inputs = $mech->find_all_inputs(
            type_regex => qr/^(text|textarea)$/,
            name       => 'customer',
        );

    Well, if I can run Mechanize with some additional steps for the parsing part, that would be great!
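
    A rough sketch of how the parsing step might look once a page has been fetched, for example from the HTML string returned by `$mech->content`. It assumes, without having checked the real pages, that each field sits in a table row with two `<td>` cells (label first, value second); the function name `extract_fields` is hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumption: every field is a <tr> holding two <td> cells, the label
# (ending in a colon) and then the value. Rows with a different shape
# would need extra handling that this sketch does not attempt.
sub extract_fields {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my %fields;
    while ($parser->get_tag('tr')) {
        $parser->get_tag('td') or last;
        my $label = $parser->get_trimmed_text('/td');
        $parser->get_tag('td') or last;
        my $value = $parser->get_trimmed_text('/td');

        $label =~ s/:\s*$//;                 # drop the trailing colon
        $fields{$label} = $value if length $label;
    }
    return \%fields;
}
```

    The same function could then be called once per fetched page, so the two modules would split the work: Mechanize for fetching, HTML::TokeParser for pulling the text out.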

    @Marshall: I can have a look at whether they provide the information I need in an easier format than web pages. But I guess that I will have to fetch page by page, and that the best way to do this is in a nightly job!

    I will come back and report all findings! See you soon!

    regards Perlbeginner1


    What I am aiming for: 17 lines of text. This set of information is wanted 5081 times:


    See an example here: Allgemeine Daten der Schule / Behörde (general data of the school / authority):

    Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
    Schulart: Öffentliche Schule (04139579)
    Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
    Postfachadresse: Keine Angabe
    Telefon: 07584/92270
    Fax: 07584/922729
    E-Mail: poststelle@04139579.schule.bwl.de
    Internet: www.hpv-altshausen.de
    Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
    Schulleitung: Mößle, Georg
    Stellv. Schulleitung: Schneider, Cornelia
    Anzahl Schüler: 456
    Anzahl Klassen: 19
    Anzahl Lehrer: 39
    Kreis: Ravensburg
    Schulträger: <kein Eintrag> (Ohne Zuordnung)



    This is a true Perl job. I think that Perl can do this kind of job with ease! All those 5081 pages are human-readable, but if I tried to click through page by page and read all the data, it would take me more than a month.

    If I can do it with Perl, then I only need to write the parsing code once!
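
    To illustrate the "write the parsing code once" idea, here is a small sketch that assumes each page has already been reduced to plain "Label: value" lines like the example above. The names `parse_record` and `record_to_row` are hypothetical, and the tab-separated output is just one possible choice.

```perl
use strict;
use warnings;

# Split "Label: value" lines into a hash. Only the first colon separates
# label from value, so later colons stay inside the value.
sub parse_record {
    my ($text) = @_;
    my %record;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/;
        $record{$1} = $2;
    }
    return \%record;
}

# Turn one record into a tab-separated row with a fixed column order;
# fields missing from the record become empty strings.
sub record_to_row {
    my ($record, @columns) = @_;
    return join "\t", map { $record->{$_} // '' } @columns;
}
```

    Calling `record_to_row(parse_record($text), @columns)` for each of the 5081 pages with the same column list would give one row per school, ready for a spreadsheet.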
