getting started with LWP and HTML::TokeParser

by Perlbeginner1 (Scribe)
on Oct 10, 2010 at 08:46 UTC

Dear Perl Monks, I have a problem with LWP and HTML::TokeParser

I want to access a URL, and this site has many very similar pages with content of interest. To get the content of a particular URL, the simplest way is to use LWP::Simple's functions.

With Perl, we can call its get($url) function. It will try getting that URL's content. If it works, then it'll return the content; but if there's some error, it'll return undef.

So here is the problem. If you look at this page:
and press "all", you get a page with lines (links):

with endings from 04126159 to roughly 04900000. Many of them are empty, so we would have to run from zero to 06000000 to get them all. In other words, to get all the pages we have to count the ID in the URL from roughly 04100000 to 04999999, or better still to 06000000.
If I can manage this counting and LWP runs well, I then need to parse the content with HTML::TokeParser, HTML::TreeBuilder, XML::LibXML or something similar in order to get the content out of the pages.
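The counting described above could be sketched roughly as follows. This is only a sketch under assumptions: the base URL is a made-up placeholder (the real one is not shown in this post) and candidate_urls is a hypothetical helper name. The one real pitfall it illustrates is the leading zero: the IDs should be built with sprintf from plain decimal numbers, because Perl reads a numeric literal with a leading zero as octal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical placeholder; substitute the real detail-page URL prefix.
my $base = 'http://www.example.com/detail.php?id=';

# Build the URLs for all 8-digit, zero-padded IDs in [$from, $to].
# Pass the bounds as plain decimal numbers (4100000, not 04100000):
# a literal with a leading zero is an octal literal in Perl.
sub candidate_urls {
    my ($from, $to) = @_;
    return map { $base . sprintf('%08d', $_) } $from .. $to;
}

# Three consecutive IDs, just to show the padding.
my @urls = candidate_urls(4_126_159, 4_126_161);
print "$_\n" for @urls;
```

For the full range it is probably better to loop over the numbers one at a time than to build a list of nearly two million URLs in memory.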

This is the content wanted from each page:

Allgemeine Daten der Schule / Behörde:

Schul-/Behördenname: Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule
Schulart: Öffentliche Schule (04139579)
Hausadressse: Ebersbacher Str. 20, 88361 Altshausen
Postfachadresse: Keine Angabe
Telefon: 07584/92270
Fax: 07584/922729
Übergeordnete Dienststelle: Staatliches Schulamt Markdorf
Schulleitung: Mößle, Georg
Stellv. Schulleitung: Schneider, Cornelia
Anzahl Schüler: 456
Anzahl Klassen: 19
Anzahl Lehrer: 39
Kreis: Ravensburg
Schulträger: <kein Eintrag> (Ohne Zuordnung)

Here is the HTML of such a page, with the results:
04126159 +ETEHREF=
<!-- WRAPPED CONTENT -->
<table id="wrappedcontent"> <tr><td> <br/> <br>
<p><a href="../../menu/1188427/index.html?COMPLETEHREF=http://">Schnellsuche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF=http://www.kul">Erweiterte Suche</a> | <a href="../../menu/1188427/index.html?COMPLETEHREF= +frage/hilfe.php">Hilfe</a><script language="javascript"> document.write(' | <a href="javascript:history.back()">zur&uuml;ck zur Trefferliste</a>'); </script> </p>
<h1>Allgemeine Daten der Schule / Beh&ouml;rde:</h1>&nbsp;
<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5">
<tr> <td><strong>Schul-/Behördenname:</strong>&nbsp;</td> <td width=500> Herzog-Philipp-Verbandsschule Grund- u. Werkrealschule </td></tr>
<tr> <td><strong>Schulart:</strong>&nbsp;</td> <td width=500> Öffentliche Schule (04139579) </td></tr>
<tr><td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr>
<tr> <td><strong>Postfachadresse:</strong>&nbsp;</td> <td> Keine Angabe </td></tr>
<tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
<tr> <td><strong>Fax:</strong>&nbsp;</td> <td> 07584/922729 </td></tr>
<tr> <td><strong>E-Mail:</strong>&nbsp;</td> <td> <a href="mailto:poststelle@04139579.schu" TARGET="_blank"></a> </td></tr>
<tr> <td><strong>Internet:</strong>&nbsp;</td> <td> <a href=" +" target="_blank"></a><br> </td></tr>
<tr> <td><strong>&Uuml;bergeordnete Dienststelle:</strong>&nbsp;</td> <td> <a href="http://www.s" target="_blank">Staatliches Schulamt Markdorf </a><br> </td></tr>
<tr> <td><strong>Schulleitung:</strong>&nbsp;</td> <td> M&ouml;&szlig;le, Georg </td></tr>
<tr> <td><strong>Stellv. Schulleitung:</strong>&nbsp;</td> <td> Schneider, Cornelia </td> </td></tr>
<tr> <td><strong>Anzahl Sch&uuml;ler:</strong>&nbsp;</td> <td> 456 </td></tr>
<tr> <td><strong>Anzahl Klassen:</strong>&nbsp;</td> <td> 19 </td></tr>
<tr> <td><strong>Anzahl Lehrer:</strong>&nbsp;</td> <td> 39 </td></tr>
<tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr>
<tr> <td><strong>Schulträger:</strong>&nbsp;</td> <td> &lt;kein Eintrag&gt; (Ohne Zuordnung) </td></tr></table>
<!--<table border="0"> <tr> <td><br><p>Die Adressdaten (Hausadresse, Postfachadresse, Telefon, Fax und Internet) werden vom Kultusministerium (Referat 15, Information und Kommunikation, Iuk-Verfahren in Schulen und Schulverwaltung) zur Verfügung gestellt - Änderungswünsche können Sie per E-Mail <a href=" service-bw-Schuladressdatenänderung">an das Service Center SVN</a> übermitteln. </p><p>Für die Änderung aller anderen Angaben wenden Sie sich bitte an Ihre obere Schulaufsichtsbehörde. </p><p>Die Schüler-, Lehrer- und Klassenzahlen beruhen auf Daten der letzten amtlichen Schulstatistik (Ende Januar).</p>//--><!-- </td> </tr></table>//-->
</td></tr> </table>
<!-- WRAPPED CONTENT END -->

This is what I have already:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;
    use LWP::Simple;
    use HTML::TokeParser;

    # Just an example: the URL whose ID we have to count up in order to
    # get all the pages, from roughly 04100000 to 04999999, or better
    # still to 06000000.
    my $url = ' ';

    my $content = get($url);
    die "Couldn't get $url" unless defined $content;

    # Then go do things with $content, like this:
    # start a new parser job on the fetched content. Pass a reference to
    # the string; a plain scalar would be treated as a file name.
    my $p = HTML::TokeParser->new(\$content)
        or die "Can't parse the content of $url";

    # find the tags 'xyz'
    while (my $tag = $p->get_tag('div', '/html')) {
        last if $tag->[0] eq '/html';   # end of the document
        # ... process the tag here ...
    }

    # my output... !!
    my $out_file = './output.xml';

Dear Monks, can I go further with this approach? Any and all help is greatly appreciated! Your perlbeginner1

Replies are listed 'Best First'.
Re: getting LWP and HTML::TokeParser to run
by Marshall (Canon) on Oct 10, 2010 at 11:09 UTC
    Go to:

    You will have to submit the form with a Suchbegriff (search term) of "*".
    That will result in a page of 5081 results. To get the sub-pages you want, you will have to "click" via LWP or whatever to follow these links, all 5081 of them.

    Start with trying to submit the search term of "*" on the main page and see if you can do that.

      Hello Marshall

      Many thanks for the reply! I can do as you advised. I can see the 5081 results.

      Now I have to get the sub-pages. I have to "click" via LWP to follow these links, all 5081 of them.
      And then I have to do the job with HTML::TreeBuilder or HTML::TokeParser!
      I for one prefer HTML::TokeParser since I know it a little bit.

      I have very little experience with HTML::TokeParser, so I guess the parser part will be somewhat beyond my skills.

      But first things first: how should the LWP part look?

      any and all help will be greatly appreciated!


        Here is an example which uses WWW::Mechanize to visit the page, populate the field and submit the form. Error checking is left as an exercise for you; this is a short example to get you started:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $url = ' +dler_yno/index.html';
        my $mech = WWW::Mechanize->new();

        $mech->get( $url );
        $mech->field('einfache_suche','*');
        $mech->submit();

        # $mech->content now contains the results page.

        I can't read German, so you'd better check that you're not breaking any site policy regarding automation.

        I would go with marto's advice about WWW::Mechanize. I haven't used it yet, but I hear that it is great. I suspect that you will find it easier to use than any advice I could give about decoding the raw HTML to get the next pages to "click" on. You are getting about 5K pages from a huge government website that performs very well. I wouldn't worry too much about fancy error recovery with retries unless you are going to run this program often.

        You can of course parse the HTML content of the search results with regex, but this is a mess...

        my (@hrefs) = $mech->content =~ m|COMPLETEHREF= +/did_abfrage/detail.php\?id=\d+|g;
        print "$_\n" foreach @hrefs;
        # there are 5081 of these
        # these COMPLETEHREF's can be appended to a main url like this:
        my $example_url = ' +27/index.html?COMPLETEHREF= +.php?id=04146900';

        Then things get hairy and you will want to whip out some of that HTML parser voodoo to parse the resulting table. Also, the character encodings aren't consistent: for example, the page has a literal ä, but ü is coded as the entity &uuml;
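As a rough idea of what that parser step could look like with HTML::TokeParser, here is a minimal sketch. It assumes every row of the detail table has the shape shown in the sample page above (a label in <strong> followed by the value in the next <td>); extract_fields is a hypothetical helper name, not something from the thread. TokeParser returns text with entities already decoded, which takes care of the mixed ä / &uuml; coding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Return a hash ref of label => value pairs from a detail page.
# Assumes each row looks like
#   <tr><td><strong>Label:</strong></td><td>value</td></tr>
sub extract_fields {
    my ($html) = @_;
    # A reference to a scalar means "parse this string", not a file name.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my %field;
    while ($p->get_tag('strong')) {
        # Label text up to </strong>; entities are already decoded.
        my $label = $p->get_trimmed_text('/strong');
        $label =~ s/:\s*$//;
        # The value is the text of the next <td> cell.
        $p->get_tag('td') or last;
        my $value = $p->get_trimmed_text('/td');
        # &nbsp; decodes to a no-break space; turn it into a plain space.
        $value =~ tr/\xA0/ /;
        $field{$label} = $value;
    }
    return \%field;
}

# A few rows copied from the sample page, used as test input.
my $sample = <<'HTML';
<table>
<tr> <td><strong>Telefon:</strong>&nbsp;</td> <td> 07584/92270 </td></tr>
<tr><td><strong>Hausadressse:</strong>&nbsp;</td><td>Ebersbacher Str. 20,&nbsp;88361&nbsp;Altshausen</td></tr>
<tr> <td><strong>Kreis:</strong>&nbsp;</td> <td> Ravensburg </td></tr>
</table>
HTML

my $f = extract_fields($sample);
print "$_ = $f->{$_}\n" for sort keys %$f;
```

The labels come out exactly as the site spells them (including "Hausadressse"), so any later lookup has to use that spelling too.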
Re: getting started with LWP and HTML::TokeParser
by BrimBorium (Friar) on Oct 10, 2010 at 18:36 UTC

    I just want to advise you to check whether you are allowed to use your script on a government website. Just because it's possible does not always mean it's a good idea. Be sensitive about data privacy when automating web things. When using the Swiss Army chainsaw, be aware you might cut your leg off if you're careless.

    BTW: I don't feel comfortable with people posting valid mail addresses and phone numbers instead of example data ... because I hate spam.

    Did you read Choosing a username? Do you really want to stay Perlbeginner1 your whole life? Just out of curiosity ;-)

      Hi there Brimborium

      Thanks for sharing your ideas!

      Since I am a teacher and have been working in the field of education for years, I know very well what I am doing! I have no trouble with parsing this governmental site!

      The data I am trying to get is readable, so I am just mechanizing the reading...

      BTW, one word regarding the data: it is official address data, names and numbers of schools, nothing else.

      Some general address sets that contain nothing really sensitive!

      But again, thanks for sharing your ideas. BTW: what is wrong with my username? I am a beginner.

      regards - perl beginner1

        There is nothing wrong with your name, but you will stay a beginner forever, at least with your name...

        I just want to point out that you should choose the right way to do things, to avoid causing more damage than benefit. You can use a club to get a fly off your friend's shoulder, but he may not recover from your favour. You have a powerful tool with Perl; I just want to be sure that you use it wisely.

        Reading a lot of files in a short time from a public server could be misinterpreted... If you are a teacher you should be aware of the consequences: you might kill the server with a buggy script. I just want to prevent you from having to say "Oh, I did not WANT that, it was really not my intention".

        A phonebook is also available to the public, but I dislike the idea of having it machine-readable for dialing bots...

        I have been a software developer for many years, and if you're running example or test code against real data on a public server, you really do not know what you are doing, from my point of view.
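One way to keep such a crawl gentle on the server might be LWP::RobotUA, which honours robots.txt and waits between requests to the same host. The following is a sketch only: the agent name and contact address are placeholders, and polite_ua and fetch_all are hypothetical helper names introduced here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Build a user agent that obeys robots.txt and pauses between requests.
# The agent name and contact address are placeholders.
sub polite_ua {
    my ($delay_seconds) = @_;
    my $ua = LWP::RobotUA->new('school-list-fetcher/0.1', 'you@example.com');
    $ua->delay($delay_seconds / 60);   # delay() is given in minutes
    return $ua;
}

# Fetch each URL in turn; return a hash ref of url => content
# for the requests that succeeded, and warn about the others.
sub fetch_all {
    my ($ua, @urls) = @_;
    my %page;
    for my $url (@urls) {
        my $res = $ua->get($url);
        if ($res->is_success) {
            $page{$url} = $res->decoded_content;
        }
        else {
            warn "Skipping $url: ", $res->status_line, "\n";
        }
    }
    return \%page;
}

my $ua = polite_ua(2);   # wait at least 2 seconds between requests
# my $pages = fetch_all($ua, @urls);   # @urls: the detail-page URLs
```

With a 2 second delay, 5081 pages take close to three hours, which is slow but unlikely to be mistaken for an attack.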

