Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

OT: What a hideous sofa! (Re: Parsing with HTML::TreeBuilder::LibXML...) by roboticus (Chancellor) on Sep 25, 2010 at 19:20 UTC
Perlbeginner1:

I don't intend to be rude, but your post sounds to me like someone walking into someone's home and exclaiming "Geez! What an ugly sofa! You should buy a new one!" All the while, you're wearing a burnt-orange shirt and a lime-green suit with purple socks.

I personally don't want all the glitter and beads on a website. I just want something that works well. And things are pretty nice here, thanks.

Anyway, if each file contains one chunk of text you want, then I'd do something similar to:

    use strict;
    use warnings;

    open OUF, '>', 'CollectedInfo.txt' or die;

    # Every file name given on the command line is processed in turn
    while (my $FName = shift) {
        print OUF doit($FName), "\n\n";
    }
    close OUF;

    sub doit {
        my $text;
        if (! open INF, '<', $_[0]) {
            print "Can't open file '$_[0]', skipped!\n";
            return '';
        }

        # slurp in the entire file
        local $/;
        $text = <INF>;

        # Trim off everything up to and including "Hit # of #"
        # (/s lets "." match the newlines in the slurped text)
        $text =~ s/^.*?Hit\s+\d+\s*of\s*\d+//s;

        # Trim off "listed since:" and everything after it
        $text =~ s/listed since:.*$//s;

        # Code for cleaning up the text goes here
        # (left as exercise for the reader)

        # Return desired chunk of text
        return $text;
    }

Note: untested, you can keep both pieces if (when) it breaks, etc.

...roboticus
Re: OT: What a hideous sofa! (Re: Parsing with HTML::TreeBuilder::LibXML...) by Erez (Priest) on Sep 26, 2010 at 07:59 UTC
> someone walking into someone's home and exclaiming "Geez! What an ugly sofa! You should buy a new one!" All the while, you're wearing a burnt-orange shirt and a lime-green suit with purple socks.

I don't care if you're wearing a $10,000 Armani suit, you don't go into someone's house and start criticising his sofas. I can take criticism, but I draw the line at sofas.

Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert.
Re^2: OT: What a hideous sofa! (Re: Parsing with HTML::TreeBuilder::LibXML...) by wfsp (Abbot) on Sep 26, 2010 at 08:44 UTC
Sofa? Sofa! It's a settee if you don't mind!
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by ikegami (Patriarch) on Sep 25, 2010 at 19:20 UTC
> unfortunately this bullshit-forum killed all my html

No, it rendered your HTML. Just use "`<p>`" at the start of every paragraph, and "`<c>...</c>`"* around blocks of computer text (code, data, input, output).

You obviously know HTML, yet you're advocating having to learn yet another variant of UBB's markup to post here. That's a step backwards.

* You could use standard HTML tags for code blocks too, but `<c>...</c>` is much more convenient since it handles escaping "`&`" and "`<`" for you.
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by AnomalousMonk (Archbishop) on Sep 25, 2010 at 21:00 UTC
> So - now i have to retype it again!

So re-typing two [ and two ] is such a burden? You definitely need another site!
Re^2: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1 (Scribe) on Sep 26, 2010 at 07:15 UTC
Hello to all,

many thanks for your answers, and many thanks that you did not throw me off this site.

Update: sorry for being so rude. I did not know how to format my post the right way.

So here again: many, many thanks for your answers. I am happy that you did not ban me from this site!

Your perl-beginner
Re^3: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by GrandFather (Saint) on Sep 26, 2010 at 07:42 UTC
It's pretty unusual for anyone who posts a genuine question to be banned. We realise that most people have the capacity to learn. It's a real pleasure, however, to have someone not only prepared to learn, but also to apologise for their somewhat wayward start. We are happy to have you aboard!

True laziness is hard work
Re^3: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by marto (Cardinal) on Sep 26, 2010 at 08:07 UTC
When you are posting, formatting advice is given above the text area and, more extensively, below it. When you press the 'preview' button, your post is rendered for your examination prior to submission. Had you checked the preview, you would have seen that the post did not look the way you wanted. If you'd read the formatting advice, you'd have known how to fix it.
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by shmem (Chancellor) on Sep 25, 2010 at 20:48 UTC
> Update - unfortunately this bullshit-forum killed all my html - this is very very bad! @ the admins: Please throw away this bullshit forum and buy a new one. Have a closer look at the forums at a. devshed b. phpBuilder c. openSuse. etc etc. They do not use such lousy software. They are not so typical '90s that everyone can see that the forum is pretty outdated! So - now i have to retype it again!

Please. We are all doing fine being outdated and zombies. PerlMonks is a site which steadfastly (some say slowly) moves through the programming universe in its own time. Grab all the 860.000+ nodes available here, convert them to a new shiny forum which is as usable as this obsolete engine, and come back again with a link.

Update: the number of nodes is one order of magnitude higher: 860.000+, not 86.000+ (thanks ambrus and AnomalousMonk)
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Anonymous Monk on Sep 25, 2010 at 19:03 UTC
See Markup in the Monastery.

For example domains, use the official names reserved for such use, i.e. "example.com", "example.net", "example.org" or "example.edu".

We try to keep the language used here professional.
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by wfsp (Abbot) on Sep 26, 2010 at 09:44 UTC
This will extract the text. It uses HTML::TokeParser::Simple, which is a wrapper around HTML::Parser. I've added white space to the HTML for clarity.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(*DATA)
      or die qq{cant parse html: $!\n};

    my @text;
    while (my $t = $p->get_token){
      next unless $t->is_text;
      my $txt = $t->as_is;
      if ($txt =~ /Hit/ .. $txt =~ /Listed since/){
        for ($txt){
          s/^\s+//;
          s/\s+$//;
        }
        next unless $txt;
        push @text, $txt;
      }
    }
    print qq{$_\n} for @text;
    __DATA__
    <br><br>
    <h2>Hit 7 out of 120517</h2>
    <img src="http://myweb.org/images/wappen/ni.gif" class="wappen_pos" width="45" height="53" alt="country" title="countryname" />
    <br>
    <div style="width: 40em;"><br>
      <div style="display: inline;">
        <div class="logo_homepage">
          <a class="img_inl" href="http://myWeb.org/222237520031111" >
          </a>
        </div>
        <br>
        <div class="fm_linkeSpalte">
          <h2>name 1</h2>
          <br>
          <span class="schulart_text">type: one (for example)</span>
          <p class="einzel_text">
            Adress: Paris, 3ne Boulevard Saint Lo<br /><br>
            Telefon:048 + 334555664 , Fax: 048 + 334555667<br />
            MyWeb-Nummer: 222237520031111 <br />
            Webmaster:
            <a href="mailto: webmaster@demosite.fr" class="p1" >
              master
            </a>
            <br />
          </p>
        </div>
        <div>
          <p class="ta_left einzel_text"></p>
        </div>
        <br />
        <div>
          <p class="ta_left einzel_text">Listed since: 20.08.2002</p>
        </div>
        <br><br><br><br>

Output:

    Hit 7 out of 120517
    name 1
    type: one (for example)
    Adress: Paris, 3ne Boulevard Saint Lo
    Telefon:048 + 334555664 , Fax: 048 + 334555667
    MyWeb-Nummer: 222237520031111
    Webmaster:
    master
    Listed since: 20.08.2002

One of the anchor tags has an email address as the href attribute. Do you need to collect that as well?
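
The loop above relies on Perl's range operator in scalar context (the "flip-flop") to keep only the text between "Hit" and "Listed since". As a minimal sketch of that one idea, assuming plain strings instead of parser tokens (the helper name `lines_between` and the sample list are made up for illustration):

```perl
use strict;
use warnings;

# Return the items from the first one matching $start through the
# next one matching $end (inclusive), using the flip-flop operator.
sub lines_between {
    my ($start, $end, @lines) = @_;
    my @kept;
    for my $line (@lines) {
        # In scalar context ".." is false until the left test matches,
        # then true up to and including the item where the right test matches.
        if ($line =~ $start .. $line =~ $end) {
            push @kept, $line;
        }
    }
    return @kept;
}

# Hypothetical sample data standing in for the text tokens of one page
my @sample = (
    'menu',
    'Hit 7 out of 120517',
    'name 1',
    'Listed since: 20.08.2002',
    'footer',
);
print "$_\n" for lines_between(qr/Hit/, qr/Listed since/, @sample);
```

One thing to keep in mind when looping over many files: each `..` operator keeps its own state, even across calls to the subroutine that contains it. If a file has a "Hit" line but no "Listed since" line, the range stays open and the following file is collected from its very first token.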
[Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1 (Scribe) on Sep 26, 2010 at 12:03 UTC
Hello wfsp, hello roboticus, hello dear community!

Many thanks for the quick replies, and many thanks to all the other posters too. I am very happy to be here. This is a great place to be!

I am a beginner on Linux (I run OpenSuse 11.4 Milestone 1, and OpenSuse 11.3 on a second machine). wfsp and roboticus, your approaches look very impressive and I want to try out both. One question comes to mind; perhaps you have already answered it in the code you wrote and I have not seen it. I am a bloody newbie.

Here is the question: where do I put the large number of HTML files that need to be parsed? Do I have to call them in the script? How do I do that? At the moment they are all in one folder (note: more than 10,000 files). I want to read and extract the content of each HTML file and create a single new text file with all the results. I'm only interested in the content containing the words mentioned above.

Ah, yes: the anchor tag with the e-mail address is important too. I want to collect this e-mail address as well.

It is important to have clean output, that is, text with line breaks. wfsp, the output of your approach is exactly what I want. This format is the preferred one:

    Hit 7 out of 120517
    name 1
    type: one (for example)
    Adress: Paris, 3ne Boulevard Saint Lo
    Telefon:048 + 334555664 , Fax: 048 + 334555667
    MyWeb-Nummer: 222237520031111
    Webmaster:
    master
    Listed since: 20.08.2002

Superb! I need the results of the parsing written in this format, and all the results should be written to only one text file. That is important.

Again, the question (you can probably see I am new to Linux too): where do I store the HTML files that need to be parsed, and where are the results written to? Do I have to write these locations into the code? By the way, on a Windows machine it has to look something like the following, doesn't it?

    # single quotes keep the backslashes literal
    my $HTML_dir = 'C:\htmlperl';
    my $output   = 'C:\htmlperl\output.txt';
    my $file     = $ARGV[0];

or in general:

    use File::Find::Rule;

    # folder where the HTML files that need to be parsed are stored
    my $html_dir = '/path/to/dir/with/html.files';

    # fetch all .html files from the directory
    my @html_files = File::Find::Rule->file->name('*.html')->in($html_dir);

    for my $file (@html_files) {
        # parse the files
        # store all results from the HTML files in only one txt file
    }

Sorry for the stupid newbie question! ;-) But I am very glad to have found a superb place to ask all the questions I have in mind. This is a great place to learn!

Many thanks to all of you! Looking forward to hearing from you.

Best regards,
perlbeginner1
Re^8: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by bart (Canon) on Sep 26, 2010 at 13:25 UTC
> Where to put the large number of HTML-files, that need to be parsed: Do i have to call them in the script? How to do that!?

I assume you mean "Do I have to name them all in a script?" No, you don't. You can put them anywhere you like (but preferably not mixed up with the unrelated rest of your files) and use glob in your script to get a complete list of all those files in one directory, or possibly even in adjacent directories:

    # all html files in one directory
    my @files = glob 'path/to/dir/*.html';

or

    # all html files in all (direct, sibling) subdirectories in a directory
    my @files = glob 'path/to/dir/*/*.html';

If you need an even more elaborate directory structure, then you can use File::Find or one of its derivatives to find the names of all html files, recursively.

You then continue to parse each file, one at a time. You can use a regexp substitution, `s/\.html$/.txt/`, to produce the name for the text file, if you want to put it right beside the original file. You can do a path substitution using `abs2rel`/`rel2abs` from File::Spec/File::Spec::Functions to put the new file in a different directory if you want to preserve the directory structure:

    use File::Spec::Functions qw(rel2abs abs2rel);
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);  # relocate
    $txt =~ s/\.html$/.txt/;                                 # extension

If your directory tree is deep, you may have to create the target directory first, for example with mkpath, before attempting to open the text file.

If you want all text files to be in one and the same directory, you can just use File::Basename's `basename` to strip the directory from the path.
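
As a rough sketch of the last suggestion (all text files in one directory), assuming the directory names `html` and `txt` as placeholders and a made-up helper called `txt_name_for`:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec::Functions qw(catfile);

# Map an HTML file name to a .txt name inside a single target directory.
sub txt_name_for {
    my ($html_file, $txt_dir) = @_;
    my $name = basename($html_file);   # strip the directory part
    $name =~ s/\.html$/.txt/;          # swap the extension
    return catfile($txt_dir, $name);
}

# 'html' and 'txt' are placeholder directory names for this sketch.
my @files = glob 'html/*.html';
print "$_ -> ", txt_name_for($_, 'txt'), "\n" for @files;
```

Note that files with the same base name in different source directories would map to the same target name here, so this only fits a flat layout or unique file names.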
Re: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by wfsp (Abbot) on Sep 26, 2010 at 13:51 UTC
To get the email address as well, replace my while loop with

    my (@text, $found_start);
    while (my $t = $p->get_token){
      my $txt;
      if ($t->is_text){
        $txt = $t->as_is;
        for ($txt){
          s/^\s+//;
          s/\s+$//;
        }
        next unless $txt;
        $found_start++ if $txt =~ /^Hit/;
      }
      elsif (
        $found_start and
        $t->is_start_tag(q{a}) and
        $t->get_attr(q{href})
      ){
        my $href = $t->get_attr(q{href});
        if ($href =~ /mailto:/i){
          $txt = $href;
        }
        else {
          next;
        }
      }
      else{
        next;
      }
      next unless $found_start;
      push @text, $txt;
      last if $txt =~ /Listed since/;
    }

Output:

    Hit 7 out of 120517
    name 1
    type: one (for example)
    Adress: Paris, 3ne Boulevard Saint Lo
    Telefon:048 + 334555664 , Fax: 048 + 334555667
    MyWeb-Nummer: 222237520031111
    Webmaster:
    mailto: webmaster@demosite.fr
    master
    Listed since: 20.08.2002

> All the output should be written in only one new text file.

Well, open a new text file for writing. :-) See open for how to do that.

Bart has given some excellent tips on how to get a list of HTML files so that you can loop over them.

Good luck!
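
One possible way to wire these pieces together is sketched below: open the output file once, loop over the HTML files, and append whatever each file yields. This is only a sketch under assumptions. The helper names `collect_into` and `plain_lines` are invented here, and `plain_lines` is a stand-in that merely returns the trimmed non-blank lines of a file; in practice the extractor passed in would wrap the token loop shown above. The output name `CollectedInfo.txt` and the use of command-line arguments follow the earlier reply in this thread.

```perl
use strict;
use warnings;

# Append the lines extracted from each input file to one output file.
# $extract is a code ref that takes a file name and returns a list of lines.
sub collect_into {
    my ($out_file, $extract, @html_files) = @_;
    open my $out, '>', $out_file
        or die "Can't write '$out_file': $!\n";
    for my $file (@html_files) {
        my @lines = $extract->($file);
        next unless @lines;              # nothing found in this file
        print {$out} "$_\n" for @lines;
        print {$out} "\n";               # blank line between records
    }
    close $out or die "Can't close '$out_file': $!\n";
    return;
}

# Stand-in extractor (an assumption for this sketch, not a real parser):
# returns the non-blank lines of a file with surrounding whitespace removed.
sub plain_lines {
    my ($file) = @_;
    open my $in, '<', $file
        or do { warn "Can't open file '$file', skipped!\n"; return };
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @lines, $line if length $line;
    }
    close $in;
    return @lines;
}

# Only runs when file names are given on the command line.
collect_into('CollectedInfo.txt', \&plain_lines, @ARGV) if @ARGV;
```

Passing the extractor as a code ref keeps the file handling separate from the parsing, so either of the approaches in this thread could be dropped in without touching the loop.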
Re^2: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1 (Scribe) on Sep 26, 2010 at 18:45 UTC

