OT: What a hideous sofa! (Re: Parsing with HTML::TreeBuilder::LibXML...)
by roboticus (Chancellor) on Sep 25, 2010 at 19:20 UTC
|
Perlbeginner1:
I don't intend to be rude, but your post sounds to me like someone walking into someone's home and exclaiming "Geez! What an ugly sofa! You should buy a new one!" All the while, you're wearing a burnt-orange shirt and a lime-green suit with purple socks.
I personally don't want all the glitter and beads on a website. I just want something that works well. And things are pretty nice, here, thanks.
Anyway, if each file contains one chunk of text you want, then I'd do something similar to:
use strict;
use warnings;
open OUF, '>', 'CollectedInfo.txt' or die "Can't write 'CollectedInfo.txt': $!";
while (my $FName = shift) {
    print OUF doit($FName), "\n\n";
}
close OUF;
sub doit {
    my $text;
    if (! open INF, '<', $_[0]) {
        print "Can't open file '$_[0]', skipped!\n";
        return '';
    }
    # slurp in the entire file
    local $/;
    $text = <INF>;
    close INF;
    # Trim off everything up to and including "Hit # out of #"
    # (/s lets "." match newlines, since the whole file is one string)
    $text =~ s/^.*?Hit\s+\d+\s+(?:out\s+)?of\s+\d+//s;
    # Trim off "Listed since:" and everything after it
    $text =~ s/listed since:.*$//is;
    # Code for cleaning up the text goes here
    # (left as exercise for the reader)
    # Return desired chunk of text
    return $text;
}
Note: untested, you can keep both pieces if (when) it breaks, etc.
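The two trims above could also be folded into a single capture. This is only a minimal sketch: it assumes the markers read "Hit N out of M" and "Listed since:" as in the sample page further down the thread, and extract_chunk is a made-up name, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the "Hit N out of M" marker
# and the "Listed since:" marker, or an empty string if either is missing.
sub extract_chunk {
    my ($text) = @_;
    # /s lets "." cross newlines; /i tolerates "listed" vs. "Listed"
    if ($text =~ /Hit\s+\d+\s+(?:out\s+)?of\s+\d+(.*?)listed\s+since:/si) {
        my $chunk = $1;
        $chunk =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        return $chunk;
    }
    return '';
}

print extract_chunk("junk\nHit 7 out of 120517\nname 1\nListed since: 20.08.2002\n"), "\n";
```

Returning an empty string when a marker is missing keeps the calling loop simple, at the cost of not saying which marker was absent.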
...roboticus
|
someone walking into someone's home and exclaiming "Geez! What an ugly sofa! You should buy a new one!" All the while, you're wearing a burnt-orange shirt and a lime-green suit with purple socks.
I don't care if you're wearing a $10,000 Armani suit, you don't go into someone's house and start criticising his sofas. I can take criticism, but I draw the line at sofas.
"Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert."
|
Sofa? Sofa! It's a settee if you don't mind!
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by ikegami (Patriarch) on Sep 25, 2010 at 19:20 UTC
|
unfortunatley this bullshit-forum killed all my html
No, it rendered your HTML. Just use "<p>" at the start of every paragraph, and "<c>...</c>"* around blocks of computer text (code, data, input, output).
You obviously know HTML, yet you're advocating having to learn yet another variant of UBB's markup to post here. That's a step backwards.
* — You could use standard HTML tags for code blocks too, but <c>..</c> is much more convenient since it handles escaping "&" and "<" for you.
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by AnomalousMonk (Archbishop) on Sep 25, 2010 at 21:00 UTC
|
|
Hello to all of you,
Many thanks for your answers, AND many thanks that you did not throw me out of this site.
Update - sorry for being so rude. I did not know how to type my post in the right way... So here it is again, with many, many thanks for your answers. And I am happy that you did not ban me from this site!
your perl-beginner
|
When you are posting, formatting advice is given above and more extensively below the text area. Then you press the 'preview' button and your post is rendered for your examination prior to submission. It wouldn't have looked how you wanted it then, had you checked it. If you'd read the formatting advice you'd have known how to fix it.
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by shmem (Chancellor) on Sep 25, 2010 at 20:48 UTC
|
Update - unfortunatley this bullshit-forum killed all my html - this is very very bad! @ the admins: Please throw away this bullshit forum and buy a new one. Have a closer look at the forums at a. devshed b. phpBuilder c. openSuse. etc etc. They do not use such loosy software. They are not so typical 90ties that everyone can see that the forum is pretty outdated! So - now i have to retype it again!
Please. We are all doing fine being outdated and zombies. PerlMonks is a site that steadfastly (some say slowly) moves through the programming universe in its own time. Grab all the 860.000+ nodes available here, convert them to a new shiny forum which is as usable as this obsolete engine, and come back again with a link.
update: the number of nodes is one order of magnitude higher: 860.000+, not 86.000+ (thanks ambrus and AnomalousMonk)
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by Anonymous Monk on Sep 25, 2010 at 19:03 UTC
|
- Markup in the Monastery
- For example domains, use the official names reserved for that purpose, i.e. "example.com", "example.net", "example.org" or "example.edu"
- We try to keep the language used here professional
Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by wfsp (Abbot) on Sep 26, 2010 at 09:44 UTC
|
#! /usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(*DATA)
    or die qq{can't parse html: $!\n};
my @text;
while (my $t = $p->get_token){
    next unless $t->is_text;
    my $txt = $t->as_is;
    # the range operator is true from the "Hit" line through "Listed since"
    if ($txt =~ /Hit/ .. $txt =~ /Listed since/){
        # trim leading and trailing whitespace
        for ($txt){
            s/^\s+//;
            s/\s+$//;
        }
        next unless $txt;
        push @text, $txt;
    }
}
print qq{$_\n} for @text;
__DATA__
<br><br>
<h2>Hit 7 out of 120517</h2>
<img
src="http://myweb.org/images/wappen/ni.gif"
class="wappen_pos"
width="45"
height="53"
alt="country"
title="countryname"
/>
<br>
<div style="width: 40em;"><br>
<div style="display: inline;">
<div class="logo_homepage">
<a
class="img_inl"
href="http://myWeb.org/222237520031111"
>
</a>
</div>
<br>
<div class="fm_linkeSpalte">
<h2>name 1</h2>
<br>
<span class="schulart_text">type: one (for example)</span>
<p class="einzel_text">
Adress: Paris, 3ne Boulevard Saint Lo<br /><br>
Telefon:048 + 334555664 , Fax: 048 + 334555667<br />
MyWeb-Nummer: 222237520031111 <br />
Webmaster:
<a
href="mailto: webmaster@demosite.fr"
class="p1"
>
master
</a>
<br />
</p>
</div>
<div>
<p class="ta_left einzel_text"></p>
</div>
<br />
<div>
<p class="ta_left einzel_text">Listed since: 20.08.2002</p>
</div>
<br><br><br><br>
Output:
Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
master
Listed since: 20.08.2002
One of the anchor tags has an email address as the href attribute. Do you need to collect that as well?
|
my $HTML_dir = 'C:\htmlperl';
my $output = 'C:\htmlperl\output.txt';
my $file = $ARGV[0];
or in general:
use File::Find::Rule;
# folder where the HTML files that need to be parsed are stored
my $html_dir = '/path/to/dir/with/html.files';
# fetch all .html files from the directory
my @html_files = File::Find::Rule->file->name( '*.html' )->in( $html_dir );
for my $file ( @html_files ) {
    # parse the files
    # store all results that you got from the HTML files in only one txt file.
}
Sorry for the stupid newbie question! ;-) But I am very, very glad to have found a great (a superb) place to be, and to ask all the questions that I have in mind. This is a great place to learn! Many thanks to all of you!
Looking forward to hearing from you...
best regards
perlbeginner1
|
# all html files in one directory
my @files = glob 'path/to/dir/*.html';
or
# all html files in all (direct, sibling) subdirectories in a directory
my @files = glob 'path/to/dir/*/*.html';
If you need an even more elaborate directory structure, then you can use File::Find or one of its derivatives to find the names of all html files, recursively.
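A recursive search with the core File::Find module might look roughly like this. It is a sketch only: find_html_files is a hypothetical helper name, and the start directory is taken from the command line purely for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not from the thread): return every *.html file
# below $root, at any depth, as a sorted list of paths.
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ is the bare file name and
            # $File::Find::name is the path including $root.
            push @found, $File::Find::name if -f $_ && /\.html\z/;
        },
        $root,
    );
    my @sorted = sort @found;
    return @sorted;
}

# Only runs when a start directory is given on the command line.
if (@ARGV) {
    print "$_\n" for find_html_files($ARGV[0]);
}
```

Sorting the result is optional; it just makes the processing order predictable from run to run.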
You then continue to parse each file, one at a time.
You can use a regexp substitution such as s/\.html$/.txt/ to produce the name for the text file, if you want to put it right beside the original file. To put the new file in a different directory while preserving the directory structure, you can do a path substitution using abs2rel/rel2abs from File::Spec/File::Spec::Functions:
use File::Spec::Functions qw(rel2abs abs2rel);
my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot); # relocate
$txt =~ s/\.html$/.txt/; # extension
If your directory tree is deep, you may have to create the target directory first, for example with mkpath (from File::Path), before attempting to open the text file.
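Put together, the relocation and the directory creation could be wrapped like this. It is a sketch under assumptions: txt_path_for is a hypothetical name, both roots are assumed to be absolute paths, and make_path is used as the newer File::Path interface in place of mkpath.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec::Functions qw(rel2abs abs2rel);

# Hypothetical helper: map an HTML path under $htmlroot to a .txt path
# under $txtroot, creating the target directory if it is missing.
sub txt_path_for {
    my ($file, $htmlroot, $txtroot) = @_;
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);    # relocate
    $txt =~ s/\.html\z/.txt/;                                  # swap extension
    make_path(dirname($txt));    # does nothing if the directory already exists
    return $txt;
}
```

The returned name can then be handed straight to open for writing.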
If you want all text files to be in one and the same directory, you can just use File::Basename's basename to strip the directory from the path.
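For the flat layout, a sketch along these lines may be enough. flat_txt_path is a hypothetical name; note that two HTML files with the same name in different directories would map to the same text file.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec::Functions qw(catfile);

# Hypothetical helper: drop the directory part and the .html suffix,
# then build the matching .txt name inside $outdir.
sub flat_txt_path {
    my ($file, $outdir) = @_;
    my $name = basename($file, '.html');    # e.g. "page" from ".../page.html"
    return catfile($outdir, "$name.txt");
}
```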
|
To get the email address as well, replace my while loop with:
my (@text, $found_start);
while (my $t = $p->get_token){
    my $txt;
    if ($t->is_text){
        $txt = $t->as_is;
        for ($txt){
            s/^\s+//;
            s/\s+$//;
        }
        next unless $txt;
        $found_start++ if $txt =~ /^Hit/;
    }
    elsif (
        $found_start
        and
        $t->is_start_tag(q{a})
        and
        $t->get_attr(q{href})
    )
    {
        my $href = $t->get_attr(q{href});
        if ($href =~ /mailto:/i){
            $txt = $href;
        }
        else {
            next;
        }
    }
    else{
        next;
    }
    next unless $found_start;
    push @text, $txt;
    last if $txt =~ /Listed since/;
}
Output:
Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
mailto: webmaster@demosite.fr
master
Listed since: 20.08.2002
All the output should be written in only one new text file.
Well, open a new text file for writing. :-) See open for how to do that.
Bart has given some excellent tips on how to get a list of HTML files so that you can loop over them.
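As a rough sketch of how the pieces might fit together: one output handle is opened once, and each file's extracted lines are appended to it. collect_into is a hypothetical name, and $extract stands for any code ref that returns the lines for one file (for instance the token loop above wrapped in a sub); neither comes from a module.

```perl
use strict;
use warnings;

# Hypothetical driver: run $extract on each file and write all results
# to a single output file, with a blank line between records.
sub collect_into {
    my ($out_name, $extract, @files) = @_;
    open my $out, '>', $out_name or die "Can't write '$out_name': $!";
    for my $file (@files) {
        print {$out} "$_\n" for $extract->($file);
        print {$out} "\n";    # blank line between records
    }
    close $out or die "Can't close '$out_name': $!";
    return scalar @files;     # number of files processed
}
```

Passing the extractor in as a code ref keeps the file loop independent of whichever parsing approach you settle on.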
Good luck!
|