Here's my stab at the parsing using
HTML::TreeBuilder. You should be able to pick out what you need.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper;
$Data::Dumper::Indent = 1;
my $file_name = q{monk.html};
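# Build the parse tree; parse_file returns undef if the file can't be opened.
my $t = HTML::TreeBuilder->new;
$t->parse_file($file_name)
    or die qq{Cannot parse $file_name: $!\n};
# Collect every <meta> element and dump its attributes.
my @meta = $t->look_down(
    q{_tag}, q{meta},
);
print q{-} x 10, qq{\n};
print qq{meta\n};
print q{-} x 10, qq{\n};
for my $ele (@meta){
    my %attr = $ele->all_external_attr;
    print Dumper \%attr;
    print q{-} x 10, qq{\n};
}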
# Collect every <a> element; print its text, then its attributes.
my @links = $t->look_down(
    q{_tag}, q{a},
);
print q{-} x 10, qq{\n};
print qq{links\n};
print q{-} x 10, qq{\n};
for my $ele (@links){
    print $ele->as_trimmed_text, qq{\n};
    my %attr = $ele->all_external_attr;
    print Dumper \%attr;
    print q{-} x 10, qq{\n};
}
# Collect every <area> element (image map regions) and dump its attributes.
my @areas = $t->look_down(
    q{_tag}, q{area},
);
print q{-} x 10, qq{\n};
print qq{areas\n};
print q{-} x 10, qq{\n};
for my $ele (@areas){
    my %attr = $ele->all_external_attr;
    print Dumper \%attr;
    print q{-} x 10, qq{\n};
}
Output:
----------
meta
----------
$VAR1 = {
  'content' => 'this is some meta content',
  'name' => 'description'
};
----------
$VAR1 = {
  'content' => 'cars bikes sales call wheels engine fast',
  'name' => 'keywords'
};
----------
$VAR1 = {
  'http-equiv' => 'Refresh',
  'content' => '300;URL=\'http://web.asite.com/tmpgifs/zz/\''
};
----------
----------
links
----------
linktext
$VAR1 = {
  'href' => 'http://www.gone.com'
};
----------
linktext2
$VAR1 = {
  'href' => 'http://www.gone2.com'
};
----------
linktext3 (bold word)
$VAR1 = {
  'href' => 'http://www.gone3.com'
};
----------
text next to image
$VAR1 = {
  'href' => 'www.linkfromimage.com1'
};
----------
$VAR1 = {
  'href' => 'www.linkfromimage.com2'
};
----------
email link text
$VAR1 = {
  'href' => 'mailto:deepheat@bbb.com'
};
----------
----------
areas
----------
$VAR1 = {
  'href' => '/destinations/western-cape/map.aspx',
  'coords' => '121,380,172,400',
  'shape' => 'rect',
  'title' => 'abc'
};
----------
$VAR1 = {
  'href' => '/destinations/free-state/map.aspx',
  'coords' => '262,214,301,241',
  'title' => 'Free State Map',
  'shape' => 'rect'
};
----------
$VAR1 = {
  'alt' => 'Free State Map2',
  'href' => '/destinations/free-state/map2.aspx',
  'coords' => '262,214,301,241',
  'shape' => 'rect'
};
----------
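If you only want particular values rather than a full dump of every attribute, something along these lines might do. This is just a sketch: the sample markup and the @pairs array are made up for illustration, and it parses from a string with new_from_content so it runs without monk.html. The code ref passed to look_down keeps only the anchors that actually carry an href.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, made up for this sketch.
my $html = <<'HTML';
<html><body>
<a href="http://www.gone.com">linktext</a>
<a name="top">anchor without href</a>
<a href="mailto:deepheat@bbb.com">email link text</a>
</body></html>
HTML

my $t = HTML::TreeBuilder->new_from_content($html);

# Keep only <a> elements that have an href attribute.
my @links = $t->look_down(
    q{_tag}, q{a},
    sub { defined $_[0]->attr(q{href}) },
);

# Pair each link's text with its target.
my @pairs = map { [ $_->as_trimmed_text, $_->attr(q{href}) ] } @links;
printf qq{%s => %s\n}, @$_ for @pairs;

$t->delete;    # free the tree once the values are extracted
```

The same pattern should work for the meta and area elements: swap the tag name and the attribute you want from attr.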