As kaif points out, my script above did indeed produce questionable output, and at first I couldn't figure out why either.
After a lot of head scratching and cursing, I noticed that the OP's data had a variety of quote characters around the attribute values. I changed them to ordinary quotes and it now works OK.
kaif++ for spotting the snag.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $p = HTML::TokeParser::Simple->new(*DATA)
    or die "couldn't parse DATA: $!\n";

my (@records, %record, $i);

while (my $t = $p->get_token){
    if ($t->is_start_tag('span')){
        # pick the field from the span's class or name attribute
        if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
            $record{jobname} = $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial'){
            $record{jobserial} = $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
            # there can be several members, so collect them in an array
            push @{$record{em}}, $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices'){
            $record{offices} = $p->get_trimmed_text('/span');
        }
    }
    if ($t->is_start_tag('blockquote')){
        # only the first blockquote holds the job description
        next if $i;
        my $txt = $p->get_trimmed_text('/blockquote');
        $record{job_desc} = $txt;
        push @records, {%record};
        %record = ();
        $i++;
    }
}
print Dumper \@records;
__DATA__
<p><b>
<span class="jobname">Accounting Assistant, Level 2</span>
<span class="jobserial">(19203)</span>
<br />Current members:<br />
<span name="em">Plow, Elliot</span>
<span name="em">Wang, Susan</span>
<br />
<span name="offices">Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description
and it what I am really looking to recover.
</blockquote>
<blockquote>
<a href="#top">Go to the top of this page</a>.
</blockquote>
<blockquote>
<a href="companyHR.html">Check for open positions now!</a>
</blockquote>
---------- Capture Output ----------
> "c:\perl\bin\perl.exe" _new.pl
$VAR1 = [
  {
    'em' => [
      'Plow, Elliot',
      'Wang, Susan'
    ],
    'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
    'offices' => 'Huston',
    'jobserial' => '(19203)',
    'jobname' => 'Accounting Assistant, Level 2'
  }
];
> Terminated with exit code 0.
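Rather than fixing the quotes by hand as I did, the data could be cleaned up before it reaches the parser. This is only a rough sketch: it assumes the odd quotes were typographic (curly) ones, which I haven't confirmed for the OP's data, and `normalize_quotes` is a made-up helper name, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): map some common
# typographic quote characters to their plain ASCII equivalents.
# Assumes the input is already a decoded character string.
sub normalize_quotes {
    my ($html) = @_;
    # double-quote look-alikes -> "
    $html =~ tr/\x{201C}\x{201D}\x{201E}\x{2033}/"/;
    # single-quote look-alikes -> '
    $html =~ tr/\x{2018}\x{2019}\x{201A}\x{2032}/'/;
    return $html;
}

# Sample line with curly quotes around the attribute value.
my $raw   = "<span class=\x{201C}jobname\x{201D}>Accounting Assistant, Level 2</span>";
my $clean = normalize_quotes($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
# prints: <span class="jobname">Accounting Assistant, Level 2</span>
```

The cleaned string would then be handed to the parser in place of the raw data. If the stray quotes turn out to be something else, the character lists need adjusting.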