Re: Parsing HTML files to recover data...

in reply to Parsing HTML files to recover data...

This give you an array of hashes. It uses the second blockquote to trigger the start of the next record.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $p = HTML::TokeParser::Simple->new(*DATA)
  or die "couldn't parse DATA: $!\n";
  
my (@records, %record, $start);

while (my $t = $p->get_token){
  
  if ($t->is_start_tag('span')){
    if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
      $record{jobname} = $p->get_trimmed_text('span');
    }
    elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobseri
+al'){
      $record{jobserial} = $p->get_trimmed_text('span');
    }
    
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
      push @{$record{em}}, $p->get_trimmed_text('span');
    }
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices')
+{
        $record{offices} = $p->get_trimmed_text('span');
    }
  }

  if ($t->is_start_tag('blockquote')){
    next if exists $record{job_desc};
    $record{job_desc} = $p->get_trimmed_text('blockquote');
    #die Dumper \%record;
    push @records, \%record;
    %record = ();
  }
    
}

print Dumper \@records;

__DATA__
<p><b><span class="jobname">
Accounting Assistant, Level 2
</span>  

<span class="jobserial">(19203)</span>
<br />
Current members:
<br />
<span name="em">Plow, Elliot</span> 
<span name="em">Wang, Susan</span>
<br />

<span name=”offices”>Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description and it what I am rea
+lly looking to recover. 
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote
+>
<blockquote><a href=”companyHR.html”>Check for open positions now!</a>
+</blockquote>
[download]

output:

$VAR1 = {
  'job_desc' => 'Job descriptions here. This block quoted text contain
+s a job description and it what I am really looking to recover.',
  'em' => [
    'Plow, Elliot',
    'Wang, Susan'
  ],
  'jobserial' => '(19203) Current members:',
  'jobname' => 'Accounting Assistant, Level 2'
};
[download]

update: see my reply below.

Comment on Re: Parsing HTML files to recover data... Select or Download Code

Replies are listed 'Best First'.
Re^2: Parsing HTML files to recover data... by wfsp (Abbot) on Nov 22, 2006 at 14:07 UTC
As kaif points out my script above did indeed produce questionable output and I also couldn't figure out why. After a lot of head scratching and cursing I noticed that the OPs data had a variety of quotes around the attribute values. I changed them to ordinary quotes and it now works ok. kaif++ for spotting the snag. #!/usr/bin/perl use strict; use warnings; use HTML::TokeParser::Simple; use Data::Dumper; my $p = HTML::TokeParser::Simple->new(*DATA) or die "couldn't parse DATA: $!\n"; my (@records, %record, $start, $i); while (my $t = $p->get_token){ if ($t->is_start_tag('span')){ if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){ $record{jobname} = $p->get_trimmed_text('/span'); } elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobseri +al'){ $record{jobserial} = $p->get_trimmed_text('/span'); } elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){ push @{$record{em}}, $p->get_trimmed_text('/span'); } elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices') +{ $record{offices} = $p->get_trimmed_text('/span'); } } if ($t->is_start_tag('blockquote')){ next if $i; my $txt = $p->get_trimmed_text(('blockquote')); $record{job_desc} = $txt; push @records, {%record}; %record = (); $i++; } } print Dumper \@records; __DATA__ <p><b> <span class="jobname">Accounting Assistant, Level 2</span> <span class="jobserial">(19203)</span> <br />Current members:<br /> <span name="em">Plow, Elliot</span> <span name="em">Wang, Susan</span> <br /> <span name="offices">Huston</span> </p> <blockquote> Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover. </blockquote> <blockquote> <a href="#top">Go to the top of this page</a>. </blockquote> <blockquote> <a href="companyHR.html">Check for open positions now!</a> </blockquote> [download] `---------- Capture Output ---------- > "c:\perl\bin\perl.exe" _new.pl $VAR1 = [ { 'em' => [ 'Plow, Elliot', 'Wang, Susan' ], 'job_desc' => 'Job descriptions here. This block quoted text conta +ins a job description and it what I am really looking to recover.', 'offices' => 'Huston', 'jobserial' => '(19203)', 'jobname' => 'Accounting Assistant, Level 2' } ]; > Terminated with exit code 0.` [download]	[reply] [d/l] [select]
Re^2: Parsing HTML files to recover data... by kaif (Friar) on Nov 22, 2006 at 12:18 UTC
That's a questionable value for the 'jobserial' key. Looking at your code, I can't figure out why that could happen ...	[reply]

In Section Seekers of Perl Wisdom