Parsing HTML files to recover data...

by UrbanHick (Sexton)
on Nov 21, 2006 at 17:39 UTC (#585311)
UrbanHick has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!

I have another task here at the office that looks like a good chance to expand my limited Perl skills some more. Basically, through a sad series of events we lost one of our servers and all the files on it. These files included a listing of jobs by number, pay rate, number of positions, etc.

Well, with the original files now gone, all we have left is the HTML website descriptions of these jobs from two years ago. I am planning on using something like the LWP module to import the HTML file, then parse it using the HTML code in the actual file to identify the strings I want to extract. Since the HTML file was generated from the original database file, it has a very regular form. A block of it might look something like this:

__DATA__
<p><b><span class="jobname"> Accounting Assistant, Level 2 </span>
<span class="jobserial">(19203)</span>
<br />
Current members:
<br />
<span name=”em”>Plow, Elliot</span>
<span name=“em”>Wang, Susan</span>
<br />
<span name=”offices”>Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description and it what I am really looking to recover.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
<blockquote><a href=”companyHR.html”>Check for open positions now!</a></blockquote>
__END__

What I am worried about is that I have a list of jobserial numbers here from the accounting department and my boss basically wants me to recover all of the Job description text, without grabbing the other two bits of text inside of the second and third sets of blockquote tags.

My question to you all is: can this be done with a fancy regex, or is there a module on CPAN that I missed that does this already? If so, could you kindly point me in the right direction?

Thank you very much,

-UH

Re: Parsing HTML files to recover data...
by blue_cowdawg (Prior) on Nov 21, 2006 at 19:22 UTC
        My question to you all is: can this be done with a fancy regex, or is there a module on CPAN that I missed that does this already? If so, could you kindly point me in the right direction?

    Take a look at this CUFP I posted a while back for some insight on how to parse HTML and extract data from it. In it I use HTML::TableContentParser and LWP::UserAgent to pull in HTML, extract data from tables, and trigger alarms based on that data.

    Similarly you could use HTML::TokeParser to do much the same sorts of things with your <blockquote>...</blockquote> HTML syntax above.

    If you have time to do some reading, take a look at the book Web, Graphics & Perl/Tk, published by O'Reilly, or Perl & LWP, also published by O'Reilly. The latter is my favorite on the subjects at hand.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      Thank you blue_cowdawg! These links look extremely promising. Getting my hands on the O'Reilly books might prove a bit difficult but the HTML::TokeParser looks very interesting indeed.

      -UH
Re: Parsing HTML files to recover data...
by ww (Bishop) on Nov 21, 2006 at 19:37 UTC
    If the .html is indeed as regular as portrayed (and all the job specs are in a single html page as I think you have suggested), this seems almost trivial.

    Based on the sample data you've shown and your (somewhat conflicted??) description of what your boss wants, I'm going to assume that you want to capture as much of the first <p> as is included in the jobname and jobserial <span>s, then skip over the (possibly outdated) incumbents, resuming your capture with the first blockquote.

    What I'm hoping this labored phrasing suggests is that designing a regex (or group of same) is at least as much about analysis of the source data as about coding.

    In other words, it matters little whether you use a non-greedy lookahead, a negated class, or any one of several other techniques that leap to mind to skip the 2nd and 3rd blockquotes. Each of those happens to be immediately followed by an <a href..., which makes them easy to distinguish and thus eases the way to satisfying your "without grabbing" requirement.
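As a minimal sketch of the lookahead idea above (not ww's own code): assuming the page source is already in a string, and assuming the navigation blockquotes always open directly with an `<a` tag, a negative lookahead can reject them. The variable `$page`, the sub name `descriptions`, and the trimmed-down sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, modeled on the block in the original question.
my $page = <<'HTML';
<blockquote>
Job descriptions here.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
<blockquote><a href="companyHR.html">Check for open positions now!</a></blockquote>
HTML

# Return the trimmed text of every blockquote that does not open with a link.
sub descriptions {
    my ($html) = @_;
    my @found;
    # (?!\s*<a\b) is a negative lookahead: it rejects a blockquote whose
    # first non-blank content is an <a ...> tag.
    # /s lets . match newlines, /i tolerates upper-case tags.
    while ($html =~ m{<blockquote>(?!\s*<a\b)\s*(.*?)\s*</blockquote>}gis) {
        push @found, $1;
    }
    return @found;
}

print "captured: $_\n" for descriptions($page);
```

If a real description ever starts with a link, this test would wrongly skip it, so it is worth checking the source pages for that case first.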

    Similarly, analysis of the initial info (again, assuming regularity) tells you that you want to start capturing with the line following <p><b><span class="jobname">
    and the numeric data immediately following ="jobserial">( (or, if you prefer, the digits between the parentheses after ="jobserial">, by which I mean to suggest an alternate algorithm/regex technique).

    Following any (or, better, several!) of the approaches suggested by the above may not be what you actually had in mind, but might still serve "to expand (your) Perl skills...."

    Of course, if you have a text editor that will remove .html tags and supports regexen, one approach might be to simply capture the webpage source (by whatever means: save_as from a browser, LWP, etc.), open the file in the editor, delete the tags, and use two simple regexen to replace

    Go to the top of this page.
    Check for open positions now!

      Thank you very much for your thoughtful response. Basically what my boss wants is the data back, even if I have to "cut and paste it". The problem is that it is in several of these large but extremely regular HTML files. Thus I am hoping that mastering this technique will allow me to use it many times and save me hours of boredom.

      -UH
Re: Parsing HTML files to recover data...
by GrandFather (Cardinal) on Nov 21, 2006 at 21:13 UTC

    For this sort of 'Parse and extract' from HTML problem I reach for HTML::TreeBuilder. A first cut solution might look like:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str  = do {local $/; <DATA>};
    my $tree = HTML::TreeBuilder->new;

    $tree->parse ($str);
    $tree->eof ();    # signal end of document so the tree is complete

    my @jobs;

    for my $para ($tree->find ('p')) {
        my @class   = $para->look_down ('class', 'jobname');
        my @names   = $para->look_down ('name', 'em');
        my @offices = $para->look_down ('name', 'offices');

        # Skip paragraphs that are not complete job entries
        next unless @class && @names && @offices;

        my $job = $class[0]->as_text ();
        $job .= ': ' . join '; ', map {$_->as_text ()} @names;
        $job .= ' (' . join (', ', map {$_->as_text ()} @offices) . ')';
        push @jobs, $job;
    }

    print join "\n", @jobs;

    __DATA__
    <p><b><span class="jobname">Sandbagger, Level 2</span>
    <span class="jobserial">(19000)</span><br />
    Current members:<br />
    <span name='em'>Fred</span><span name='em'>Wilma</span><br />
    <span name='offices'>Erewon</span>
    </p>
    <p><b><span class="jobname">Accounting Assistant, Level 2</span>
    <span class="jobserial">(19203)</span><br />
    Current members:<br />
    <span name='em'>Plow, Elliot</span><span name='em'>Wang, Susan</span><br />
    <span name='offices'>Huston</span>
    </p>
    <blockquote>
    Job descriptions here.
    This block quoted text contains a job description and it what I am really looking to recover.
    </blockquote>
    <blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
    <blockquote><a href='companyHR.html'>Check for open positions now!</a></blockquote>

    Prints:

    Sandbagger, Level 2: Fred; Wilma (Erewon)
    Accounting Assistant, Level 2: Plow, Elliot; Wang, Susan (Huston)

    I imagine that there might be lists of people filling a particular role by office. This code will munge all of those entries together and list the offices, thus losing the association of people with offices. That can be fixed by using a $para->look_down ('name', qr/em|offices/) then iterating over the list, spitting out employee names and office names as appropriate.
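A rough sketch of that last suggestion, under two assumptions that are not confirmed by the original pages: that a paragraph can hold several groups of `em` spans, and that each group is followed by the `offices` span it belongs to. The sample markup and the `%by_office` hash are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: two offices in one paragraph, names listed before each office.
my $html = <<'HTML';
<p><b><span class="jobname">Sandbagger, Level 2</span></b><br />
<span name="em">Fred</span><span name="offices">Erewon</span><br />
<span name="em">Wilma</span><span name="em">Barney</span><span name="offices">Huston</span>
</p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %by_office;    # office name => [ employee names ]
for my $para ($tree->find('p')) {
    my @pending;  # names seen since the last office span
    # One look_down call returns both kinds of span in document order.
    for my $span ($para->look_down(name => qr/^(?:em|offices)$/)) {
        if ($span->attr('name') eq 'em') {
            push @pending, $span->as_text;
        }
        else {
            # An office span closes the current group of names.
            push @{ $by_office{ $span->as_text } }, @pending;
            @pending = ();
        }
    }
}
$tree->delete;    # free the parse tree

for my $office (sort keys %by_office) {
    print "$office: ", join('; ', @{ $by_office{$office} }), "\n";
}
```

If the real pages list the office before the names, the grouping logic would need to be flipped accordingly.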


    DWIM is Perl's answer to Gödel

      Thank you very much for your detailed response. The example code is most instructive. I will likely use this code as a framework to fashion several programs, each parsing a specific document for the data that the Powers That Be want extracted from each old webpage.

      -UH
Re: Parsing HTML files to recover data...
by djp (Hermit) on Nov 22, 2006 at 02:01 UTC
    Just restore the lost files from backup.

      This was before my time, but apparently the backups were on site during a fire and "lost too".

      I know, I almost fell over when I heard that one for the first time too!

      -UH
        I trust somebody lost their job over that. They should have. Maybe you're the replacement?
Re: Parsing HTML files to recover data...
by Anonymous Monk on Nov 22, 2006 at 06:10 UTC
    I have had great success scraping data out of HTML files using XML::LibXML. This will parse the HTML into a DOM tree and allow XPath searches for the data. While this may be overkill in terms of both the learning curve and CPU cycles, the code required for coaxing the data out of the files will be pretty simple. You may also end up with code that is easily changed to solve any similar problem.
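For a feel of what that looks like, here is a small sketch assuming XML::LibXML is installed and the page is already in a string. The sample markup is a cut-down, hypothetical version of the block from the question, and the XPath expressions are one possible choice rather than the only one.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, cut-down sample of one job entry.
my $html = <<'HTML';
<html><body>
<p><span class="jobname">Accounting Assistant, Level 2</span>
<span class="jobserial">(19203)</span></p>
<blockquote>
Job descriptions here.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
</body></html>
HTML

# recover => 1 asks libxml2 to carry on past broken markup;
# suppress_errors => 1 keeps its complaints off STDERR.
my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# normalize-space() trims and collapses whitespace inside XPath itself.
my $name = $dom->findvalue('normalize-space(//span[@class="jobname"])');

# Select only blockquotes that contain no link anywhere inside them.
my @descriptions = map {
    my $text = $_->textContent;
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    $text;
} $dom->findnodes('//blockquote[not(.//a)]');

print "$name\n";
print "  $_\n" for @descriptions;
```

The `not(.//a)` predicate does the "without grabbing the other two blockquotes" part, on the assumption that real descriptions never contain links.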
Re: Parsing HTML files to recover data...
by chinamox (Scribe) on Nov 22, 2006 at 07:11 UTC

    UrbanHick-

    While I am still a newbie myself, this might help you in some way or other:

    # /s lets . match newlines, since the blockquote text spans several lines
    while ($page =~ /<blockquote>(.*?)<\/blockquote>/gs) {
        print "captured text: $1\n";
    }

    I think this will at least get you started down the right road with regexes. However, I would suggest that you listen to the silverbacks around here and go with the HTML::X modules.

    good luck,

    -mox

      Is it just me, or should this response have been one of the first, rather than the sixth? I mean, go HTML::TableContentParser, HTML::TokeParser, HTML::TreeBuilder, Template::Extract, and all the other modules --- but seriously, as a first go ... ?

      # Assuming the page contents are in $_
      ($jobname)     = m|<span class="jobname">\s*(.*?)\s*</span>|s;
      ($jobserial)   = m|<span class="jobserial">\s*\((.*?)\)\s*</span>|s;
      ($offices)     = m|<span name="offices">\s*(.*?)\s*</span>|s;
      ($description) = m|<blockquote>\s*(.*?)\s*</blockquote>|s;

      Please excuse my surprise.
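Building on those regexes, here is a sketch of how the accounting department's list of serial numbers might be applied when one page holds many entries. It assumes every entry has the jobname span, then the jobserial span, then its description blockquote, in that order. The two-entry sample, its description text, and the names `%wanted` and `%description` are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical page with two job entries; the description text is made up.
my $page = <<'HTML';
<p><b><span class="jobname"> Sandbagger, Level 2 </span>
<span class="jobserial">(19000)</span></p>
<blockquote>
Fills sandbags.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
<p><b><span class="jobname"> Accounting Assistant, Level 2 </span>
<span class="jobserial">(19203)</span></p>
<blockquote>
Keeps the books.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
HTML

# Serial numbers to recover (stand-in for the list from accounting).
my %wanted = map { $_ => 1 } qw(19203);

my %description;    # serial => { name => ..., text => ... }
while ($page =~ m{
        <span\s+class="jobname">\s*(.*?)\s*</span>        # capture 1: job name
        .*?
        <span\s+class="jobserial">\s*\((\d+)\)\s*</span>  # capture 2: serial digits
        .*?
        <blockquote>\s*(.*?)\s*</blockquote>              # capture 3: first blockquote after the serial
    }gsx) {
    my ($name, $serial, $text) = ($1, $2, $3);
    next unless $wanted{$serial};
    $description{$serial} = { name => $name, text => $text };
}

for my $serial (sort keys %description) {
    print "$serial ($description{$serial}{name}): $description{$serial}{text}\n";
}
```

Each pass of the loop consumes one whole entry, so the link-only blockquotes are skipped simply because the next match has to begin at the next jobname span.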

Re: Parsing HTML files to recover data...
by eriam (Beadle) on Nov 22, 2006 at 08:03 UTC
    I think you should go for Template::Extract (http://search.cpan.org/dist/Template-Extract/), which is really great. It uses the Template syntax to reconstruct a data structure from a template, using your data files as the source. Basically, it does the opposite job of the Template module. I use it to extract information from emails. Great module, many thanks to its author :)
Re: Parsing HTML files to recover data...
by wfsp (Abbot) on Nov 22, 2006 at 12:15 UTC
    This gives you an array of hashes. It uses the second blockquote to trigger the start of the next record.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    use Data::Dumper;

    my $p = HTML::TokeParser::Simple->new(*DATA)
      or die "couldn't parse DATA: $!\n";

    my (@records, %record, $start);

    while (my $t = $p->get_token){
      if ($t->is_start_tag('span')){
        if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
          $record{jobname} = $p->get_trimmed_text('span');
        }
        elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial'){
          $record{jobserial} = $p->get_trimmed_text('span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
          push @{$record{em}}, $p->get_trimmed_text('span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices'){
          $record{offices} = $p->get_trimmed_text('span');
        }
      }
      if ($t->is_start_tag('blockquote')){
        next if exists $record{job_desc};
        $record{job_desc} = $p->get_trimmed_text('blockquote');
        #die Dumper \%record;
        push @records, \%record;
        %record = ();
      }
    }

    print Dumper \@records;

    __DATA__
    <p><b><span class="jobname"> Accounting Assistant, Level 2 </span>
    <span class="jobserial">(19203)</span>
    <br />
    Current members:
    <br />
    <span name="em">Plow, Elliot</span>
    <span name="em">Wang, Susan</span>
    <br />
    <span name=”offices”>Huston</span>
    </p>
    <blockquote>
    Job descriptions here.
    This block quoted text contains a job description and it what I am really looking to recover.
    </blockquote>
    <blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
    <blockquote><a href=”companyHR.html”>Check for open positions now!</a></blockquote>
    output:
    $VAR1 = {
              'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
              'em' => [
                        'Plow, Elliot',
                        'Wang, Susan'
                      ],
              'jobserial' => '(19203) Current members:',
              'jobname' => 'Accounting Assistant, Level 2'
            };

    update: see my reply below.

      That's a questionable value for the 'jobserial' key. Looking at your code, I can't figure out why that could happen ...
      As kaif points out, my script above did indeed produce questionable output, and I also couldn't figure out why.

      After a lot of head scratching and cursing I noticed that the OP's data had a variety of quotes around the attribute values. I changed them to ordinary quotes and it now works OK.

      kaif++ for spotting the snag.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TokeParser::Simple;
      use Data::Dumper;

      my $p = HTML::TokeParser::Simple->new(*DATA)
        or die "couldn't parse DATA: $!\n";

      my (@records, %record, $start, $i);

      while (my $t = $p->get_token){
        if ($t->is_start_tag('span')){
          if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
            $record{jobname} = $p->get_trimmed_text('/span');
          }
          elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial'){
            $record{jobserial} = $p->get_trimmed_text('/span');
          }
          elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
            push @{$record{em}}, $p->get_trimmed_text('/span');
          }
          elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices'){
            $record{offices} = $p->get_trimmed_text('/span');
          }
        }
        if ($t->is_start_tag('blockquote')){
          next if $i;
          my $txt = $p->get_trimmed_text('blockquote');
          $record{job_desc} = $txt;
          push @records, {%record};
          %record = ();
          $i++;
        }
      }

      print Dumper \@records;

      __DATA__
      <p><b>
      <span class="jobname">Accounting Assistant, Level 2</span>
      <span class="jobserial">(19203)</span>
      <br />Current members:<br />
      <span name="em">Plow, Elliot</span>
      <span name="em">Wang, Susan</span>
      <br />
      <span name="offices">Huston</span>
      </p>
      <blockquote>
      Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.
      </blockquote>
      <blockquote>
      <a href="#top">Go to the top of this page</a>.
      </blockquote>
      <blockquote>
      <a href="companyHR.html">Check for open positions now!</a>
      </blockquote>
      ---------- Capture Output ----------
      > "c:\perl\bin\perl.exe" _new.pl
      $VAR1 = [
                {
                  'em' => [
                            'Plow, Elliot',
                            'Wang, Susan'
                          ],
                  'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
                  'offices' => 'Huston',
                  'jobserial' => '(19203)',
                  'jobname' => 'Accounting Assistant, Level 2'
                }
              ];

      > Terminated with exit code 0.
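
Rather than fixing the quotes by hand in every file, the mixed quote characters mentioned above could be normalized before parsing. This is only a sketch: it assumes the HTML has already been read and decoded into a Perl character string, and the sub name `straighten_quotes` is made up.

```perl
use strict;
use warnings;
use utf8;    # this script's own source contains literal curly quotes

# Replace typographic quotes with their plain ASCII equivalents.
sub straighten_quotes {
    my ($html) = @_;
    $html =~ tr/“”/"/;    # curly double quotes -> "
    $html =~ tr/‘’/'/;    # curly single quotes -> '
    return $html;
}

my $raw   = '<span name=”offices”>Huston</span>';
my $clean = straighten_quotes($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```

Note that this also straightens any curly quotes inside the job descriptions themselves, which may or may not matter for the recovered text.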
