Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I should have known better than to create a root node that only contained one line of actual perl code. Let's try this again, this time fueled by a bit more sleep.

My goal is to extract the data from a table (for this example we'll use this one), where I know only the headers for the fields. Thanks to HTML::TableExtract's headers method, this is quite simple:

use strict; use HTML::TableExtract; # I'm using LWP in the real code, but this is a minimalistic attempt a +t a working example my $html_doc_name = '/tmp/symbols.html'; my $html_doc_string; my $te = new HTML::TableExtract( headers => ['Character', 'Entity'] ); my $ts; my $row; undef $/; # the absence of this one little line always causes me + so much trouble open(HTML, $html_doc_name) or die "Couldn't open html file: $!\n"; $html_doc_string = <HTML>; close(HTML) or die "Couldn't close html file: $!\n"; $te->parse($html_doc_string); # Examine all matching tables foreach $ts ($te->table_states) { print "Table (", join(',', $ts->coords), "):\n"; foreach $row ($ts->rows) { print join("\t\t", @$row), "\n"; } }


This gives me the data I'm looking for. However, if the header I'm looking for is an image (usually of stylized text stating what the columns represent), this ceases to work. Say that, rather than those columns being labeled 'Character' and 'Entity' they were <img src="http://www.htmlhelp.com/images/Character.jpeg"> and <img src="http://www.htmlhelp.com/images/Entity.jpeg">, respectively. With this one, seemingly minor change to the headers, this code suddenly won't work, even if I make the appropriate modifications to the header criteria. As stated above, my suspicion is that this is due to the fact that, as the image urls are now HTML::Parser objects rather than plain text, HTML::TableExtract is skipping over them and looking only in the plaintext portion of the html. My question is this: is there a way to make TableExtract look in the image tags for my selection criteria? If I can't do that directly, can I tell HTML::Parser itself that I'd like it to treat image tags as plain text, (presumably making TableExtract work as it does with plaintext headers)? Is there perhaps some other method entirely which I should be using?

Hopefully this time my question is clear enough to warrant something other than upvotes for effort. :).

And no, I don't own 27 pairs of sweatpants.

In reply to Re: using the headers method of HTML::TableExtract to find an image by brainpan
in thread using the headers method of HTML::TableExtract to find an image by brainpan

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (3)
    As of 2019-06-26 00:04 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Is there a future for codeless software?



      Results (108 votes). Check out past polls.

      Notices?