PerlMonks  

Re: using the headers method of HTML::TableExtract to find an image

by brainpan (Monk)
on Apr 03, 2001 at 00:16 UTC (#69112)


in reply to using the headers method of HTML::TableExtract to find an image

I should have known better than to create a root node that only contained one line of actual perl code. Let's try this again, this time fueled by a bit more sleep.

My goal is to extract the data from a table (for this example we'll use this one), where I know only the headers for the fields. Thanks to HTML::TableExtract's headers method, this is quite simple:

    use strict;
    use HTML::TableExtract;

    # I'm using LWP in the real code, but this is a minimalistic attempt at a working example
    my $html_doc_name = '/tmp/symbols.html';
    my $html_doc_string;
    my $te = new HTML::TableExtract( headers => ['Character', 'Entity'] );
    my $ts;
    my $row;

    undef $/; # the absence of this one little line always causes me so much trouble

    open(HTML, $html_doc_name) or die "Couldn't open html file: $!\n";
    $html_doc_string = <HTML>;
    close(HTML) or die "Couldn't close html file: $!\n";

    $te->parse($html_doc_string);

    # Examine all matching tables
    foreach $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach $row ($ts->rows) {
            print join("\t\t", @$row), "\n";
        }
    }


This gives me the data I'm looking for. However, if a header I'm looking for is an image (usually of stylized text stating what the column represents), this ceases to work. Say that, rather than being labeled 'Character' and 'Entity', those columns were headed by <img src="http://www.htmlhelp.com/images/Character.jpeg"> and <img src="http://www.htmlhelp.com/images/Entity.jpeg">, respectively. With this one seemingly minor change to the headers, the code stops working, even if I make the appropriate modifications to the header criteria. As stated above, my suspicion is that, because the image URLs are now part of tags handled by HTML::Parser rather than plain text, HTML::TableExtract skips over them and looks only at the plaintext portion of the HTML.

My question is this: is there a way to make TableExtract look in the image tags for my selection criteria? If I can't do that directly, can I tell HTML::Parser itself to treat image tags as plain text (presumably making TableExtract work as it does with plaintext headers)? Or is there some other method entirely that I should be using?
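To make the "treat image tags as plain text" idea concrete, here is a minimal sketch of one possible workaround: rewrite the HTML string before it is parsed, so that each image tag is replaced by the base name of its src attribute. The helper names img_label and img_tags_to_text are hypothetical (they are not part of HTML::TableExtract or HTML::Parser), the sample table is made up for illustration, and the regex assumes quoted src attributes, so it is likely too fragile for arbitrary real-world HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce an image URL to a bare label,
# e.g. "http://www.htmlhelp.com/images/Character.jpeg" -> "Character".
sub img_label {
    my ($src) = @_;
    my $name = $src;
    $name =~ s{.*/}{};         # drop everything up to the last slash
    $name =~ s{\.[^.]*\z}{};   # drop the file extension, if any
    return $name;
}

# Hypothetical helper: replace every <img ... src="..."> tag in the
# document with the label derived from its src attribute.
# Assumption: src values are quoted with ' or ".
sub img_tags_to_text {
    my ($html) = @_;
    $html =~ s{<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>}{ img_label($2) }gise;
    return $html;
}

# Made-up sample input resembling the image-header case described above.
my $html = <<'END_HTML';
<table>
<tr><th><img src="http://www.htmlhelp.com/images/Character.jpeg"></th>
    <th><img src="http://www.htmlhelp.com/images/Entity.jpeg"></th></tr>
<tr><td>&amp;</td><td>&amp;amp;</td></tr>
</table>
END_HTML

print img_tags_to_text($html);
```

If this holds up, the rewritten string could be handed to $te->parse() in place of $html_doc_string, leaving the header criteria as 'Character' and 'Entity'. That last step is an assumption and is not exercised in the sketch itself.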

Hopefully this time my question is clear enough to warrant something other than upvotes for effort. :).

And no, I don't own 27 pairs of sweatpants.
