PerlMonks  

How can I find the contents of an HTML tag?

( #10698=categorized question )
Contributed by Anonymous Monk on May 09, 2000 at 05:28 UTC


Description:

Given some HTML like

<TR><TD>Foo:</TD><TD> bar </TD></TR>
I want to extract the string "bar" into a variable.

Note the space on each side will be there, even if the value string is "".

Answer: How can I find the contents of an HTML tag?
contributed by mikfire

Assuming that the distinguishing characteristic of the TD entry you want to extract is the leading and trailing space, I'd suggest a regex something like this:

my( $var ) = $html =~ m#<TD> (.*?) </TD>#;
print "We found it: $var\n" if defined $var;

The capturing group (.*?) saves whatever characters it matches, possibly none. The question mark after the asterisk makes the match non-greedy, i.e., it takes the fewest characters needed to complete the match.

To tell whether the match succeeded, test $var for definedness. A plain true/false test on $var gives the wrong answer in the empty case, because Perl treats the empty string as false.
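As a rough illustration of the difference between the two tests, here is a small sketch. The helper name extract_value and both sample strings are made up for the example; only the regex comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured value, or undef when nothing matches.
sub extract_value {
    my ($html) = @_;
    my ($var) = $html =~ m#<TD> (.*?) </TD>#;
    return $var;
}

# Made-up inputs: one with an empty value, one with no value cell at all.
my $empty   = extract_value('<TR><TD>Foo:</TD><TD>  </TD></TR>');
my $missing = extract_value('<TR><TD>Foo:</TD></TR>');

print "empty value: matched\n"    if defined $empty;     # printed
print "empty value: true\n"       if $empty;             # not printed, "" is false
print "missing value: no match\n" if !defined $missing;  # printed
```

The empty value still counts as a successful match, which is exactly the case a truth test would miss.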

If the <TD>Foo:</TD> part will always occur immediately in front of the <TD> instances you're interested in, we can make the regex more robust:

m#<TD>Foo:</TD><TD> (.*?) </TD>#
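To show what anchoring on the label buys you, here is a minimal sketch with two rows. The extra "Baz:" row and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Made-up sample: two rows on one line, only the second is labelled "Foo:".
my $html = '<TR><TD>Baz:</TD><TD> qux </TD></TR>'
         . '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Unanchored pattern: takes the first cell that is padded with spaces.
my ($first) = $html =~ m#<TD> (.*?) </TD>#;

# Anchored on the label: takes the cell right after <TD>Foo:</TD>.
my ($foo) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#;

print "unanchored: $first\n";   # qux
print "anchored:   $foo\n";     # bar
```

Without the label, the pattern picks up whichever padded cell comes first, which may not be the one you are after.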

Answer: How can I find the contents of an HTML tag?
contributed by chromatic

One obvious regex-based solution is:

if ( $html =~ m#<TR><TD>Foo:</TD><TD> (.*?) </TD></TR># ) {
    $var = $1;
}

Answer: How can I find the contents of an HTML tag?
contributed by nuance

You could use a regular expression with the /g modifier:

m#<TD>(.*?)</TD>#g

In list context, this returns a list of all the captured strings.

You can then select whichever term you want from the list:

my $html = "<TR><TD>Foo:</TD><TD> bar </TD></TR>";
my $var = ( $html =~ m#<TD>(.*?)</TD>#g )[1];

Note that you have to modify the asterisk with the question mark to specify non-greedy matching. If you don't, you will get just one big match, like

Foo:</TD><TD> bar
which is probably not what you wanted!

You should also add the /s modifier, if the string you need to extract breaks across multiple lines. Specifically, /s allows the dot to match newline characters along with all the other characters.
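Here is a short sketch of the effect of /s, assuming a made-up input in which the second cell is split across two lines. The array names are just for illustration.

```perl
use strict;
use warnings;

# Made-up sample in which the second cell contains a newline.
my $html = "<TR><TD>Foo:</TD><TD> bar\nbaz </TD></TR>";

my @without_s = $html =~ m#<TD>(.*?)</TD>#g;    # dot stops at the newline
my @with_s    = $html =~ m#<TD>(.*?)</TD>#gs;   # dot may cross the newline

print scalar(@without_s), " cell(s) without /s\n";   # 1
print scalar(@with_s),    " cell(s) with /s\n";      # 2
```

Without /s the second cell is silently skipped, so the list index you expected to use may point at nothing.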

Answer: How can I find the contents of an HTML tag?
contributed by davorg

Parsing HTML using regular expressions can be a complete minefield if you don't know exactly what your data will look like.

In situations like this, you're far better off using a real parser, such as HTML::Parser or one of its sub-classes.
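For what it's worth, a parser-based version might look roughly like the sketch below. It assumes HTML::Parser is installed from CPAN (it is not part of core Perl) and uses its version 3 handler interface. The variable names and the "cell after the Foo: label" rule are my own choices for the example, not something the answer above prescribes.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, assumed to be installed

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

my @cells;        # text of every TD element seen so far
my $in_td = 0;    # true while the parser is inside a TD element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Tag names are reported in lower case by default.
    start_h => [ sub { if ( $_[0] eq 'td' ) { $in_td = 1; push @cells, '' } }, 'tagname' ],
    end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
    # Text can arrive in several chunks, so append rather than assign.
    text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;

# Pick the cell that follows the "Foo:" label and trim the padding spaces.
my $var;
for my $i ( 0 .. $#cells - 1 ) {
    next unless $cells[$i] eq 'Foo:';
    ( $var = $cells[ $i + 1 ] ) =~ s/^\s+|\s+$//g;
    last;
}
print "We found it: $var\n" if defined $var;
```

This is longer than the regex, but it keeps working when the markup changes case, gains attributes, or is spread over several lines.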
