Contributed by Anonymous Monk on May 09, 2000 at 05:28 UTC


Given some HTML like

<TR><TD>Foo:</TD><TD> bar </TD></TR>
I want to extract the string "bar" into a variable.

Note the space on each side will be there, even if the value string is "".

Answer: How can I find the contents of an HTML tag?
contributed by mikfire

Assuming that the distinguishing characteristic of the TD entry you want to extract is the leading and trailing space, I'd suggest a regex something like this:

my( $var ) = $html =~ m#<TD> (.*?) </TD>#;
print "We found it: $var\n" if defined $var;

The capturing parens around .*? save whatever characters are matched, possibly none. The ? makes the quantifier non-greedy, i.e., it takes the fewest possible characters needed to complete the match.

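With the list assignment above, test $var for definedness to know whether the match succeeded. Testing $var for true/false will give the wrong answer in the empty case, because Perl treats the empty string (and the string "0") as false. Alternatively, test the return value of the match itself rather than the captured string.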
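To make the definedness point concrete, here is a minimal sketch. The helper name extract_value and the sample rows are made up for illustration; only the pattern comes from the answer above. An empty cell and a cell containing 0 both produce a defined but false value, while a failed match produces undef.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns the captured cell text,
# or undef when the pattern does not match at all.
sub extract_value {
    my ($html) = @_;
    my ($var) = $html =~ m#<TD> (.*?) </TD>#;
    return $var;
}

for my $html (
    '<TR><TD>Foo:</TD><TD> bar </TD></TR>',    # ordinary value
    '<TR><TD>Foo:</TD><TD>  </TD></TR>',       # empty value, both spaces remain
    '<TR><TD>Foo:</TD><TD> 0 </TD></TR>',      # value that is false in Perl
    '<TR><TD>Foo:</TD></TR>',                  # no matching cell
) {
    my $var = extract_value($html);
    if ( defined $var ) {
        print "matched: '$var'\n";
    }
    else {
        print "no match\n";
    }
}
```

A plain `if ($var)` test would report the second and third rows as failures even though the pattern matched.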

If the <TD>Foo:</TD> part will always occur immediately in front of the <TD> instances you're interested in, we can make the regex more robust:

m#<TD>Foo:</TD><TD> (.*?) </TD>#
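As a rough illustration of why the label helps, the sketch below runs both patterns against a two-row sample. The second row (Baz/qux) is invented for this example and is not part of the original question.

```perl
use strict;
use warnings;

# Sample input invented for illustration: two rows with different labels.
my $html = '<TR><TD>Baz:</TD><TD> qux </TD></TR>'
         . '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Without the label, the first space-padded cell wins.
my ($first) = $html =~ m#<TD> (.*?) </TD>#;

# With the label, only the cell that follows "Foo:" is captured.
my ($foo) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#;

print "unanchored: $first\n";    # prints "unanchored: qux"
print "anchored:   $foo\n";      # prints "anchored:   bar"
```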

Answer: How can I find the contents of an HTML tag?
contributed by chromatic

One obvious regex-based solution is:

if ( $html =~ m#<TR><TD>Foo:</TD><TD> (.*) </TD></TR># ) {
    $var = $1;
}

Answer: How can I find the contents of an HTML tag?
contributed by nuance

You could use a regular expression with the /g modifier. In list context, this will return a list of all the matches.

You can then select whichever term you want from the list:

my $html = "<TR><TD>Foo:</TD><TD> bar </TD></TR>";
my $var = ( $html =~ m#<TD>(.*?)</TD>#g )[1];
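Here is a small sketch of the same idea that keeps the whole list around. Since this pattern does not consume the padding spaces, the captured value is " bar " rather than "bar"; the trimming step is an addition for illustration, not part of the original answer.

```perl
use strict;
use warnings;

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# In list context, m//g returns the capture from every match.
my @cells = $html =~ m#<TD>(.*?)</TD>#g;    # ('Foo:', ' bar ')

# The pattern keeps the surrounding spaces, so trim them separately.
my $var = $cells[1];
$var =~ s/^\s+|\s+$//g;

print "cells: ", scalar(@cells), "\n";    # prints "cells: 2"
print "value: '$var'\n";                  # prints "value: 'bar'"
```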

Note that you have to modify the asterisk with the question mark to specify non-greedy matching. If you don't, you will get just one big match, like

Foo:</TD><TD> bar

which is probably not what you wanted!
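The difference can be checked side by side. This is only a sketch using the sample row from the question; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Greedy: .* runs to the last </TD> on the line, so there is one match.
my @greedy = $html =~ m#<TD>(.*)</TD>#g;

# Non-greedy: .*? stops at the nearest </TD>, so each cell matches.
my @lazy = $html =~ m#<TD>(.*?)</TD>#g;

print scalar(@greedy), " greedy match: '$greedy[0]'\n";
print scalar(@lazy), " non-greedy matches: ",
    join( ' | ', map {"'$_'"} @lazy ), "\n";
```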

You should also add the /s modifier if the string you need to extract breaks across multiple lines. Specifically, /s allows the dot to match newline characters as well as all the other characters.
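A quick way to see the effect of /s is to match the same pattern with and without it. The two-line cell value below is invented for illustration.

```perl
use strict;
use warnings;

# Sample input invented for illustration: the cell text spans two lines.
my $html = "<TR><TD>Foo:</TD><TD> bar\nbaz </TD></TR>";

my ($without) = $html =~ m#<TD> (.*?) </TD>#;     # dot cannot cross "\n"
my ($with)    = $html =~ m#<TD> (.*?) </TD>#s;    # dot matches "\n" too

print defined $without ? "without /s: matched\n" : "without /s: no match\n";
print defined $with    ? "with /s: matched\n"    : "with /s: no match\n";
```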

Answer: How can I find the contents of an HTML tag?
contributed by davorg

Parsing HTML using regular expressions can be a complete minefield if you don't know exactly what your data will look like.

In situations like this, you're far better off using a real parser, such as HTML::Parser or one of its sub-classes.
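For comparison, here is one way the parser-based approach might look. It is a sketch, not the only way to use the module: it assumes HTML::Parser is installed from CPAN (it is not part of core Perl), uses its version 3 handler interface, and the variable names and the "take the cell after Foo:" logic are choices made for this example.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

my @cells;        # accumulated text of each <td> seen so far
my $in_td = 0;    # true while the parser is inside a <td> element

my $parser = HTML::Parser->new(
    api_version => 3,

    # 'tagname' is reported in lower case by default.
    start_h => [
        sub {
            my ($tag) = @_;
            if ( $tag eq 'td' ) { $in_td = 1; push @cells, ''; }
        },
        'tagname',
    ],
    end_h => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],

    # 'dtext' delivers text with HTML entities already decoded.
    # Text may arrive in several chunks, so append rather than assign.
    text_h => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

# Find the cell labelled "Foo:" and take the one after it, trimmed.
my $var;
for my $i ( 0 .. $#cells - 1 ) {
    next unless $cells[$i] eq 'Foo:';
    ( $var = $cells[ $i + 1 ] ) =~ s/^\s+|\s+$//g;
    last;
}

print "value: '$var'\n" if defined $var;
```

Unlike the regexes above, this does not depend on the case of the tags, on attributes inside them, or on the row sitting on a single line.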
