http://www.perlmonks.org?node_id=10698

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Given some HTML like

<TR><TD>Foo:</TD><TD> bar </TD></TR>
I want to extract the string "bar" into a variable.

Note that the space on each side will always be there, even if the value string is "".

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: How can I find the contents of an HTML tag?
by mikfire (Deacon) on May 09, 2000 at 18:54 UTC

    Assuming that the distinguishing characteristic of the TD entry you want to extract is the leading and trailing space, I'd suggest a regex something like this:

    my( $var ) = $html =~ m#<TD> (.*?) </TD>#;
    print "We found it: $var\n" if defined $var;

    The capturing parens (.*?) say to save any characters found, possibly none. The ? after the * makes the match non-greedy: it takes the fewest possible characters needed to complete the match.

    The way to know whether the match succeeded is to test $var for definedness. Testing it for true/false will fail on the empty case, because Perl treats the empty string as false.
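As a rough illustration of the point about definedness, here is a minimal sketch. The empty-valued row and the variable names $by_truth, $by_defined and $none are made up for the example; only the regex comes from the reply above.

```perl
use strict;
use warnings;

# Hypothetical row whose value is empty: only the two padding spaces remain.
my $html = '<TR><TD>Foo:</TD><TD>  </TD></TR>';

# List assignment: $var gets the capture, or undef if the match fails.
my ($var) = $html =~ m#<TD> (.*?) </TD>#;

# A truth test cannot tell "matched an empty string" from "did not match".
my $by_truth   = $var         ? 'found' : 'not found';
my $by_defined = defined $var ? 'found' : 'not found';

# For comparison, a string with no TD cell at all leaves the variable undef.
my ($none) = 'no cells here' =~ m#<TD> (.*?) </TD>#;

print "truth test:   $by_truth\n";
print "defined test: $by_defined\n";
print "no match:     ", ( defined $none ? 'defined' : 'undef' ), "\n";
```

The match succeeds with an empty capture, so only the defined test reports it as found.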

    If the <TD>Foo:</TD> part will always occur immediately in front of the <TD> instances you're interested in, we can make the regex more robust:

    m#<TD>Foo:</TD><TD> (.*?) </TD>#
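To see why anchoring on the label helps, consider a sketch with two rows. The second row (Baz:/qux) is invented for the example, as are the names $first and $wanted.

```perl
use strict;
use warnings;

# Hypothetical table: two rows with the same cell layout.
my $html = '<TR><TD>Baz:</TD><TD> qux </TD></TR>'
         . '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Without the label, the regex stops at the first space-padded cell it meets.
my ($first) = $html =~ m#<TD> (.*?) </TD>#;

# With the label in the pattern, only the cell following "Foo:" can match.
my ($wanted) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#;

print "unanchored: $first\n";
print "anchored:   $wanted\n";
```

The unanchored pattern picks up the value from the wrong row, while the anchored one returns the value that belongs to Foo:.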

Re: How can I find the contents of an HTML tag?
by chromatic (Archbishop) on May 09, 2000 at 06:05 UTC

    One obvious regex-based solution is:

    if ( $html =~ m#<TR><TD>Foo:</TD><TD> (.*) </TD></TR># ) {
        $var = $1;
    }

Re: How can I find the contents of an HTML tag?
by nuance (Hermit) on May 09, 2000 at 15:30 UTC

    You could use a regular expression with the /g modifier:

    m#<TD>(.*?)</TD>#g

    In list context, this will return a list of all the captured matches.

    You can then select whichever term you want from the list:

    my $html = "<TR><TD>Foo:</TD><TD> bar </TD></TR>";
    my $var = ( $html =~ m#<TD>(.*?)</TD>#g )[1];
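One detail worth noting about the /g approach: because this pattern has no spaces around the capture, the padding stays in the result. A small sketch, where the @cells array and the trimming step are additions for illustration:

```perl
use strict;
use warnings;

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# In list context, m//g returns every capture: one entry per TD cell.
my @cells = $html =~ m#<TD>(.*?)</TD>#g;

# The second cell still carries its surrounding spaces.
my $var = $cells[1];

# Strip leading and trailing whitespace to get the bare value.
( my $trimmed = $var ) =~ s/^\s+|\s+$//g;

print "cells:   ", scalar @cells, "\n";
print "raw:     [$var]\n";
print "trimmed: [$trimmed]\n";
```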

    Note that you have to modify the asterisk with the question mark to specify non-greedy matching. If you don't, you will get just one big match, like

    Foo:</TD><TD> bar
    which is probably not what you wanted!
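The difference between the two quantifiers can be checked side by side. This is only a sketch; the names $greedy and $lazy are made up for the comparison.

```perl
use strict;
use warnings;

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Greedy: .* runs to the end and backs up to the LAST </TD> in the string.
my ($greedy) = $html =~ m#<TD>(.*)</TD>#;

# Non-greedy: .*? stops at the FIRST </TD> it can reach.
my ($lazy) = $html =~ m#<TD>(.*?)</TD>#;

print "greedy:     [$greedy]\n";
print "non-greedy: [$lazy]\n";
```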

    You should also add the /s modifier if the string you need to extract breaks across multiple lines. Specifically, /s allows the dot to match newline characters along with all the other characters.
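A short sketch of the /s point, assuming a hypothetical value that wraps onto a second line (the input string and the names $without_s and $with_s are invented for the example):

```perl
use strict;
use warnings;

# Hypothetical cell whose value contains a newline.
my $html = "<TR><TD>Foo:</TD><TD> bar\nbaz </TD></TR>";

# Without /s the dot refuses to cross the newline, so the match fails.
my ($without_s) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#;

# With /s the dot matches the newline too, and the whole value is captured.
my ($with_s) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#s;

print "without /s: ", ( defined $without_s ? "[$without_s]" : 'no match' ), "\n";
print "with /s:    ", ( defined $with_s    ? "[$with_s]"    : 'no match' ), "\n";
```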

Re: How can I find the contents of an HTML tag?
by davorg (Chancellor) on Jun 21, 2000 at 12:39 UTC

    Parsing HTML using regular expressions can be a complete minefield if you don't know exactly what your data will look like.

    In situations like this, you're far better off using a real parser, such as HTML::Parser or one of its sub-classes.
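For comparison, here is one way the same extraction might look with HTML::Parser. This is a sketch rather than a recommended implementation: it assumes the module is installed from CPAN and uses its version 3 handler API, and the @cells/$in_td bookkeeping and the lookup loop are made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

my @cells;         # text of each TD cell, in document order
my $in_td = 0;     # true while the parser is inside a TD element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Tag names are reported in lower case by default.
    start_h => [ sub { if ( $_[0] eq 'td' ) { $in_td = 1; push @cells, '' } }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

# Take the cell that follows the "Foo:" label and trim its padding.
my $var;
for my $i ( 0 .. $#cells - 1 ) {
    next unless $cells[$i] eq 'Foo:';
    ( $var = $cells[ $i + 1 ] ) =~ s/^\s+|\s+$//g;
    last;
}

print defined $var ? "We found it: $var\n" : "No Foo: cell found\n";
```

This is longer than the regex versions, but it does not depend on the exact case, spacing or attributes of the tags.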