How can I find the contents of an HTML tag?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Given some HTML like

<TR><TD>Foo:</TD><TD> bar </TD></TR>

I want to extract the string "bar" into a variable.

Note the space on each side will be there, even if the value string is "".

Originally posted as a Categorized Question.


Replies are listed 'Best First'.
Re: How can I find the contents of an HTML tag? by mikfire (Deacon) on May 09, 2000 at 18:54 UTC
Assuming that the distinguishing characteristic of the TD entry you want to extract is the leading and trailing space, I'd suggest a regex something like this: `my( $var ) = $html =~ m#<TD> (.*?) </TD>#; print "We found it: $var\n" if defined $var;` The part inside the capturing parens, `(.*?)`, says to save any characters found, possibly none, and to take the fewest characters needed to complete the match, i.e., to be non-greedy. The only way to know whether the match succeeded is to test for definedness. Testing for true/false will fail on the empty case because Perl treats the empty string as false. If the `<TD>Foo:</TD>` part always occurs immediately in front of the `<TD>` you're interested in, we can make the regex more robust: `m#<TD>Foo:</TD><TD> (.*?) </TD>#`
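A minimal sketch of the non-greedy capture and the definedness test described above. The helper name `extract_value` and the three sample strings are made up for illustration; only the regex shape comes from the reply.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the captured cell text,
# or undef when the pattern does not match at all.
sub extract_value {
    my ($html) = @_;
    # List assignment puts the first capture group into $var;
    # a failed match returns an empty list, leaving $var undefined.
    my ($var) = $html =~ m#<TD>Foo:</TD><TD> (.*?) </TD>#;
    return $var;
}

for my $html (
    '<TR><TD>Foo:</TD><TD> bar </TD></TR>',    # normal value
    '<TR><TD>Foo:</TD><TD>  </TD></TR>',       # empty value: only the two spaces remain
    '<TR><TD>Baz:</TD><TD> qux </TD></TR>',    # different label, so no match
) {
    my $var = extract_value($html);
    if ( defined $var ) {
        print "We found it: '$var'\n";
    }
    else {
        print "No match\n";
    }
}
```

The second sample is the case where a plain truth test would go wrong: the match succeeds, but the captured value is the empty string, which Perl treats as false.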
Re: How can I find the contents of an HTML tag? by chromatic (Archbishop) on May 09, 2000 at 06:05 UTC
One obvious regex-based solution is: `if ( $html =~ m#<TR><TD>Foo:</TD><TD> (.*) </TD></TR># ) { $var = $1; }`
Re: How can I find the contents of an HTML tag? by nuance (Hermit) on May 09, 2000 at 15:30 UTC
You could use a regular expression with the `/g` modifier: `m#<TD>(.*?)</TD>#g` In list context, this returns a list of all the matches. You can then select whichever term you want from the list: `my $html = "<TR><TD>Foo:</TD><TD> bar </TD></TR>"; my $var = ( $html =~ m#<TD>(.*?)</TD>#g )[1];` Note that you have to modify the asterisk with the question mark to specify non-greedy matching. If you don't, you will get just one big match, like `Foo:</TD><TD> bar`, which is probably not what you wanted! You should also add the `/s` modifier if the string you need to extract breaks across multiple lines. Specifically, `/s` allows the dot to match newline characters along with all the other characters.
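A short sketch comparing the variants discussed above: non-greedy with `/g`, greedy with `/g`, and the effect of `/s`. The multi-line sample string is an assumption added for illustration and is not part of the original question.

```perl
use strict;
use warnings;

my $html = "<TR><TD>Foo:</TD><TD> bar </TD></TR>";

# Non-greedy with /g in list context: one capture per cell.
my @cells = $html =~ m#<TD>(.*?)</TD>#g;

# Greedy: a single capture running from the first <TD> to the last </TD>.
my @greedy = $html =~ m#<TD>(.*)</TD>#g;

# Hypothetical input where the second cell is split across two lines.
my $multiline = "<TR><TD>Foo:</TD><TD> bar\nbaz </TD></TR>";

# Without /s the dot cannot cross the newline, so the second cell is missed.
my @without_s = $multiline =~ m#<TD>(.*?)</TD>#g;

# With /s the dot also matches newlines, so both cells are captured.
my @with_s = $multiline =~ m#<TD>(.*?)</TD>#gs;

print scalar(@cells), " cells; the second is '$cells[1]'\n";
print "greedy capture: '$greedy[0]'\n";
print scalar(@without_s), " match without /s, ", scalar(@with_s), " with /s\n";
```

Note that this pattern keeps the spaces around the value, so the second cell comes back as `' bar '`. Put the literal spaces in the pattern, or trim afterwards, if you want just `bar`.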
Re: How can I find the contents of an HTML tag? by davorg (Chancellor) on Jun 21, 2000 at 12:39 UTC
Parsing HTML using regular expressions can be a complete minefield if you don't know exactly what your data will look like. In situations like this, you're far better off using a real parser, such as HTML::Parser or one of its sub-classes.
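As a rough illustration of the parser approach, here is a sketch using HTML::TokeParser, which ships in the same CPAN distribution as HTML::Parser and inherits from it. It assumes that distribution is installed. Picking the second cell by position is an assumption carried over from the sample input, not something the reply specifies.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<TR><TD>Foo:</TD><TD> bar </TD></TR>';

# Passing a scalar reference makes the parser read from the string itself.
my $parser = HTML::TokeParser->new( \$html )
    or die "Could not create parser";

my @cells;

# get_tag skips ahead to the next <td> start tag; the parser lowercases tag names.
while ( $parser->get_tag('td') ) {
    # get_trimmed_text returns the text up to </td>, with leading and
    # trailing whitespace removed.
    push @cells, $parser->get_trimmed_text('/td');
}

# Assumption: the value we want is the second cell in the row.
my $var = $cells[1];
print "We found it: '$var'\n" if defined $var;
```

Unlike the regexes, this does not depend on tag case, attributes on the `<TD>` tags, or the exact whitespace around the value.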
