
No offense to andye (still good advice), but parsing HTML with a regexp is a bad idea.

Stick with a tried-and-true module. It is far less likely to break on you (breakage usually comes from bad HTML, not from your code) and it leaves room for further learning.
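To give a rough idea of how a regexp breaks, here is a minimal sketch. The pattern in `$naive` is one I made up for illustration (it is not andye's code), and the three sample anchors are hypothetical too. It assumes href is the first attribute and is always double-quoted, which real pages do not guarantee.

```perl
use strict;
use warnings;

# Hypothetical naive pattern (an assumption for illustration only):
# expects href to come first and to be wrapped in double quotes.
my $naive = qr/<a\s+href="([^"]*)"/i;

# Three anchors a browser would accept without complaint.
my @samples = (
    '<a href="one.html">One</a>',
    "<a href='two.html'>Two</a>",
    '<a class="nav" href="three.html">Three</a>',
);

# Record the captured href for each sample, or undef when the pattern misses.
my @found;
for my $html (@samples) {
    my ($href) = $html =~ $naive;
    push @found, $href;
}

# Report what the pattern saw for each sample.
for my $i (0 .. $#samples) {
    my $result = defined $found[$i] ? $found[$i] : '(no match)';
    print "$samples[$i]\n    => $result\n";
}
```

Only the first sample yields an href. The single-quoted and reordered attributes slip past the pattern, and patching the regexp for each new case is exactly the work a parser module already does for you.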

Take for example HTML::TokeParser:

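use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TokeParser;

my $url = shift or die "usage: $0 URL\n";   # URL taken from the command line
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;

my $p = HTML::TokeParser->new(\$content);   # parse from a reference to the page
while (my $token = $p->get_tag("a")) {
    my $href = $token->[1]{href};           # $token->[1] is the attribute hash
    next unless defined $href;              # skip anchors without an href
    my $text = $p->get_trimmed_text("/a");  # link text up to the closing </a>
    print "$href => $text\n";
}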

This looks intimidating, and it is. :) However, by learning how to use modules like TokeParser, you'll not only get a better handle on what you want to do, you'll also learn more about Perl.

Also, if you plan on doing this often, I suggest picking up a copy of Perl & LWP. It's a good resource for interacting with websites.

John J Reiser
newrisedesigns.com

In reply to Re:^2 (nrd) Extracting attributes from anchor tags in an HTML page by newrisedesigns
in thread Extracting href's by Scott_J
