Regexp to extract HTML link data

by hatter (Pilgrim)
on Jul 17, 2003 at 12:30 UTC (#275196)
hatter has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to work out a regexp which when given either:
    $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';
or
    $in = '<td><a href="index3.html">New index</a></td>';
will give me the link data regardless, and the image data should there be any. I've tried various combinations, starting from:

    my ($new,$hit) = ($in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m);

It looks simple enough, but it has stumped a couple of my friends, too. I'm trying to do it in a single regexp. The actual problem could check for the two bits separately, but it's got me stumped enough to want an answer out of curiosity (and doing it in two bits makes the rest of the code more complicated). FWIW, the link data varies; the image data is static.
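For readers following along, here is a small sketch of why the first attempt never captures the image. The regexp is unanchored, so the optional group is tried at offset 0, matches zero times there, and the greedy `.*` then swallows the image on its way to the link. The second sub is only one possible repair, not the thread's accepted answer; the names `original_attempt` and `lazy_prefix` are made up for this illustration.

```perl
use strict;
use warnings;

# The attempt from the question, unchanged.
sub original_attempt {
    my ($in) = @_;
    my ($new, $hit) = ($in =~ m#(foo.jpg)?.*(<a href=.*</a>)#m);
    return ($new, $hit);
}

# Hypothetical variant: anchor at the start and let a lazy prefix walk
# up to the image, so the optional group has a chance to match before
# the rest of the pattern looks for the link.
sub lazy_prefix {
    my ($in) = @_;
    my ($img, $link) = ($in =~ m#^(?:.*?(foo\.jpg))?.*?(<a href=.*</a>)#s);
    return ($img, $link);
}

for my $in ('<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
            '<td><a href="index3.html">New index</a></td>') {
    for my $try (\&original_attempt, \&lazy_prefix) {
        my ($img, $link) = $try->($in);
        print 'img=', (defined $img ? $img : 'undef'), " link=$link\n";
    }
}
```

With the first input, `original_attempt` leaves the image undefined while `lazy_prefix` returns `foo.jpg`; with the second input both leave it undefined and still return the link.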

the hatter

Title edit by tye

Re: Regexp riddles
by dragonchild (Archbishop) on Jul 17, 2003 at 12:36 UTC
    Don't parse HTML with a regex. Use HTML::Parser - that's why it exists.
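As a rough illustration of the parser route, the sketch below uses the version 3 event API of HTML::Parser to pick up the first `img` source and the first `a` target from a cell. The helper name `link_and_image` is invented for this example, and the code assumes the module is installed from CPAN; it is untested against anything beyond the two sample strings in the question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns (href, img_src) for one table cell.
# img_src is undef when the cell has no image.
sub link_and_image {
    my ($html) = @_;
    my ($href, $img);
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                # Keep only the first image and the first link seen.
                $img  = $attr->{src}  if $tag eq 'img' && !defined $img;
                $href = $attr->{href} if $tag eq 'a'   && !defined $href;
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return ($href, $img);
}

for my $in ('<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
            '<td><a href="index3.html">New index</a></td>') {
    my ($href, $img) = link_and_image($in);
    print "href - $href\nimg - ", (defined $img ? $img : ''), "\n";
}
```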

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Regexp riddles
by Abigail-II (Bishop) on Jul 17, 2003 at 12:38 UTC
    Your problem isn't well defined. How much variation can there be? If you have to support all HTML possibilities, you'd be much better off using a parser.

    Anyway, here's an untested attempt. Most likely, it breaks on your second example:

        my (undef, $new, $hit) = $in =~ m{
            <td><img \s+ src \s* = \s* (["']) (foo[.]jpg) \1 >
            (<a \s+ href \s* = (["']) [^"']* \4 > [^<]* </a>)
        }ix;

    Abigail

Re: Regexp riddles
by broquaint (Abbot) on Jul 17, 2003 at 12:42 UTC
    Under the blind assumption that your data won't change too much or become 'faulty' (otherwise you'd be using a parser, right?), something like this ought to do:

        my $re = qr{
            (?: <img \s+ .*? src=" ([^"]+) " .*? > )?
            <a \s+ .*? href=" ([^"]+) " .*? >
        }x;

        $in = '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>';
        my($href, $img) = grep defined, reverse $in =~ $re;
        print "href - $href\nimg - $img\n";

        $in = '<td><a href="index3.html">New index</a></td>';
        ($href, $img) = grep defined, reverse $in =~ $re;
        print "href - $href\nimg - $img\n";

        __output__

        href - index3.html
        img - foo.jpg
        href - index3.html
        img -

    See perlre for more info.
    HTH

    _________
    broquaint

      Thanks, that looks like the ticket. And your assumptions are correct - HTML parsers, um, no thank you. The input happens to be HTML, but it's very simple, fairly fixed format, and the problem could just as easily be expressed without HTML tags. And I'm hoping to wrap it all up in a map() (lots of data to iterate over) so it's much neater.
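A minimal sketch of the `map()` idea mentioned above, reusing broquaint's pattern as posted. The sample `@rows` and the hash layout of each record are assumptions made for the illustration; rows with no link are simply dropped.

```perl
use strict;
use warnings;

# broquaint's pattern: optional image first, then the link.
my $re = qr{
    (?: <img \s+ .*? src=" ([^"]+) " .*? > )?
    <a \s+ .*? href=" ([^"]+) " .*? >
}x;

# Made-up sample rows; the last one has no link at all.
my @rows = (
    '<td><img src="foo.jpg"><a href="index3.html">New index</a></td>',
    '<td><a href="index3.html">New index</a></td>',
    '<td>no link here</td>',
);

# One hash per matching row; img stays undef when there is no image.
my @records = map {
    my ($img, $href) = $_ =~ $re;
    defined $href ? { href => $href, img => $img } : ();
} @rows;

for my $rec (@records) {
    printf "href - %s\timg - %s\n",
        $rec->{href}, defined $rec->{img} ? $rec->{img} : '';
}
```

Taking the captures in pattern order and testing `$href` for definedness avoids the `grep defined, reverse` step, at the cost of carrying an explicit undef for the image.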

      Now, off to spend more time staring hard at the solution until its lessons burn themselves deep into my brain.

      thanks

      the hatter

Re: Regexp riddles
by Aristotle (Chancellor) on Jul 17, 2003 at 13:08 UTC
    Try HTML::LinkExtractor - should be less trouble than figuring out a regex that works.

    Makeshifts last the longest.

Re: Regexp riddles
by demerphq (Chancellor) on Jul 17, 2003 at 14:37 UTC

    I dunno, this didn't seem too difficult, so I have a feeling I've gone wrong here somewhere. But here's my go.

        use strict;
        use warnings;

        foreach ('<td><img src="foo.jpg">'.
                 '<a href="index3.html">New index</a></td>',
                 '<td><a href="index3.html">New index</a></td>')
        {
            if (/<td>(?:<img[ ]src="([^"]+)">)?
                 <a[ ]href="([^"]+)">((?:(?!<\/a>).)*)
                 <\/a>/six)
            {
                print "Matched!\tImg=", ($1 ? $1 : 'None'),
                      "\tLink: $2\t Link Text: $3\n";
            }
        }

    Sorry about the weird look of the code; it's mostly like that to fit average settings on the site.
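The `(?:(?!<\/a>).)*` piece in the pattern above consumes one character at a time for as long as the closing tag does not start at the current position. The sketch below contrasts it with a plain greedy `.*` on a made-up string holding two links; the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Made-up input with two links on one line.
my $two = '<a href="a.html">One</a> and <a href="b.html">Two</a>';

# Greedy .* runs to the LAST closing tag in the string.
my ($greedy) = $two =~ m{<a href="[^"]+">(.*)</a>}s;

# The negative lookahead stops the capture at the FIRST closing tag.
my ($tempered) = $two =~ m{<a href="[^"]+">((?:(?!</a>).)*)</a>}s;

print "greedy:   $greedy\n";
print "tempered: $tempered\n";
```

Here the greedy capture spans both links, while the lookahead version captures only `One`.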


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

Node Type: perlquestion [id://275196]
Approved by Corion