Strip HTML line breaks from list of URLs

by Anonymous Monk
Hello fellow Perl people, I have a really quick question. I am doing the below:
@Old_URL = grep /href=/i, split(/[<\s>]+/, $input);
and in the output I am getting
href="/offices/OPA/bios.html"<br> href="/PressReleases/WhiteHouse.html"<br>
I want the br's to be gone. Is there a way to tweak the split to do this? If so, please let me know.

Re: Strip HTML line breaks from list of URLs
by diotalevi (Canon) on May 08, 2003 at 20:06 UTC

    Two ideas: just snip them off with substr: $_ = substr $_, 0, length() - 4 for @Old_URL. Or use a substitution: s{<br>}{} for @Old_URL.
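
A minimal sketch of both ideas side by side, assuming each element of @Old_URL really does end in a literal "<br>" as in the output quoted in the question. The copies @by_substr and @by_subst are names made up here so the two approaches can be compared.

```perl
use strict;
use warnings;

# Sample data modelled on the output shown in the question.
my @Old_URL = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br>',
);

# Idea 1: chop a fixed four characters ("<br>") off the end of each element.
# This is only safe if every element ends in exactly those four characters.
my @by_substr = @Old_URL;
$_ = substr $_, 0, length() - 4 for @by_substr;

# Idea 2: delete the first literal "<br>" found in each element.
# Elements without a "<br>" are left untouched.
my @by_subst = @Old_URL;
s{<br>}{} for @by_subst;

print "$_\n" for @by_subst;
```

The substitution is the more forgiving of the two: an element that lacks the trailing tag keeps its last four characters, whereas the substr version would cut them off.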

      And what if the HTML source changes to XHTML and the <br>'s become <br />?

        Then it breaks. I didn't even pretend that the regex as given would parse HTML. It just alters a string which happens to have some HTML of a known format in it.
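
If the XHTML spelling is a concern, the substitution can be loosened a little. The following is a sketch only, not a replacement for a real HTML parser: it assumes the tag appears as <br>, <br/> or <br /> in any letter case, and the third sample element (/index.html) is invented for illustration.

```perl
use strict;
use warnings;

# Made-up sample covering several spellings of the line-break tag.
my @urls = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br />',
    'href="/index.html"<BR/>',
);

# Remove <br>, <br/> and <br /> (optional whitespace, optional slash),
# case-insensitively, wherever they occur in each element.
s{<br\s*/?>}{}gi for @urls;

print "$_\n" for @urls;
```

This still only alters strings of a known shape; a tag carrying attributes, for example, would slip through.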

Re: Strip HTML line breaks from list of URLs
by cfreak (Chaplain) on May 08, 2003 at 20:29 UTC
Re: Strip HTML line breaks from list of URLs
by svsingh (Priest) on May 08, 2003 at 21:00 UTC
    Could we get a sample of the HTML you're parsing? I built my own $input string and ran it through your code. Everything came out fine. Thanks.
        The input you posted contains no <br> tags.
Re: Strip HTML line breaks from list of URLs
by Llew_Llaw_Gyffes (Beadle) on May 09, 2003 at 00:30 UTC
    Without knowing what precisely you're trying to do overall, there's a certain amount of guesswork involved. But, that said, could you not simply do this?
    @Old_URL = grep /href=/i, split(/(<|>|\s)+/, $input);
    My recollection, which may be flawed, is that you cannot use classes such as \w and \s in an enumerated character class in a regex.
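
For comparison, a small check of the two split patterns from this thread. Perl does accept \s (and \w, \d) inside a bracketed character class, so the two forms should split the same way; the $input string below is a hypothetical sample, since the original poster's input was not shown.

```perl
use strict;
use warnings;

# Hypothetical input, made up for illustration.
my $input = qq{<a href="/offices/OPA/bios.html">Bios</a><br>\n}
          . qq{<a href="/PressReleases/WhiteHouse.html">News</a><br>\n};

# Bracketed class from the original question: \s is allowed inside [...].
my @bracketed = grep /href=/i, split(/[<\s>]+/, $input);

# Alternation form from the reply above. The capturing parentheses make
# split return the separators as extra fields, but none of them contains
# "href=", so grep discards them again.
my @capturing = grep /href=/i, split(/(<|>|\s)+/, $input);

print "$_\n" for @bracketed;
print "same result\n" if "@bracketed" eq "@capturing";
```

With real <br> tags in the input, both patterns split the tag apart, so no "<br>" survives in either list.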

