Strip HTML line breaks from list of URLs

on May 08, 2003 at 19:53 UTC
Hello fellow Perl people, I have a really quick question. I am doing the below:
@Old_URL = grep /href=/i, split(/[<\s>]+/, $input);
and in the output I am getting
href="/offices/OPA/bios.html"<br> href="/PressReleases/WhiteHouse.html"<br>
I want the br's to not be there. Is there a way to tweak the split to do this? If so, please let me know.

Re: Strip HTML line breaks from list of URLs
by diotalevi (Canon) on May 08, 2003 at 20:06 UTC

    Two ideas: just snip them off with substr: $_ = substr $_, 0, length() - 4 for @Old_URL. Or use a substitution: s{<br>}{} for @Old_URL.
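    A minimal sketch of both ideas side by side, assuming every element really does end in a literal <br> as in the output shown in the question. The sample data and the names @by_substr and @by_subst are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: each element ends in a literal "<br>".
my @Old_URL = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br>',
);

# Idea 1: drop the last four characters ("<br>") with substr.
my @by_substr = map { substr $_, 0, length($_) - 4 } @Old_URL;

# Idea 2: delete the first "<br>" found in each element.
my @by_subst = @Old_URL;
s{<br>}{} for @by_subst;

print "$_\n" for @by_subst;
```

    The substr version removes four characters whether or not they are "<br>", so it is only safe when the format is known; the substitution leaves elements without a <br> untouched.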

      And what if the HTML source changes to XHTML and the <br>'s become <br />?

        Then it breaks. I didn't even pretend that the regex as given would parse HTML. It just alters a string which happens to have some HTML of a known format in it.
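        If that change is a concern, a slightly more tolerant substitution is one option. This is only a sketch with made-up sample data; it still does not parse HTML, it just accepts a few more spellings of the same trailing tag.

```perl
use strict;
use warnings;

# Hypothetical sample data mixing HTML and XHTML spellings of the tag.
my @Old_URL = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br />',
    'href="/index.jsp"<BR/>',
);

# Remove a trailing <br>, <br/> or <br />, ignoring case.
# \z anchors the match to the very end of the string.
s{<br\s*/?>\z}{}i for @Old_URL;

print "$_\n" for @Old_URL;
```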

Re: Strip HTML line breaks from list of URLs
by cfreak (Chaplain) on May 08, 2003 at 20:29 UTC
Re: Strip HTML line breaks from list of URLs
by svsingh (Priest) on May 08, 2003 at 21:00 UTC
    Could we get a sample of the HTML you're parsing? I built my own $input string and ran it through your code. Everything came out fine. Thanks.
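    For reference, a test along these lines (the $input string here is invented, not the original poster's data) shows why the tags should not survive: < and > are themselves split delimiters, so a <br> is reduced to a bare "br" token that the grep then drops.

```perl
use strict;
use warnings;

# Hypothetical input: two links, each followed by a <br>.
my $input = '<a href="/offices/OPA/bios.html">Bios</a><br> '
          . '<a href="/PressReleases/WhiteHouse.html">Press</a><br>';

# The code from the question, unchanged.
my @Old_URL = grep /href=/i, split(/[<\s>]+/, $input);

print "$_\n" for @Old_URL;
```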
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <html lang="en">
      <head>
      <meta name="googlebot" content="noarchive" />
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
      <title>Paige Announces Award of $6.4 Million in Reading First Grant For Kansas Children</title>
      <link rel="stylesheet" type="text/css" href="/edgov.css" />
      <!--Start EDWeb Metadata - Do Not Modify Tags in Block Below-->
      <meta name="DC.title" content="Paige Announces Award of $6.4 Million in Reading First Grant For Kansas Children">
      <meta name="DC.subject" scheme="EDWeb" type="ED.program" content="Reading First" >
      <meta name="DC.language" scheme="ISO 639-2" content="en">
      <meta name="DC.description" content="Secretary Paige today announced that Kansas will receive more than $6.4 million for the first year of a multi-year Reading First grant.">
      <meta name="description" content="Secretary Paige today announced that Kansas will receive more than $6.4 million for the first year of a multi-year Reading First grant.">
      <meta name="DC.format" scheme="IMT" content="text/HTML">
      <meta name="DC.publisher" content="US Department of Education (ED)">
      <meta name="" scheme="EDWeb" content="Office of Public Affairs (OPA)">
      <meta name="" scheme="ISO 8601" content="2003-04-11">
      <meta name="DC.subject" scheme="EDWeb" type="descriptor" content="Reading; Grants">
      <meta name="keywords" content="Reading; Grants; ">
      <meta name="DC-ED.audience" scheme="EDWeb" content="News Media">
      <meta name="DC.type" scheme="EDWeb" content="Press Releases">
      <!--End EDWeb Metadata - Do Not Modify Tags in Block Above-->
      </head>
      <BODY bgColor=#ffffff leftMargin=0 topMargin=0 marginwidth="0" marginheight="0">
      <link rel="stylesheet" type="text/css" href="/edgov.css" />
      <table width="100%" border="0" cellspacing="0" cellpadding="0" summary="Table that holds all the top header">
      <tr><!--//Top right hand links//-->
      <td align="right" class="smallcontent_new" colspan="1"><a href="#content" class="small">Skip Navigation</a></td>
      <td colspan="2" align="right" class="smallcontent_new">
      <a href="/spanishresources.jsp" target="_top" class="small">Recursos en Espa&ntilde;ol</a>, <a href="/utilities/privacy.jsp" target="_top" class="small">Privacy, Security, Notices</a>&nbsp;</td>
      </tr>
      <tr>
      <td width="91" rowspan="2"><a href="/"><img src="/images/logo.gif" width="91" height="75" border="0" alt="U.S. Department of Education" /></a></td>
      <td width="100%" class="background1_new" align="center"><img src="/images/edheader_title.gif" width="385" height="54" border="0" usemap="#header" alt="" /><map name="header"><area alt="" coords="91,26,178,52" href="/personalize/display.jsp" target="_top" /></map></td>
      <td height="54" width="284" class="background1_new" align="right">
      <table width="284" border="0" cellpadding="0" cellspacing="0" bgcolor="#000066" summary="Table holds search and subnavigation">
      <tr>
      <td colspan="3" align="right" class="whitesmall"><a href="/about/welcome.jsp" target="_top" class="smallwhite_new"><b>About ED</b></a> | <a href="/topics/srchTopics.jsp" target="_top" class="smallwhite_new"><b>A-Z Index</b></a> | <a href="/utilities/siteMap.jsp" target="_top" class="smallwhite_new"><b>Site Map</b></a> | <a href="/utilities/contact.jsp" target="_top" class="smallwhite_new"><b>Contact Us</b></a>&nbsp;</td>
      </tr>
      <tr><form name="seek1" method="GET" accept-charset="iso-8859-1" action="/search/searchResList.jsp" target="_top">
      <input type=hidden name="st" value="0">
      <input type=hidden name="colParam" value="ED">
      <input type=hidden name="lk" value="1">
      <td class="white">&nbsp;<strong><label for="search">Search: </label></strong>&nbsp;<input type="text" name="qt" size="15" class="monospace" id="search" maxlength="1991">&nbsp;</td>
      <td><input type=image src="/images/go_b.gif" width="35" height="31" border="0" alt="GO"></td>
      <td>&nbsp;<a href="/search/advSearchForm.jsp" target="_top" class="smallwhite_new"><b>Advanced</b></a>&nbsp;</td></form>
      </tr>
      </table>
      </td>
      </tr>
      <tr>
      <td colspan="2" width="100%" background="/images/navbg.gif" align="center">
      <!--//Nested table that holds the navigation tabs//-->
      <table width="669" border="0" cellspacing="0" cellpadding="0" summary="All navigation between categories is in this table">
      <tr>
      <td><a href="/index.jsp" target="_top" OnMouseOver="c='/images/edhome_b1.gif'" OnMouseOut="document.edhome.src='/images/f'"><img src="/images/edhome_b0.gif" width="48" height="21" border="0" alt="Home" name="edhome" /></a></td>
      <td><a href="/audience/audience.jsp" target="_top" OnMouseOver="document.edaudience.src='/images/edaudience_b1.gif'" OnMouseOut="document.edaudience.src='/images/edaudience_b0.gif'"><img src="/images/edaudience_b0.gif" width="73" height="21" border="0" alt="Information for..." name="edaudience" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Grants+%26+Contracts" target="_top" OnMouseOver="document.edgrants.src='/images/edgrants_b1.gif'" OnMouseOut="document.edgrants.src='/images/edgrants_b0.gif'"><img src="/images/edgrants_b0.gif" width="126" height="21" border="0" alt="Grants and Contracts" name="edgrants" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Financial+Aid" target="_top" OnMouseOver="document.edfinancial.src='/images/edfinancial_b1.gif'" OnMouseOut="document.edfinancial.src='/images/edfinancial_b0.gif'"><img src="/images/edfinancial_b0.gif" width="94" height="21" border="0" alt="Financial Aid" name="edfinancial" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Education+Resources" target="_top" OnMouseOver="document.ededucation.src='/images/ededucation_b1.gif'" OnMouseOut="document.ededucation.src='/images/ededucation_b0.gif'"><img src="/images/ededucation_b0.gif" width="139" height="21" border="0" alt="Education Resources" name="ededucation" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Research+%26+Stats" target="_top" OnMouseOver="document.edresearch.src='/images/edresearch_b1.gif'" OnMouseOut="document.edresearch.src='/images/edresearch_b0.gif'"><img src="/images/edresearch_b0.gif" width="118" height="21" border="0" alt="Research and Stats" name="edresearch" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Policy" target="_top" OnMouseOver="document.edpolicy.src='/images/edpolicy_b1.gif'" OnMouseOut="document.edpolicy.src='/images/edpolicy_b0.gif'"><img src="/images/edpolicy_b0.gif" width="71" height="21" border="0" alt="Policy" name="edpolicy" /></a></td>
      </tr>
      </table><!--//End of nested navigation table//-->
      </td>
      </tr>
      <!--//Write the different subnavs based on directory//-->
      <tr>
        The input you posted contains no <br> tags.
Re: Strip HTML line breaks from list of URLs
by Llew_Llaw_Gyffes (Scribe) on May 09, 2003 at 00:30 UTC
    Without knowing what precisely you're trying to do overall, there's a certain amount of guesswork involved. But, that said, could you not simply do this?
    @Old_URL = grep /href=/i, split(/(<|>|\s)+/, $input);
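    My recollection was that you cannot use classes such as \w and \s inside a bracketed character class in a regex, but that recollection is flawed: Perl does allow them there, so this version splits on the same delimiters as your original. Note that the capturing parentheses also make split return the delimiters themselves, which the grep then filters out.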
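    For comparison, here is a minimal sketch that runs both split patterns over the same made-up $input. perlrecharclass documents \s as valid inside a bracketed character class, so the two should agree once the grep has run; the parenthesized version additionally returns the captured delimiters from split. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input for illustration only.
my $input = '<a href="/a.html">A</a><br> <a href="/b.html">B</a>';

# Raw split results: bracketed class vs. capturing alternation.
my @raw_class = split(/[<\s>]+/, $input);
my @raw_alt   = split(/(<|>|\s)+/, $input);

# After the grep, only the href tokens remain in either case.
my @with_class = grep /href=/i, @raw_class;
my @with_alt   = grep /href=/i, @raw_alt;

printf "class: %d fields, alternation: %d fields\n",
    scalar @raw_class, scalar @raw_alt;
print "$_\n" for @with_class;
```

    Writing the group as (?:<|>|\s)+ would avoid returning the delimiters at all, at which point it is just a longer spelling of [<\s>]+.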

