PerlMonks  

Strip HTML line breaks from list of URLs

by Anonymous Monk
on May 08, 2003 at 19:53 UTC (#256656=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Perl people, I have a quick question. I am doing the following:
@Old_URL = grep /href=/i, split(/[<\s>]+/, $input);
and in the output I am getting:
href="/offices/OPA/bios.html"<br> href="/PressReleases/WhiteHouse.html"<br>
I don't want the <br>s in there. Is there a way to tweak the split to do this? If so, please let me know.

2003-05-08 edit ybiC: retitle from "take a look plz"

Re: Strip HTML line breaks from list of URLs
by diotalevi (Canon) on May 08, 2003 at 20:06 UTC

    Two ideas: just snip them off with substr: $_ = substr $_, 0, length() - 4 for @Old_URL. Or use a substitution: s{<br>}{} for @Old_URL.
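A minimal sketch of both ideas side by side, assuming the array holds exactly the two strings shown in the question (the copies @by_substr and @by_subst are names made up for this illustration):

```perl
use strict;
use warnings;

# Sample data copied from the output shown in the question.
my @Old_URL = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br>',
);

# Idea 1: chop the last four characters off every element.
# This assumes every element really does end in "<br>".
my @by_substr = @Old_URL;
$_ = substr $_, 0, length() - 4 for @by_substr;

# Idea 2: delete the first literal "<br>" found in every element.
# Elements without one are left untouched.
my @by_subst = @Old_URL;
s{<br>}{} for @by_subst;

print "$_\n" for @by_substr;
print "$_\n" for @by_subst;
```

The substitution is probably the safer of the two, since substr removes four characters whether or not they are actually "<br>".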

      And what if the HTML source changes to XHTML and the <br>s become <br />?

        Then it breaks. I didn't even pretend that the regex as given would parse HTML. It just alters a string which happens to have some HTML of a known format in it.
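If the <br /> form is a real concern, one possible tweak is a slightly more tolerant substitution. This is only a sketch: it is still string munging rather than HTML parsing, and the sample strings below are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data mixing the tag spellings discussed above.
my @Old_URL = (
    'href="/offices/OPA/bios.html"<br>',
    'href="/PressReleases/WhiteHouse.html"<br />',
    'href="/index.jsp"<BR/>',
    'href="/about/welcome.jsp"',
);

# Remove every <br>, <br/> or <br /> tag, in any letter case.
# \s* allows optional whitespace before an optional slash.
s{<br\s*/?>}{}gi for @Old_URL;

print "$_\n" for @Old_URL;
```

Anything beyond these spellings, such as a <br> tag carrying attributes, would still slip through, which is the point made above about known formats.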

Re: Strip HTML line breaks from list of URLs
by cfreak (Chaplain) on May 08, 2003 at 20:29 UTC
Re: Strip HTML line breaks from list of URLs
by svsingh (Priest) on May 08, 2003 at 21:00 UTC
    Could we get a sample of the HTML you're parsing? I built my own $input string and ran it through your code. Everything came out fine. Thanks.
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <html lang="en">
      <head>
      <meta name="googlebot" content="noarchive" />
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
      <title>Paige Announces Award of $6.4 Million in Reading First Grant For Kansas Children</title>
      <link rel="stylesheet" type="text/css" href="/edgov.css" />
      <!--Start EDWeb Metadata - Do Not Modify Tags in Block Below-->
      <meta name="DC.title" content="Paige Announces Award of $6.4 Million in Reading First Grant For Kansas Children">
      <meta name="DC.subject" scheme="EDWeb" type="ED.program" content="Reading First" >
      <meta name="DC.language" scheme="ISO 639-2" content="en">
      <meta name="DC.description" content="Secretary Paige today announced that Kansas will receive more than $6.4 million for the first year of a multi-year Reading First grant.">
      <meta name="description" content="Secretary Paige today announced that Kansas will receive more than $6.4 million for the first year of a multi-year Reading First grant.">
      <meta name="DC.format" scheme="IMT" content="text/HTML">
      <meta name="DC.publisher" content="US Department of Education (ED)">
      <meta name="ED.office" scheme="EDWeb" content="Office of Public Affairs (OPA)">
      <meta name="DC.date.valid" scheme="ISO 8601" content="2003-04-11">
      <meta name="DC.subject" scheme="EDWeb" type="descriptor" content="Reading; Grants">
      <meta name="keywords" content="Reading; Grants; ">
      <meta name="DC-ED.audience" scheme="EDWeb" content="News Media">
      <meta name="DC.type" scheme="EDWeb" content="Press Releases">
      <!--End EDWeb Metadata - Do Not Modify Tags in Block Above-->
      </head>
      <BODY bgColor=#ffffff leftMargin=0 topMargin=0 marginwidth="0" marginheight="0">
      <link rel="stylesheet" type="text/css" href="/edgov.css" />
      <table width="100%" border="0" cellspacing="0" cellpadding="0" summary="Table that holds all the top header">
      <tr><!--//Top right hand links//-->
      <td align="right" class="smallcontent_new" colspan="1"><a href="#content" class="small">Skip Navigation</a></td>
      <td colspan="2" align="right" class="smallcontent_new">
      <a href="/spanishresources.jsp" target="_top" class="small">Recursos en Espa&ntilde;ol</a>, <a href="/utilities/privacy.jsp" target="_top" class="small">Privacy, Security, Notices</a>&nbsp;</td>
      </tr>
      <tr>
      <td width="91" rowspan="2"><a href="/"><img src="/images/logo.gif" width="91" height="75" border="0" alt="U.S. Department of Education" /></a></td>
      <td width="100%" class="background1_new" align="center"><img src="/images/edheader_title.gif" width="385" height="54" border="0" usemap="#header" alt="" /><map name="header"><area alt="My.ED.gov" coords="91,26,178,52" href="/personalize/display.jsp" target="_top" /></map></td>
      <td height="54" width="284" class="background1_new" align="right">
      <table width="284" border="0" cellpadding="0" cellspacing="0" bgcolor="#000066" summary="Table holds search and subnavigation">
      <tr>
      <td colspan="3" align="right" class="whitesmall"><a href="/about/welcome.jsp" target="_top" class="smallwhite_new"><b>About ED</b></a> | <a href="/topics/srchTopics.jsp" target="_top" class="smallwhite_new"><b>A-Z Index</b></a> | <a href="/utilities/siteMap.jsp" target="_top" class="smallwhite_new"><b>Site Map</b></a> | <a href="/utilities/contact.jsp" target="_top" class="smallwhite_new"><b>Contact Us</b></a>&nbsp;</td>
      </tr>
      <tr><form name="seek1" method="GET" accept-charset="iso-8859-1" action="/search/searchResList.jsp" target="_top">
      <input type=hidden name="st" value="0">
      <input type=hidden name="colParam" value="ED">
      <input type=hidden name="lk" value="1">
      <td class="white">&nbsp;<strong><label for="search">Search: </label></strong>&nbsp;<input type="text" name="qt" size="15" class="monospace" id="search" maxlength="1991">&nbsp;</td>
      <td><input type=image src="/images/go_b.gif" width="35" height="31" border="0" alt="GO"></td>
      <td>&nbsp;<a href="/search/advSearchForm.jsp" target="_top" class="smallwhite_new"><b>Advanced</b></a>&nbsp;</td></form>
      </tr>
      </table>
      </td>
      </tr>
      <tr>
      <td colspan="2" width="100%" background="/images/navbg.gif" align="center">
      <!--//Nested table that holds the navigation tabs//-->
      <table width="669" border="0" cellspacing="0" cellpadding="0" summary="All navigation between categories is in this table">
      <tr>
      <td><a href="/index.jsp" target="_top" OnMouseOver="document.edhome.src='/images/edhome_b1.gif'" OnMouseOut="document.edhome.src='/images/edhome_b0.gif'"><img src="/images/edhome_b0.gif" width="48" height="21" border="0" alt="Home" name="edhome" /></a></td>
      <td><a href="/audience/audience.jsp" target="_top" OnMouseOver="document.edaudience.src='/images/edaudience_b1.gif'" OnMouseOut="document.edaudience.src='/images/edaudience_b0.gif'"><img src="/images/edaudience_b0.gif" width="73" height="21" border="0" alt="Information for..." name="edaudience" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Grants+%26+Contracts" target="_top" OnMouseOver="document.edgrants.src='/images/edgrants_b1.gif'" OnMouseOut="document.edgrants.src='/images/edgrants_b0.gif'"><img src="/images/edgrants_b0.gif" width="126" height="21" border="0" alt="Grants and Contracts" name="edgrants" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Financial+Aid" target="_top" OnMouseOver="document.edfinancial.src='/images/edfinancial_b1.gif'" OnMouseOut="document.edfinancial.src='/images/edfinancial_b0.gif'"><img src="/images/edfinancial_b0.gif" width="94" height="21" border="0" alt="Financial Aid" name="edfinancial" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Education+Resources" target="_top" OnMouseOver="document.ededucation.src='/images/ededucation_b1.gif'" OnMouseOut="document.ededucation.src='/images/ededucation_b0.gif'"><img src="/images/ededucation_b0.gif" width="139" height="21" border="0" alt="Education Resources" name="ededucation" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Research+%26+Stats" target="_top" OnMouseOver="document.edresearch.src='/images/edresearch_b1.gif'" OnMouseOut="document.edresearch.src='/images/edresearch_b0.gif'"><img src="/images/edresearch_b0.gif" width="118" height="21" border="0" alt="Research and Stats" name="edresearch" /></a></td>
      <td><a href="/topics/topics.jsp?&top=Policy" target="_top" OnMouseOver="document.edpolicy.src='/images/edpolicy_b1.gif'" OnMouseOut="document.edpolicy.src='/images/edpolicy_b0.gif'"><img src="/images/edpolicy_b0.gif" width="71" height="21" border="0" alt="Policy" name="edpolicy" /></a></td>
      </tr>
      </table><!--//End of nested navigation table//-->
      </td>
      </tr>
      <!--//Write the different subnavs based on directory//-->
      <tr>
        The input you posted contains no <br> tags.
Re: Strip HTML line breaks from list of URLs
by Llew_Llaw_Gyffes (Beadle) on May 09, 2003 at 00:30 UTC
    Without knowing what precisely you're trying to do overall, there's a certain amount of guesswork involved. But, that said, could you not simply do this?
    @Old_URL = grep /href=/i, split(/(<|>|\s)+/, $input);
    My recollection, which may be flawed, is that you cannot use classes such as \w and \s in an enumerated character class in a regex.
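For what it's worth, that recollection can be checked with a short script. As far as I know, Perl does accept \s (and \w, \d) inside a bracketed character class, so the two patterns should pick out the same links. The $input string here is a made-up two-link sample, not the poster's page.

```perl
use strict;
use warnings;

# Hypothetical input: two links, each followed by a <br>.
my $input = qq{<a href="/offices/OPA/bios.html">Bios</a><br>\n}
          . qq{<a href="/PressReleases/WhiteHouse.html">News</a><br>};

# Original pattern: bracketed class containing <, whitespace and >.
my @with_class = grep /href=/i, split /[<\s>]+/, $input;

# Suggested pattern: alternation inside capturing parentheses.
my @with_group = grep /href=/i, split /(<|>|\s)+/, $input;

# Without the grep, the capturing version also returns the
# delimiters it matched, so its raw list is longer.
my @raw_class = split /[<\s>]+/, $input;
my @raw_group = split /(<|>|\s)+/, $input;

print "class: $_\n" for @with_class;
print "group: $_\n" for @with_group;
printf "raw fields: %d (class) vs %d (group)\n",
    scalar @raw_class, scalar @raw_group;
```

On this sample neither version leaves a "<br>" attached to the href fields, because < and > are themselves delimiters. The only practical difference is that the capturing parentheses make split return the separators as extra fields, which the grep then throws away; (?:<|>|\s)+ would avoid that.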
