PerlMonks  

Strip PHP page

by bauer1sc (Initiate)
on Aug 06, 2007 at 17:29 UTC (#630861)

bauer1sc has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm currently trying to strip the contents of a PHP web page. I started out by printing the contents of the page I'm dealing with, which gives me:
<html> <head> <link rel="stylesheet" href="style.css" type="text/css"> <meta name="generator" content="Bluefish"> <meta http-equiv="content-type" content="text/html;charset=iso +-8859-1"> <title>Who's Registered - Find Companies Registered to ISO 900 +0, ISO 14000 and/or related sector-specific standards</title> <meta name="keywords" content="ISO 9000 ISO 9001:1994 ISO 9001 +:2000 ISO 9002:1994 9003:1994 AS 9000 TL 9000 ISO/TS TE Supplement AS + 9100 EN 46001 ISO 13485 RC 14001 ISO 14001:1996 ISO 14001:2004 OHSAS + 18001 ISO 14001"> <meta name="description" content="WhosRegistered.com Global is + the worlds largest free global listing of certified suppliers to ISO + 9000, ISO/TS 16949, TL 9000, AS9100 and ISO 14001 anywhere in the wo +rld."> </head> <body bgcolor="#ffffff" leftmargin="0" marginwidth="0" topmargin=" +0" marginheight="0" link="yellow" vlink="yellow"> <table border="0" cellpadding="3" cellspacing="0" width="1 +00%" align="center"> <tr height="120"> <!--<td colspan="2" valign="top" align="left" heig +ht="120" class="logoarea"><a href="http://www.whosregistered.com/"> < +img src="images/dartboard.gif" border="0"> </a></td>--> <td colspan="1" valign="top" align="left" height=" +120" class="logoarea"><a href="http://www.whosregistered.com/"> <img +src="images/dartboard.gif" border="0"> </a></td> <td colspan="1" valign="middle" align="center" hei +ght="120" class="logoarea"><a href="http://www.whosregistered.com/plu +gins/phpAdsNew/click.php?bannerID=7"><img src="http://www.whosregiste +red.com/plugins/phpAdsNew/viewbanner.php?bannerID=7" width=468 height +=60 alt="Who's Who in China" border=0></a> </td> <readmore> <td height="120" class="logoarea" align="right"><a + href="http://www.qsuonline.com/cart/DirectoriesSoftware.html#9KRCDca +rt" target="_blank"><img src="images/RCDad.jpg" border="0"></a></td> </tr> <tr height="25"> <td height="25" valign="top" align="left" width="1 +20" class="topthinline"><img src="images/corner.jpg" width="25" heigh +t="25" 
border="0" class="topthinline2"></td> <td width="190" height="25" class="topthinline"></ +td> <td height="25" class="topthinline" align="right"> <!--Number of records in the database: --></td> <!-- </tr> --> <tr> <td width="120" valign="top" align="center" class= +"menu"> <table border="0" cellpadding="6" cellspacing= +"0" width="120" class="smallfont" style="font-family: verdana,arial;" +> <tr> <td align="right" width="20"></td> <td><a href="http://www.whosregistered +.com/">Home</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php" target="_new">Using WhosRegistered.com Global</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/press.php" target="_new">News Releases</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.qsuonline.com/ +SubmissionInstructions/MainPage.html" target="_new">Information For R +egistrars/Certification Bodies</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#certs" target="_new">Management System Certificati +on</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#iso9000" target="_new">ISO 9000</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#iso14001" target="_new">ISO 14001</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#sector" target="_new">Sector Programs</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#monitor" target="_new">Supplier Monitoring</a></td +> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered 
+.com/iso/intro.php#feedback" target="_new">Supplier Feedback</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#submit" target="_new">Submitting Data</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#corrections" target="_new">Correcting Data</a></td +> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#accred" target="_new">Accreditation Marks</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#tips" target="_new">Tips for Purchasing Agents</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#roi" target="_new">Return on Investment Survey</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#registrars" target="_new">Find Registrars</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#journals" target="_new">Professional Journals</a>< +/td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#booksvideosoftware" target="_new">Books, Videos, S +oftware</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#minisearch" target="_new">Add WhosRegistered.com G +lobal to Your Website</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.qsuonline.com/ +BodyPages/Aboutus.html" target="_new">About QSU Publishing</a></td> </tr> <tr height="15"> <td width="20" height="15"></td> <td height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" 
height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right"> <td align="left" width="20"></td> <td align="right" valign="bottom"></td +> </tr> <tr align="right"> <td align="left" width="20"></td> <td align="left"></td> </tr> </table> <p></p> </td> <td rowspan="2" colspan="2" valign="top" align="le +ft" class="mainbox"><img src="images/corner2.jpg" width="25" height=" +25" border="0" class="corner2"><br> <div align="center" width="675"> <p><center><h3>Welcome to WhosRegistered.com G +lobal</h3></center></p> <table width="550"> <tr><td> <p class="mainbox">The worlds largest free glo +bal listing of certified suppliers to ISO 9000, ISO/TS 16949, TL 9000 +, AS9100 and ISO 14001 anywhere in the world. Search by company name, + location even products and services listed in the scope of certific +ation. 
WhosRegistered.com takes the hassle out of finding certified c +ompanies.</p> </td></tr> </table> <table cellpadding="0" cellspacing="0" + border="0" bgcolor="#FFFFFF" width="650"> <tr height="30"> <td height="30" colspan=5> </td> </tr> <tr height="30"> <td width="30" height="30"></td> <td width="30" height="30" class="searchbox"> <img src="images/search-topleft.jpg" width="30" height +="30" border="0"> </td> <td height="30" class="searchbox" valign="abstop" align="r +ight"> </td> <td width="30" height="30" valign="abstop" align="right" c +lass="searchbox"> <img src="images/search-topright.jpg" width="30" heigh +t="30" class="searchboximage"> </td> <td width="30" height="30"></td> </tr> <tr> <td width="30"></td> <td width="30" class="searchbox"> </td> <td valign="middle" class="searchbox"> <!-- stage 2 --><br> <div align="center">106558 records found </div><br> <div align="center">You are on page 1 out of 3552 +total pages<br>&nbsp;<a href="./form.php?Company=&city=&sp=&country=U +nited+States&certificate_number=&Scope=&registrar_secret=&begin=0&sta +ge=2"><<</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=-60&stage=2"><</a>&nbsp; <! 
start page 1 end page 11 -->1&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=30&stage=2">2</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=60&stage=2">3</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=90&stage=2">4</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=120&stage=2">5</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=150&stage=2">6</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=180&stage=2">7</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=210&stage=2">8</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=240&stage=2">9</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=270&stage=2">10</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=300&stage=2">11</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=-30&stage=2">></a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=106530&stage=2">>></a>&nbsp +; <table class="searchbox"> <tr> <td width="10"></td> <td width="20"></td> <td width="10"></td> <td width="100"><b>Company</b></td> <td width="10"></td> <td width="50"><b>City</b></td> <td width="10"></td> <td width="50"><b>State or Province</b></td> <td width="10"></td> <td 
width="30"><b>Country</b></td> <td width="10"></td> <td width="30"><b>Certificate Number</b></td> <td width="10"></td> ... (continues similarly until the end of the page, 30 entries)
I used HTML::Strip:

    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my $page = $pageCheck->content;
    my $clean_text = $hs->parse( $page );
    print $clean_text;

and this is my output:
Home Using WhosRegistered.com Global News Releases Information For Registrars/Certification Bodies Management System Certification ISO 9000 ISO 14001 Sector Programs Supplier Monitoring Supplier Feedback Submitting Data Correcting Data Accreditation Marks Tips for Purchasing Agents Return on Investment Survey Find Registrars Professional Journals Books, Videos, Software Add WhosRegistered.com Global to Your Website About QSU Publishing Welcome to WhosRegistered.com Global 106558 records found You are on page 1 out of 3552 total pages
It doesn't contain any of the entries I'm actually after. Any ideas on why this may be happening, or a possible solution? Thanks again, monks, for your help!

Replies are listed 'Best First'.
Re: Strip PHP page
by cengineer (Pilgrim) on Aug 06, 2007 at 18:00 UTC
    What are the entries that you ARE actually after??
      The entries appear on the site as a table:

           | Company | City | State or Province | Country | Certificate Number
      More | ZF Boge Elastmetall | Hebron | Kentucky | United States | CERT-03170-2004-AE-HOU-RAB
      More | YKK AP America, Inc. | Dublin | Georgia | United States | CERT-05094-2003-AE-HOU-RAB
      More | Yamaha Motor Mfg. Corp. of America | Newnan | Georgia | United States | CERT-03164-2003-AE-HOU-RAB
      More | Xycom Automation | Saline | Michigan | United States | CERT-07076-2004-AE-HOU-RAB
      More | Wiltech of Florida Corp., Inc. | Kennedy Space Center | Florida | United States | CERT-05042-2003-AE-HOU-RAB
      More | Weir Floway, Inc. | Fresno | California | United States | CERT-07585-2004-AE-HOU-ANABR1
      More | Weastec, Inc. | Greenfield | Ohio | United States | CERT-05010-2004-AE-HOU-ANAB, R1
      More | Weastec, Inc. | Hillsboro | Ohio | United States | CERT-03152-2004-AE-HOU-ANAB, R1
      More | Weastec, Inc. | Seaman | Ohio | United States | CERT-05011-2004-AE-HOU-ANABR1
      More | Wartsila North America | Ft.Lauderdale | Florida | United States | CERT-10118-2005-AE-HOU-ANAB
      More | Wartsila North America | Harvey | Louisiana | United States | CERT-06437-2004-AE-HOU-ANABR2
      More | Wabash Technologies - Huntington | Huntington | Indiana | United States | CERT-04595-2003-AE-HOU-RABR1
      More | VITRUS | Pawtucket | Rhode Island | United States | CERT-07357-2004-AE-HOU-RAB
      More | UT MD Anderson | Bastrop | Texas | United States | CERT-05001-2004-AE-HOU-RAB
      More | Vishay Siliconix | Santa Clara | California | United States | CERT-03259-2004-AE-HOU-RAB
      More | Veolia Water North America - Cerntral | Pontiac | Michigan | United States | CERT-05785-2003-AE-HOU-ANAB
      More | UT MD Anderson | Houston | Texas | United States | CERT-03592-2004-AE-HOU-RAB
      More | Tyco Electronics M/A-Com, Inc. | Lowell | Massachusetts | United States | CERT-04000-2005-AQ-HOU-ANAB
      More | Trigen/Cinergy Solutions | Lansing | Michigan | United States | CERT-04052-2005-AE-HOU-ANAB
      More | Trefilarbed Arkansas, Inc. | Pine Bluff | Arkansas | United States | CERT-10661-2005-AE-HOU-ANAB
      More | Transition Networks, Inc. | Eden Prairie | Minnesota | United States | CERT-02683-2003-AE-HOU-RAB
      More | Toyota North American Parts Center - KY | Hebron | Kentucky | United States | CERT-06550-2004-AE-HOU-RAB
      More | Trace Die Cast, Inc. | Bowling Green | Kentucky | United States | CERT-06234-2003-AE-HOU-RAB
      More | Toyota Motor Sales, U.S.A., Inc. | Ontario | California | United States | CERT-04245-2005-AE-HOU-ANAB
      More | Toyota Motor Sales, USA | West Caldwell | New Jersey | United States | CERT-06180-2003-AE-HOU-ANAB
      More | Toyota Motor Sales, U.S.A., Inc. | Torrance | California | United States | CERT-04246-2005-AE-HOU-ANAB
      More | Toyota Motor Sales USA, Inc. | San Ramon | California | United States | CERT-03294-2004-AE-HOU-RAB
      More | Toyota Motor Sales U.S.A., Inc. | Cincinnati | Ohio | United States | CERT-03419-2004-AE-HOU-ANAB, R1
      More | Toyota Motor Sales U.S.A., Inc. | Mansfield | Massachusetts | United States | CERT-02611-2003-AE-HOU-RAB
      More | Toyota Motor Sales U.S.A., Inc. | Aurora | Illinois | United States | CERT-06876-2004-AE-HOU-RAB
        Is the table you speak of in a frame (or an iframe), or written out by JavaScript (document.write or such)? Since you didn't give us the URL or the complete HTML (in readmore tags), it is difficult to help. I am guessing you are using WWW::Mechanize to get this page, but since you did not tell us exactly what you are doing, you have not made it easy for people to help you. See the PerlMonks FAQ and How do I post a question effectively?
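
        One way to answer the frame/iframe/JavaScript question is to scan the HTML the script actually received. The following is only a sketch: the helper name indirect_content_tags and the sample markup are made up for illustration, and it assumes HTML::TokeParser (from the HTML-Parser distribution) is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Collect the tags that usually mean "the real content lives elsewhere":
# frames and iframes load a second document, and scripts may write the
# table into the page after it has loaded.
# (indirect_content_tags is a hypothetical helper name.)
sub indirect_content_tags {
    my ($html) = @_;
    my %seen = ( frame => [], iframe => [], script => [] );
    my $p = HTML::TokeParser->new( \$html ) or die "cannot parse HTML";
    while ( my $tag = $p->get_tag( 'frame', 'iframe', 'script' ) ) {
        my ( $name, $attr ) = @$tag;
        # Record the src attribute, or a placeholder for inline scripts.
        push @{ $seen{$name} }, defined $attr->{src} ? $attr->{src} : '(inline)';
    }
    return \%seen;
}

# Invented sample markup, not taken from the site in question.
my $sample = <<'HTML';
<html><body>
<iframe src="results.php"></iframe>
<script>document.write("<table></table>");</script>
</body></html>
HTML

my $found = indirect_content_tags($sample);
for my $name ( sort keys %$found ) {
    print "$name: @{ $found->{$name} }\n";
}
```

        If any of the three lists is non-empty, the rows may come from a second request (the frame's src) or from script output, neither of which a plain tag stripper will see.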

        Martin
Re: Strip PHP page
by Popcorn Dave (Abbot) on Aug 07, 2007 at 04:01 UTC
    You might also look at HTML::TokeParser to get at the information you're after. I used it to strip headlines from newspaper sites a while back and it worked well for me.
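
    A minimal sketch of how HTML::TokeParser could pull the cells out of a results table, once the fetched page really contains the rows. The helper name table_rows and the sample markup are assumptions (the thread never shows the HTML of the result rows), and the loop assumes a flat table: with nested tables like the ones in the posted page, the text of an outer cell would need extra handling.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return every table row as an array ref of trimmed, non-empty cell texts.
# (table_rows is a hypothetical helper name; assumes no nested tables.)
sub table_rows {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "cannot parse HTML";
    my ( @rows, $row );
    while ( my $tag = $p->get_tag( 'tr', 'td', '/tr' ) ) {
        my $name = $tag->[0];
        if ( $name eq 'tr' ) {
            $row = [];    # start collecting a new row
        }
        elsif ( $name eq 'td' && $row ) {
            # Text up to the closing </td>, with whitespace collapsed.
            my $text = $p->get_trimmed_text('/td');
            push @$row, $text if length $text;    # skip spacer cells
        }
        elsif ( $name eq '/tr' && $row ) {
            push @rows, $row if @$row;
            undef $row;
        }
    }
    return @rows;
}

# Invented markup wrapped around two rows quoted in the thread.
my $sample = <<'HTML';
<table>
<tr><td></td><td><a href="#">More</a></td><td>ZF Boge Elastmetall</td>
    <td>Hebron</td><td>Kentucky</td><td>United States</td>
    <td>CERT-03170-2004-AE-HOU-RAB</td></tr>
<tr><td></td><td><a href="#">More</a></td><td>YKK AP America, Inc.</td>
    <td>Dublin</td><td>Georgia</td><td>United States</td>
    <td>CERT-05094-2003-AE-HOU-RAB</td></tr>
</table>
HTML

print join( ' | ', @$_ ), "\n" for table_rows($sample);
```

    Unlike HTML::Strip, this keeps the cell boundaries, so company, city and certificate number stay separate fields instead of running together.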


    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

    I would love to change the world, but they won't give me the source code

Re: Strip PHP page
by greywolf (Priest) on Aug 06, 2007 at 21:18 UTC
    My first thought is that your problem comes from trying to get the results of a submitted form, not the page in question. If you pass that address to your script, you'll likely get the source page from before the form is submitted. Check the HTML source that your script is receiving to make sure the results you're after are actually in there. It looks like your code stripping is working fairly well.
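
    The two checks above can be sketched roughly as follows. The field names and the step of 30 per page are copied from the pagination links in the posted HTML; everything else is an assumption: the helper names form_encode, results_url and has_results are made up, and the real location of form.php is not given in the thread, so it stays a parameter (the example.invalid address is a placeholder).

```perl
use strict;
use warnings;

# Encode one query-string value the way the pagination links do:
# spaces become "+", other unsafe bytes become %XX.
sub form_encode {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9_.~ -])/sprintf '%%%02X', ord $1/ge;
    $value =~ tr/ /+/;
    return $value;
}

# Build the address of one results page. $base is wherever form.php
# lives (unknown here, hence a parameter).
sub results_url {
    my ( $base, $country, $page ) = @_;
    my @fields = (
        Company            => '',
        city               => '',
        sp                 => '',
        country            => $country,
        certificate_number => '',
        Scope              => '',
        registrar_secret   => '',
        begin              => 30 * ( $page - 1 ),    # 30 entries per page
        stage              => 2,
    );
    my @pairs;
    while ( my ( $name, $value ) = splice @fields, 0, 2 ) {
        push @pairs, "$name=" . form_encode($value);
    }
    return "$base?" . join '&', @pairs;
}

# True when the fetched HTML contains at least one certificate number of
# the shape quoted in the thread, e.g. CERT-03170-2004-AE-HOU-RAB.
sub has_results {
    my ($html) = @_;
    return $html =~ /CERT-\d{5}-\d{4}-/ ? 1 : 0;
}

# Placeholder base address; substitute the real one.
print results_url( 'http://example.invalid/form.php', 'United States', 2 ), "\n";
```

    Running has_results on the content before handing it to HTML::Strip shows quickly whether the rows were ever in the response; the output posted in the question would fail that check.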

    mr greywolf
