How to find an xpath from an web page

by perladdict (Chaplain)
on Nov 04, 2009 at 09:06 UTC ( #804880=perlquestion )
perladdict has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, Yesterday I did not provide enough details about this. I am using Selenium with Python scripting, and Selenium uses XPath to identify the elements. I have written a few Python functions to validate the Selenium commands that use XPath. Using HTML::LinkExtor, I ran a web crawling tool to extract the text links as well as the image links, but it does not give the XPath. After crawling the links, I put that data in a CSV file. In the script I read each line of the CSV and pass it to the appropriate function.
Below is the HTML file from which I have to extract the XPath of each link.

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Cache-Control" content="must-revalidate"/>
<meta http-equiv="Cache-Control" content="no-cache"/>
<meta http-equiv="Cache-Control" content="post-check=0, pre-check=0"/>
<meta http-equiv="Expires" content="Mon 26 Jul 1997 05:00:00 GMT"/>
<meta name="description" content="Browse AOL.com with AOL Mobile."/>
<meta name="keywords" content="aol, aol.com, internet, aol home page, aol mobile"/>
<title>AOL.com, portal, Welcome to AOL</title>
<style type="text/css">
a{border:0;text-decoration:none;color:#0064a9;font-family:Arial, Helvetica, Sans Serif}
img{border:0;vertical-align:middle}form{padding:0px;margin:0px;border:0px}body{font-family:Arial, Helvetica, Sans Serif;overflow-x:hidden}
.portal_border{ padding:0px;border-top:solid 1px #b5b1ae}.portal_logo_img{float:left;margin-right:2px}.portal_thin_border{ padding:0px;background-color:#e0e0e0}.portal_border_bot{ padding:0px;border-bottom:solid 1px #b5b1ae}
.portal_logo_aolimg{text-align:right;float:right;}
.logobar{background-color:#efefef;border-bottom:solid #bfb1ae 2px;border-top:solid #bfb1ae 1px;padding-top:2px;padding-bottom:2px;margin-top:2px;}
.weather_icon{float:left;width:145px}.weather_bar{font-size:11px;color:#0064a9;width:100%;border-bottom:solid 1px #b5b1ae;margin-bottom:7px}
.tb_weather{font-size:11px;color:#0064a9}.tb_login{font-weight:bold;color:#0064a9;text-align:right;font-size:11px;}.tb_border{border-bottom:solid 1px #b5b1ae;margin-bottom:7px}
.dlnews{font-size:12px}.dlnews_more{font-size:12px;padding-bottom:7px}.dlnews_link{font-weight:bold;color:#0064a9;}.featured{color:#0064a9;background-color:#f1f1f1;font-size:12px;font-weight:bold;padding:2px;width:100%;height:30px;}.fttable{background-color: #f1f1f1;width:100%;font-size:12px;cellpadding:3px}
.login_128{font-size:12px;font-weight:bold;color:#0064a9;text-align:left}.featleft{font-size:12px;align:left;font-weight:normal;width:50%;padding:4px}.featright{font-size:12px;align:left;font-weight:normal;width:50%;padding:4px}
.featured_first{background-color:#f1f1f1;font-size:12px;font-weight:bold;width:100%;padding-bottom:4px;border-top:solid 2px #b5b1ae;padding-top:7px}
.features_low{font-size:11px;color:#004a9;background-color:#f1f1f1;font-weight:normal}.mod_title_link{font-size:12px;font-weight:bold;color:#0064a9;}.mod_img{border:0;float:left;padding:5px}.mod_txt{font-size:12px;color:#000000;padding:5px;padding-left:0px}
.mod_title{font-size:12px;font-weight:bold;color:#000000;}
div.apollo_footer{background-color:#efefef;text-align:center;padding:0px;margin:0px;border-top:solid 1px #b5b1ae;padding-top:7px}div.apollo_footer a{text-decoration:none;color:#0064a9;font-size:11px;padding:2px}
.ft_copyright{font-size:11px;color:black}
.atoz{font-size:16px;color:black;padding-left:3px}
div.atoz_jumplinks{font-size:11px;text-decoration:none;margin-top:8px}div.atoz_jumplinks a{padding:3px;color:#0064a9}
.atoz_sep{padding:0px;background-color:#808080}.atoz_downloads{background-color:#fff7f0;font-size:12px}.atoz_sec{align:left;margin-left:4px;width:145px}
div.feedback{font-size:14px;padding:2px;color:#000000}
.fb_tit1{font-size:12px;font-weight:bold;padding-top:7px}
.fb_tit2{font-size:11px;font-weight:bold}td{font-size:11px}
.fb_btn{ border:1px solid black;background-color:#488AC7;font-weight:bold;margin:3px;font-size:12px;font-size:x-small}div.feedback form{padding:0px;margin:0px;}
.featured_more{font-size:12px;font-weight:bold;color:#0064a9;margin-bottom:10px;text-align:right;border-bottom:solid 1px #b5b1ae;background-color:#f1f1f1}.srch_border{border-bottom:solid 2px #b5b1ae;padding-bottom:7px}
.newonaol_border{border-top:solid 1px #b5b1ae;padding-top:7px;margin-top:7px;font-size:12px;font-weight:bold;color:#000000}.horo_title_link{font-size:12px;font-weight:bold;color:#0064a9;border-top:solid 1px #b5b1ae;padding-top:7px;margin-top:7px;}
.login{font-weight:bold;color:#0064a9;width:145px;float:left;text-align:right}
</style>
<style type="text/css">
p {display:inline}
</style>
</head>
<body>
<div align="center">
<p style="text-align:center; mode:nowrap"><a href="http://ad.m-adx.com/redirect/?tc=&amp;as=75088&amp;us=true&amp;ss=2c447d54-42ae-4a12-af9e-ae36bf18ef3f&amp;by=389400&amp;zp=&amp;ft=xhtml_markup&amp;tk=&amp;cv=389378&amp;ti=&amp;lk=22611&amp;ry=click"><img src="http://image.m-adx.com/smartimage/75087/389371/sportsfans_215x34_7.21.09.gif?ti=&amp;cb=1257233428763" alt="FanHouse" /></a><br/><a href="http://ad.m-adx.com/redirect/?tc=&amp;as=75088&amp;us=true&amp;ss=2c447d54-42ae-4a12-af9e-ae36bf18ef3f&amp;by=389400&amp;zp=&amp;ft=xhtml_markup&amp;tk=&amp;cv=389378&amp;ti=&amp;lk=22611&amp;ry=click">Click Here!</a></p>
</div>
<table width="100%" class="logobar" cellspacing="0" cellpadding="0" border="0"><tr><td width="70%"><a href="/mail">
<img src="/images/orion/320/apollo/icon_mail.jpg" width="21px" height="21px" alt="mail" /></a>
<a href="/portal/menu.do?id=5704&amp;carrier=1000"><img src="/images/orion/320/apollo/icon_aim.jpg" width="21px" height="21px"/></a>
</td><td style="text-align:right;" width="30%"><img src="/images/orion/320/apollo/logo_aol.gif" width="52px" height="17px" alt="logo" /></td></tr>
<tr><td width="100%" colspan="2">
<form name="searchform" method="get" action="/portal/../portal/search.do"><table width="100%" cellspacing="0" cellpadding="0" border="0"><tr>
<td style="padding-top:5px" width="100%" border="0"><input type="text" name="query" maxlength= "255" size="27" style="padding:0px;margin:0px;width:70%;vertical-align:middle"/>
<input type="image" src="/images/orion/320/apollo/search.gif" style="width:46px;height:19px;padding:0px;margin-left:-4px;border:0px;vertical-align:middle"/></td></tr></table><input name="invocationType" value="centersearchbox.waphome" type="hidden"/></form>
</td></tr>
</table>
<table class="tb_border" width="100%"><tr><td class="tb_weather" width="50%" valign="middle"><img src="/images/weather/weather_default.gif" width="18px" height="18px" style="float:left"/>
<a href="/portal/../cityguide/jsp/mylocation.jsp?returl=%2Fportal%2F">Get Weather</a></td>
<td class="tb_login" align="right" valign="middle" width="50%"><a href="http://wap.aol.com/auth/jsp/login.jsp?src=portal&returl=http%3A%2F%2Fwap.aol.com%2Fportal%2F&canurl=http%3A%2F%2Fwap.aol.com%2Fportal%2F">Sign In</a></td>
</tr></table>
<div style="clear:both"></div>
<div class="dlnews"><b>News</b></div>
<div class="dlnews"><img src="http://o.aolcdn.com/dims-global/dims3/MWAP/resize/62x46/format/jpg/http://portal.aolcdn.com/p/images4/1-ws-04-missing-baby-160jc110209.jpg" style="border-right: 5px solid white;float:left;width:46" alt="News"/><span><a href="http://wap.aol.com/news/dlDetails.do?source=portal&link=http://news.aol.com/article/baby-shannon-dedrick-reported-missing/748492" class="dlnews_link">Infant Vanishes From Her Home</a><br/>Parents Reported Her Missing Late Saturday Morning: Focus of Search</span></div>
<div style="clear:both" ></div><div class="dlnews_more" align="right">Story 1 of 4 <a href="/portal/../portal/dlmorenews.do"><b>More >></b></a></div>
<table class="fttable" width="100%"><tr><td colspan="2" width="100%" class="featured_first">Featured</td></tr>
<tr height="30px"><td width="50%" class="featleft"><div><a href="/portal/../moviefone/" ><img src="/images/orion/320/apollo/icon_mf.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>Moviefone</a></div></td>
<td class="featright" width="50%"><div><a href="/portal/../shopping/"><img src="/images/orion/320/apollo/icon_shopping.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>Shopping</a></div></td></tr>
<tr height="30px"><td width="50%" class="featleft"><div><a href="/portal/../cityguide/" ><img src="/images/orion/320/apollo/icon_cg.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>CityGuide</a></div></td>
<td class="featright" width="50%"><div><a href="http://wap.mapquest.com"><img src="/images/orion/320/apollo/icon_mq.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>MapQuest</a></div></td></tr>
<tr height="30px"><td width="50%" class="featleft"><div><a href="/portal/../news/" ><img src="/images/orion/320/apollo/icon_news.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>News</a></div></td>
<td class="featright" width="50%"><div><a href="/portal/../fanhouse/"><img src="/images/orion/320/apollo/icon_sports.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>FanHouse</a></div></td></tr>
<tr height="30px"><td width="50%" class="featleft"><div><a href="/portal/../portal/jsp/redirect.jsp?url=http://www.mocospace.com/wap/partners/aol/index.jsp&linkname=mocospace" ><img src="/images/orion/320/apollo/icon_mocospace.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>Mocospace</a></div></td>
<td class="featright" width="50%"><div><a href="/mail"><img src="/images/orion/320/apollo/icon_mail1.jpg" style="padding-right:4px;border:0px;" width="18px" height="18px"/>Mail</a></div></td></tr>
</table>
<div style="clear:both"></div><div class="featured_more"><a href="/portal/../aolaz/">More>></a></div>
<div class="mod_title_link"><a href="/portal/../tmz/">TMZ Breaking News</a></div>
<style type="text/css">
</style>
<div><img src="http://o.aolcdn.com/dims-global/dims3/MWAP/resize/52x52/format/jpg/http://www.blogcdn.com/www.tmz.com/media/2009/11/1102_jonny_danerous_ex_01.jpg" class="mod_img" style="padding:0px;border-left: 5px solid white;border-right: 5px solid white;float:left;" /><div class="mod_txt">They don't call him Jonnie Dangerous for nothing -- the Hollywood Hills Burglar Bunch suspect (who's...<a href="../tmz/fetchArticle.do?target=http%3A%2F%2Fwww.tmz.com%2F2009%2F11%2F03%2Fburglar-bunch-suspect-jonathan-ajar-drugs-coke-conviction-marijuana%2F">more</a></div></div>
<div style="clear:both"><!-- --></div>
<div class="newonaol_border"><a href="/portal/../games/">New on AOL: Free Games</a></div><div style="height:52px"><img class="mod_img" style="padding:0px" src="/images/orion/320/apollo/icon_dice.jpg"/><div class="mod_txt">Blast asteroids and play poker all on your phone...<a href="/portal/../games/">more</a></div></div>
</div><div style="clear:both"></div>
<div class="horo_title_link" style="margin-top:0px"><a href="/../horoscopes/default/horoscope.do?zodiac=General&amp;day=today">Horoscope: General</a></div>
<div><img src="/images/orion/320/horoscope-img_0.gif" width="52px" height="52px" class="mod_img" style="padding:0px;border-right: 5px solid white;border-left: 5px solid white;float:left"/>
<div class="mod_txt">Today's sensible Taurus Full Moon at 2:13 pm EST reminds us to simplify our lives. We imag ...<a href="/../horoscopes/default/horoscope.do?zodiac=General&amp;day=today">more</a></div></div><div style="clear:both"></div>
<div class="horo_title_link" style="margin-top:10px"><a href="/portal/../money/">Finance</a></div>
<div><img class="mod_img" src="/images/orion/320/apollo/arrow_up.gif" width="13px" height="13px"/>
<span class="mod_txt"><a href="/portal/../money/jsp/detail.jsp?s=$INDU&format=0"><b>DJIA</b></a> 9789.44 </span><span style="color:green" class="mod_txt">+76.71</span></div>
<div style="clear:both"></div>
<div style="background-color:#efefef;text-align:center;padding:7px 0px 0px;margin:0px;border-top:solid 1px #b5b1ae;font-size:smaller;">
<div><a href="/portal/../portal/menu.do?id=5050">AOL A-Z</a> | <a href="/portal/../portal/menu.do?id=8000">Help</a> | <a href="/portal/../portal/menu.do?id=5955&view=terms">These Terms Apply</a><br/>
<a href ="/portal/../portal/menu.do?id=5054&returl=%2Fportal%2F">Feedback</a> | <a href="/portal/../portal/jsp/options.jsp?returl=%2Fportal%2F">Preferences</a> | <a href="http://wap.aol.com/auth/jsp/login.jsp?src=portal&returl=http%3A%2F%2Fwap.aol.com%2Fportal%2F&canurl=http%3A%2F%2Fwap.aol.com%2Fportal%2F">Sign In</a>
</div><div class="ft_copyright" style="text-align:center;"> 2009 AOL LLC. All Rights Reserved</div></div>
</body>
</html>

Actually, I am using Selenium Core to automate the web pages. When I record a click on an image in a web page, the Selenium IDE uses an XPath such as //tr[4]/td[1]/div/a/img. In the same way, I want to extract the XPaths of all the elements and put them into a CSV file. Could anyone help me extract the XPaths from the above HTML file?
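
One possible way to approach the question above, sketched in Python because the poster's Selenium scripts are in Python. This is not the poster's code or a Selenium feature. It is a minimal sketch that walks the markup with the standard library's html.parser, keeps a stack of open elements, and records an absolute positional XPath for every a and img element, then formats the rows as CSV. The names XPathCollector, link_xpaths, to_csv and SAMPLE are hypothetical, and SAMPLE is a cut-down stand-in for the page above. The paths describe the source markup, so they may differ from what the browser evaluates (for example, a browser parsing the page as HTML may insert tbody under table).

```python
import csv
import io
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Record a positional XPath for every element whose tag is in `wanted`."""

    def __init__(self, wanted=("a", "img")):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.path = []      # steps of the currently open elements
        self.counts = [{}]  # counts[-1]: tag -> children seen under the open element
        self.rows = []      # (xpath, tag, href-or-src)

    def _enter(self, tag, attrs):
        # Position of this element among same-named siblings (1-based, as in XPath).
        n = self.counts[-1].get(tag, 0) + 1
        self.counts[-1][tag] = n
        step = f"{tag}[{n}]"
        if tag in self.wanted:
            attrs = dict(attrs)
            target = attrs.get("href") or attrs.get("src") or ""
            self.rows.append(("/" + "/".join(self.path + [step]), tag, target))
        return step

    def handle_starttag(self, tag, attrs):
        step = self._enter(tag, attrs)
        if tag not in VOID:
            self.path.append(step)
            self.counts.append({})

    def handle_startendtag(self, tag, attrs):
        # Self-closed element such as <img .../>: nothing to push.
        self._enter(tag, attrs)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name; ignore stray closers.
        for i in range(len(self.path) - 1, -1, -1):
            if self.path[i].split("[")[0] == tag:
                del self.path[i:]
                del self.counts[i + 1:]
                break


def link_xpaths(html, wanted=("a", "img")):
    parser = XPathCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["xpath", "tag", "target"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical, shortened stand-in for the posted page.
SAMPLE = """<html><body>
<table><tr><td>Featured</td></tr>
<tr><td><div><a href="/portal/../news/"><img src="/images/icon_news.jpg"/>News</a></div></td>
<td><div><a href="/portal/../fanhouse/"><img src="/images/icon_sports.jpg"/>FanHouse</a></div></td></tr>
</table></body></html>"""

if __name__ == "__main__":
    print(to_csv(link_xpaths(SAMPLE)), end="")
```

For SAMPLE, the first data row has the XPath /html[1]/body[1]/table[1]/tr[2]/td[1]/div[1]/a[1]. An explicit [1] on every step is redundant but valid XPath, and it keeps the paths unambiguous. html.parser was chosen over an XML parser here because the posted page contains unescaped & characters and a stray closing div, which a strict XML parser would reject.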

Re: How to find an xpath from an web page
by Corion (Pope) on Nov 04, 2009 at 09:10 UTC
Re: How to find an xpath from an web page
by Anonymous Monk on Nov 04, 2009 at 09:50 UTC
