PerlMonks  

How to find an xpath from an web page

by perladdict (Chaplain)
on Nov 04, 2009 at 09:06 UTC (#804880)
perladdict has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, yesterday I did not provide enough details about this. I am using Selenium from Python scripts, and Selenium uses XPath to identify elements. I have written a few Python functions to validate the Selenium commands that use XPath. Using HTML::LinkExtor and a web crawling tool, I extracted the text links as well as the image links, but neither gives me the XPaths. After crawling the links, I put that data in a CSV file. In the script, I read each line of the CSV and pass it to the appropriate function.
Below is the HTML file from which I have to extract the XPath of each link.

<?xml version="1.0"?> <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd" > <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Cache-Control" content="must-revalidate"/> <meta http-equiv="Cache-Control" content="no-cache"/> <meta http-equiv="Cache-Control" content="post-check=0, pre-check=0"/> <meta http-equiv="Expires" content="Mon 26 Jul 1997 05:00:00 GMT"/> <meta name="description" content="Browse AOL.com with AOL Mobile."/> <meta name="keywords" content="aol, aol.com, internet, aol home page, +aol mobile"/> <title>AOL.com, portal, Welcome to AOL</title> <style type="text/css"> a{border:0;text-decoration:none;color:#0064a9;font-family:Arial, Helve +tica, Sans Serif} img{border:0;vertical-align:middle}form{padding:0px;margin:0px;border: +0px}body{font-family:Arial, Helvetica, Sans Serif;overflow-x:hidden} .portal_border{ padding:0px;border-top:solid 1px #b5b1ae}.portal_logo_ +img{float:left;margin-right:2px}.portal_thin_border{ padding:0px;back +ground-color:#e0e0e0}.portal_border_bot{ padding:0px;border-bottom:so +lid 1px #b5b1ae} .portal_logo_aolimg{text-align:right;float:right;} .logobar{background-color:#efefef;border-bottom:solid #bfb1ae 2px;bord +er-top:solid #bfb1ae 1px;padding-top:2px;padding-bottom:2px;margin-to +p:2px;} .weather_icon{float:left;width:145px}.weather_bar{font-size:11px;color +:#0064a9;width:100%;border-bottom:solid 1px #b5b1ae;margin-bottom:7px +} .tb_weather{font-size:11px;color:#0064a9}.tb_login{font-weight:bold;co +lor:#0064a9;text-align:right;font-size:11px;}.tb_border{border-bottom +:solid 1px #b5b1ae;margin-bottom:7px} .dlnews{font-size:12px}.dlnews_more{font-size:12px;padding-bottom:7px} +.dlnews_link{font-weight:bold;color:#0064a9;}.featured{color:#0064a9; +background-color:#f1f1f1;font-size:12px;font-weight:bold;padding:2px; +width:100%;height:30px;}.fttable{background-color: #f1f1f1;width:100% +;font-size:12px;cellpadding:3px} 
.login_128{font-size:12px;font-weight:bold;color:#0064a9;text-align:le +ft}.featleft{font-size:12px;align:left;font-weight:normal;width:50%;p +adding:4px}.featright{font-size:12px;align:left;font-weight:normal;wi +dth:50%;padding:4px} .featured_first{background-color:#f1f1f1;font-size:12px;font-weight:bo +ld;width:100%;padding-bottom:4px;border-top:solid 2px #b5b1ae;padding +-top:7px} .features_low{font-size:11px;color:#004a9;background-color:#f1f1f1;fon +t-weight:normal}.mod_title_link{font-size:12px;font-weight:bold;color +:#0064a9;}.mod_img{border:0;float:left;padding:5px}.mod_txt{font-size +:12px;color:#000000;padding:5px;padding-left:0px} .mod_title{font-size:12px;font-weight:bold;color:#000000;} div.apollo_footer{background-color:#efefef;text-align:center;padding:0 +px;margin:0px;border-top:solid 1px #b5b1ae;padding-top:7px}div.apollo +_footer a{text-decoration:none;color:#0064a9;font-size:11px;padding:2 +px} .ft_copyright{font-size:11px;color:black} .atoz{font-size:16px;color:black;padding-left:3px} div.atoz_jumplinks{ +font-size:11px;text-decoration:none;margin-top:8px}div.atoz_jumplinks + a{padding:3px;color:#0064a9} .atoz_sep{padding:0px;background-color:#808080}.atoz_downloads{backgro +und-color:#fff7f0;font-size:12px}.atoz_sec{align:left;margin-left:4px +;width:145px} div.feedback{font-size:14px;padding:2px;color:#000000} .fb_tit1{font-size:12px;font-weight:bold;padding-top:7px} .fb_tit2{font-size:11px;font-weight:bold}td{font-size:11px} .fb_btn{ border:1px solid black;background-color:#488AC7;font-weight:b +old;margin:3px;font-size:12px;font-size:x-small}div.feedback form{pad +ding:0px;margin:0px;} .featured_more{font-size:12px;font-weight:bold;color:#0064a9;margin-bo +ttom:10px;text-align:right;border-bottom:solid 1px #b5b1ae;background +-color:#f1f1f1}.srch_border{border-bottom:solid 2px #b5b1ae;padding-b +ottom:7px} .newonaol_border{border-top:solid 1px #b5b1ae;padding-top:7px;margin-t 
+op:7px;font-size:12px;font-weight:bold;color:#000000}.horo_title_link +{font-size:12px;font-weight:bold;color:#0064a9;border-top:solid 1px # +b5b1ae;padding-top:7px;margin-top:7px;} .login{font-weight:bold;color:#0064a9;width:145px;float:left;text-alig +n:right} </style> <style type="text/css"> p {display:inline} </style> </head> <body> <div align="center"> <p style="text-align:center; mode:nowrap"><a href="http://ad.m-adx.com +/redirect/?tc=&amp;as=75088&amp;us=true&amp;ss=2c447d54-42ae-4a12-af9 +e-ae36bf18ef3f&amp;by=389400&amp;zp=&amp;ft=xhtml_markup&amp;tk=&amp; +cv=389378&amp;ti=&amp;lk=22611&amp;ry=click"><img src="http://image.m +-adx.com/smartimage/75087/389371/sportsfans_215x34_7.21.09.gif?ti=&am +p;cb=1257233428763" alt="FanHouse" /></a><br/><a href="http://ad.m-ad +x.com/redirect/?tc=&amp;as=75088&amp;us=true&amp;ss=2c447d54-42ae-4a1 +2-af9e-ae36bf18ef3f&amp;by=389400&amp;zp=&amp;ft=xhtml_markup&amp;tk= +&amp;cv=389378&amp;ti=&amp;lk=22611&amp;ry=click">Click Here!</a></p> </div> <table width="100%" class="logobar" cellspacing="0" cellpadding="0" bo +rder="0"><tr><td width="70%"><a href="/mail"> <img src="/images/orion/320/apollo/icon_mail.jpg" width="21px" height= +"21px" alt="mail" /></a> <a href="/portal/menu.do?id=5704&amp;carrier=1000"><img src="/images/o +rion/320/apollo/icon_aim.jpg" width="21px" height="21px"/></a> </td><td style="text-align:right;" width="30%"><img src="/images/orion +/320/apollo/logo_aol.gif" width="52px" height="17px" alt="logo" /></t +d></tr> <tr><td width="100%" colspan="2"> <form name="searchform" method="get" action="/portal/../portal/search. 
+do"><table width="100%" cellspacing="0" cellpadding="0" border="0"><t +r> <td style="padding-top:5px" width="100%" border="0"><input type="text" + name="query" maxlength= "255" size="27" style="padding:0px;margin:0p +x;width:70%;vertical-align:middle"/> <input type="image" src="/images/orion/320/apollo/search.gif" style="w +idth:46px;height:19px;padding:0px;margin-left:-4px;border:0px;vertica +l-align:middle"/></td></tr></table><input name="invocationType" value +="centersearchbox.waphome" type="hidden"/></form> </td></tr> </table> <table class="tb_border" width="100%"><tr><td class="tb_weather" width +="50%" valign="middle"><img src="/images/weather/weather_default.gif" + width="18px" height="18px" style="float:left"/> <a href="/portal/../cityguide/jsp/mylocation.jsp?returl=%2Fportal%2F"> +Get Weather</a></td> <td class="tb_login" align="right" valign="middle" width="50%"><a href +="http://wap.aol.com/auth/jsp/login.jsp?src=portal&returl=http%3A%2F% +2Fwap.aol.com%2Fportal%2F&canurl=http%3A%2F%2Fwap.aol.com%2Fportal%2F +">Sign In</a></td> </tr></table> <div style="clear:both"></div> <div class="dlnews"><b>News</b></div> <div class="dlnews"><img src="http://o.aolcdn.com/dims-global/dims3/MW +AP/resize/62x46/format/jpg/http://portal.aolcdn.com/p/images4/1-ws-04 +-missing-baby-160jc110209.jpg" style="border-right: 5px solid white;f +loat:left;width:46" alt="News"/><span><a href="http://wap.aol.com/new +s/dlDetails.do?source=portal&link=http://news.aol.com/article/baby-sh +annon-dedrick-reported-missing/748492" class="dlnews_link">Infant Van +ishes From Her Home</a><br/>Parents Reported Her Missing Late Saturda +y Morning: Focus of Search</span></div> <div style="clear:both" ></div><div class="dlnews_more" align="right"> +Story 1 of 4 <a href="/portal/../portal/dlmorenews.do"><b>More >></b +></a></div> <table class="fttable" width="100%"><tr><td colspan="2" width="100%" c +lass="featured_first">Featured</td></tr> <tr height="30px"><td width="50%" 
class="featleft"><div><a href="/port +al/../moviefone/" ><img src="/images/orion/320/apollo/icon_mf.jpg" st +yle="padding-right:4px;border:0px;" width="18px" height="18px"/>Movi +efone</a></div></td> <td class="featright" width="50%"><div><a href="/portal/../shopping/"> +<img src="/images/orion/320/apollo/icon_shopping.jpg" style="padding- +right:4px;border:0px;" width="18px" height="18px"/>Shopping</a></div +></td></tr> <tr height="30px"><td width="50%" class="featleft"><div><a href="/port +al/../cityguide/" ><img src="/images/orion/320/apollo/icon_cg.jpg" st +yle="padding-right:4px;border:0px;" width="18px" height="18px"/>City +Guide</a></div></td> <td class="featright" width="50%"><div><a href="http://wap.mapquest.co +m"><img src="/images/orion/320/apollo/icon_mq.jpg" style="padding-rig +ht:4px;border:0px;" width="18px" height="18px"/>MapQuest</a></div></ +td></tr> <tr height="30px"><td width="50%" class="featleft"><div><a href="/port +al/../news/" ><img src="/images/orion/320/apollo/icon_news.jpg" style +="padding-right:4px;border:0px;" width="18px" height="18px"/>News</a +></div></td> <td class="featright" width="50%"><div><a href="/portal/../fanhouse/"> +<img src="/images/orion/320/apollo/icon_sports.jpg" style="padding-ri +ght:4px;border:0px;" width="18px" height="18px"/>FanHouse</a></div>< +/td></tr> <tr height="30px"><td width="50%" class="featleft"><div><a href="/port +al/../portal/jsp/redirect.jsp?url=http://www.mocospace.com/wap/partne +rs/aol/index.jsp&linkname=mocospace" ><img src="/images/orion/320/apo +llo/icon_mocospace.jpg" style="padding-right:4px;border:0px;" width=" +18px" height="18px"/>Mocospace</a></div></td> <td class="featright" width="50%"><div><a href="/mail"><img src="/imag +es/orion/320/apollo/icon_mail1.jpg" style="padding-right:4px;border:0 +px;" width="18px" height="18px"/>Mail</a></div></td></tr> </table> <div style="clear:both"></div><div class="featured_more"><a href="/por +tal/../aolaz/">More>></a></div> <div 
class="mod_title_link"><a href="/portal/../tmz/">TMZ Breaking New +s</a></div> <style type="text/css"> </style> <div><img src="http://o.aolcdn.com/dims-global/dims3/MWAP/resize/52x52 +/format/jpg/http://www.blogcdn.com/www.tmz.com/media/2009/11/1102_jon +ny_danerous_ex_01.jpg" class="mod_img" style="padding:0px;border-left +: 5px solid white;border-right: 5px solid white;float:left;" /><div c +lass="mod_txt">They don't call him Jonnie Dangerous for nothing -- th +e Hollywood Hills Burglar Bunch suspect (who's...<a href="../tmz/fetc +hArticle.do?target=http%3A%2F%2Fwww.tmz.com%2F2009%2F11%2F03%2Fburgla +r-bunch-suspect-jonathan-ajar-drugs-coke-conviction-marijuana%2F">mor +e</a></div></div> <div style="clear:both"><!-- --></div> <div class="newonaol_border"><a href="/portal/../games/">New on AOL: F +ree Games</a></div><div style="height:52px"><img class="mod_img" styl +e="padding:0px" src="/images/orion/320/apollo/icon_dice.jpg"/><div cl +ass="mod_txt">Blast asteroids and play poker all on your phone...<a h +ref="/portal/../games/">more</a></div></div> </div><div style="clear:both"></div> <div class="horo_title_link" style="margin-top:0px"><a href="/../horos +copes/default/horoscope.do?zodiac=General&amp;day=today">Horoscope: G +eneral</a></div> <div><img src="/images/orion/320/horoscope-img_0.gif" width="52px" hei +ght="52px" class="mod_img" style="padding:0px;border-right: 5px solid + white;border-left: 5px solid white;float:left"/> <div class="mod_txt">Today's sensible Taurus Full Moon at 2:13 pm EST +reminds us to simplify our lives. 
We imag ...<a href="/../horoscopes/ +default/horoscope.do?zodiac=General&amp;day=today">more</a></div></di +v><div style="clear:both"></div> <div class="horo_title_link" style="margin-top:10px"><a href="/portal/ +../money/">Finance</a></div> <div><img class="mod_img" src="/images/orion/320/apollo/arrow_up.gif" +width="13px" height="13px"/> <span class="mod_txt"><a href="/portal/../money/jsp/detail.jsp?s=$INDU +&format=0"><b>DJIA</b></a> 9789.44 </span><span style="color:green" c +lass="mod_txt">+76.71</span></div> <div style="clear:both"></div> <div style="background-color:#efefef;text-align:center;padding:7px 0px + 0px;margin:0px;border-top:solid 1px #b5b1ae;font-size:smaller;"> <div><a href="/portal/../portal/menu.do?id=5050">AOL A-Z</a> | <a href +="/portal/../portal/menu.do?id=8000">Help</a> | <a href="/portal/../p +ortal/menu.do?id=5955&view=terms">These Terms Apply</a><br/> <a href ="/portal/../portal/menu.do?id=5054&returl=%2Fportal%2F">Feedb +ack</a> | <a href="/portal/../portal/jsp/options.jsp?returl=%2Fportal +%2F">Preferences</a> | <a href="http://wap.aol.com/auth/jsp/login.jsp +?src=portal&returl=http%3A%2F%2Fwap.aol.com%2Fportal%2F&canurl=http%3 +A%2F%2Fwap.aol.com%2Fportal%2F">Sign In</a> </div><div class="ft_copyright" style="text-align:center;"> 2009 AOL +LLC. All Rights Reserved</div></div> </body> </html>

Actually, I am using Selenium Core to automate the web pages. When I record a click on an image in a web page, Selenium IDE uses an XPath such as //tr[4]/td[1]/div/a/img. In the same way, I want to extract the XPaths of all the elements and put them into a CSV file. Could anyone help me work out how to extract the XPaths from the above HTML file?
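
One possible approach, sketched in Python since the Selenium scripts are already in Python: walk the markup with the standard library's html.parser, keep a stack of open elements with per-parent sibling counters, and emit a positional XPath for every a and img tag into a CSV. The names XPathCollector and xpaths_to_csv are hypothetical, made up for this sketch. It produces absolute paths such as /html[1]/body[1]/... rather than the shorter form Selenium IDE records, and it counts elements as they appear in the source, so a browser that inserts tbody into tables may see different paths. Treat it as a starting point under those assumptions, not a definitive tool.

```python
import csv
import io
from html.parser import HTMLParser

# Elements that never have children, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Record a positional XPath for each start tag whose name is wanted."""

    def __init__(self, wanted=("a", "img")):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.stack = []      # path steps of the currently open elements
        self.counts = [{}]   # one child-name counter per open level
        self.rows = []       # (xpath, tag, href-or-src)

    def _step(self, tag):
        # Number this element among same-named siblings seen so far.
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        return "%s[%d]" % (tag, siblings[tag])

    def _record(self, tag, step, attrs):
        if tag in self.wanted:
            attrs = dict(attrs)
            target = attrs.get("href") or attrs.get("src") or ""
            self.rows.append(("/" + "/".join(self.stack + [step]), tag, target))

    def handle_starttag(self, tag, attrs):
        step = self._step(tag)
        self._record(tag, step, attrs)
        if tag not in VOID:
            self.stack.append(step)
            self.counts.append({})

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img .../>: record it, open nothing.
        self._record(tag, self._step(tag), attrs)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; a stray
        # closing tag with no open match is ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("[")[0] == tag:
                del self.stack[i:]
                del self.counts[i + 1:]
                break


def xpaths_to_csv(html_text, out):
    """Write xpath,tag,target rows to the file-like object out."""
    collector = XPathCollector()
    collector.feed(html_text)
    collector.close()
    writer = csv.writer(out)
    writer.writerow(["xpath", "tag", "target"])
    writer.writerows(collector.rows)
    return collector.rows


if __name__ == "__main__":
    # A tiny stand-in for the page above, not the real AOL markup.
    sample = ('<html><body><table><tr><td>'
              '<a href="/mail"><img src="/icon_mail.jpg"/></a>'
              '<a href="/menu"><img src="/icon_aim.jpg"/></a>'
              '</td></tr></table></body></html>')
    buf = io.StringIO()
    for xpath, tag, target in xpaths_to_csv(sample, buf):
        print(xpath, tag, target)
```

To run it against the saved page, read the file into a string and pass an open CSV file as out. Because html.parser is tolerant, unescaped & characters in href values and the extra closing div in the sample should not stop the parse, though each generated path is still worth checking in Selenium before relying on it.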

Re: How to find an xpath from an web page
by Corion (Pope) on Nov 04, 2009 at 09:10 UTC
Re: How to find an xpath from an web page
by Anonymous Monk on Nov 04, 2009 at 09:50 UTC

Node Type: perlquestion [id://804880]
Approved by broomduster