PerlMonks  

Re^2: need help in scraping asp site

by Athanasius (Prior)
on Sep 06, 2012 at 07:31 UTC ( #992021=note )


in reply to Re: need help in scrapping asp site
in thread How to scraper ASP websites

When added to a regex, the x modifier tells the regex engine to ignore whitespace in the pattern: unescaped spaces, tabs, and newlines outside a character class are dropped from the pattern to be matched, and an unescaped # starts a comment that runs to the end of the line. So, if you are trying to match something like:

<td class="countryValue">
#  ^ note the space

and your regex has an x modifier, you must specify the space(s) to be matched explicitly. For example:

<td \s+ class="countryValue">
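To see the effect in isolation, here is a minimal sketch. The one-line $html sample and the variable names are made up for illustration and are not taken from the real page; only the class name comes from the post.

```perl
use strict;
use warnings;

# Hypothetical one-line sample, not taken from the real page.
my $html = '<td class="countryValue">Spain</td>';

# Under /x the literal space after "<td" is dropped from the pattern,
# so this effectively looks for '<tdclass=...' and fails to match.
my $without_ws = $html =~ m!<td class="countryValue"> (.*?) </td>!x ? 1 : 0;

# With an explicit \s+ the whitespace in the target is matched again.
my ($country) = $html =~ m!<td \s+ class="countryValue"> (.*?) </td>!x;

print "without \\s+: ", ($without_ws ? "match" : "no match"), "\n";
print "with \\s+: ", (defined $country ? $country : "no match"), "\n";
```

The first pattern fails only because of the x modifier; without /x the same pattern text would match the sample.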

That said, when I run your code with this fix applied:

while ($content =~ m! tr \s+ class="bgrow1">
    <td> (.*?) +                                        # $1
    </td>
    <td \s+ class="countryValue"> (.*?) +               # $2 country
    </td>
    <td \s+ class="destnameValue"> (.*?) +              # $3 destination
    </td>
    <td \s+ class="hotelNameValue"> (.*?) +             # $4
    </td>
    <td \s+ class="durationValue"> (.*?) +              # $5 trip_length
    </td>
    <td \s+ align="RIGHT" \s+ class="priceValue">
    <a \s+ target="_blank" \s+ href="(.*?)"> +          # $6 url
    (.*?) +                                             # $7
    </a>
    </td>
    !gisxm)

the regex still gets no matches, so there is more wrong than just the missing whitespace. (Or, there is more whitespace lurking in the target webpages than I have allowed for.) For further help from the monks, please follow the advice given above by davido, and reduce your problem to a minimal code snippet demonstrating the problem and complete with representative data.
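A reduced test case of the kind davido asked for might look roughly like the sketch below. The HTML in the heredoc is a hypothetical stand-in, not the real page: the class names come from the regex above, but the row layout and values are invented. The pattern is cut down to three columns and allows optional whitespace (\s*) between cells, which is an assumption about what the real markup needs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page; replace with a real excerpt.
my $content = <<'END_HTML';
<tr class="bgrow1">
<td>1</td>
<td class="countryValue">Spain</td>
<td class="destnameValue">Majorca</td>
</tr>
<tr class="bgrow1">
<td>2</td>
<td class="countryValue">Greece</td>
<td class="destnameValue">Crete</td>
</tr>
END_HTML

my @rows;
while ($content =~ m! <tr \s+ class="bgrow1"> \s*
                      <td> (.*?) </td> \s*                            # row number
                      <td \s+ class="countryValue">  (.*?) </td> \s*  # country
                      <td \s+ class="destnameValue"> (.*?) </td>      # destination
                    !gisx) {
    # Copy the captures before the next match overwrites them.
    push @rows, [$1, $2, $3];
}

print join(' | ', @$_), "\n" for @rows;
```

Pasting a few real rows from the saved page source into the heredoc shows quickly whether the mismatch is in the pattern or in what $content actually holds.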

BTW, the variable $airport is accessed in the final print statement, but never initialized. You would have seen this if you had begun the script with

use strict;
use warnings;

as Gangabass advised in Re: How to scraper ASP websites.
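As a small illustration of what the pragmas report, the sketch below assumes $airport was declared with my but never assigned. If it was never declared at all, use strict would instead stop compilation with a 'Global symbol "$airport" requires explicit package name' error. The $SIG{__WARN__} handler is only there to capture the warning for display and is not part of the original script.

```perl
use strict;
use warnings;

# Collect warnings so the effect can be inspected, not just sent to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $airport;                       # declared but never assigned (assumption)
my $line = "Airport: $airport\n";  # interpolating undef triggers a warning

print $line;
print "warning seen: $_" for @warnings;
```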

Athanasius <°(((><contra mundum


Re^3: need help in scraping asp site
by Anonymous Monk on Sep 06, 2012 at 07:44 UTC
    Thanks for your reply, but my concern is that $content does not hold the page contents in a usable format, which is why the regex will not work either: the page source contains ASP and JavaScript syntax. Please try running this program and let me know whether you are able to produce the output.
        Even in this case, WWW::Mechanize::Firefox is not helpful either. Can you tell me what the best way to do this would be?
        Still not solved. Can you tell me how I can get this data, and with which modules?
