
How to write CSS selector to extract more than one value from html source using scrappy module?

by shivanisai (Initiate)
on May 16, 2011 at 11:55 UTC (#905050=perlquestion)
shivanisai has asked for the wisdom of the Perl Monks concerning the following question:

Look at the following HTML source:
<div><p><a href="http://www.somesite.com.br/site/lojavirtual/produtos.asp?id=2507"><img alt="ESPELHO RETROVISOR - S00224 - SAFETY" src="http://www.somesite.com.br/site/lojavirtual/produtos/2507/peq.jpg" /> </a></div>
If I write css selector for this html source as
$scraper2->select('div p a')->data;

This extracts the href value of the <a> tag. But I need a single CSS selector to extract both the href value and the <img> src value. How can I write such a selector? Alternatively, could you suggest any sites that explain how to write CSS selectors efficiently?


Replies are listed 'Best First'.
Re: How to write CSS selector to extract more than one value from html source using scrappy module?
by Corion (Pope) on May 16, 2011 at 12:03 UTC

    CSS selectors cannot extract attributes.

    You can try to extract the node and the child node in two passes. It seems that Scrappy uses Web::Scraper, so maybe learning about how to do things using Web::Scraper will help you.

    I would guess that the ->focus method will allow you to select a node and its child nodes, and then you can select the link together with the img tag.

Re: How to write CSS selector to extract more than one value from html source using scrappy module?
by Anonymous Monk on May 16, 2011 at 12:05 UTC
    But I need a single CSS selector to extract both href

    No, you absolutely do not need a single CSS selector.

      Based on the Scrappy synopsis, you might use

      $scraper->crawl(
          'http://www.example.com/page',
          '/page' => {
              'div p a'   => sub { print $_[1]->{href}, "\n"; },
              'div p img' => sub { print $_[1]->{src},  "\n"; },
          }
      );

      The selectors are applied in turn, which is not that useful.

      Scrappy::Scraper::Parser further convinces me Scrappy has too much Pee.

      Pure Web::Scraper looks simpler to manage
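
For comparison, here is a minimal sketch of the plain Web::Scraper approach, assuming the module is installed from CPAN. It applies two selectors to the same document and reads the attribute from each match with the '@attr' notation, so no single selector has to do both jobs. The variable names and the result keys (href, src) are illustrative choices, and the HTML is the snippet from the question with the line-wrap artifacts removed.

```perl
use strict;
use warnings;
use Web::Scraper;

# HTML from the question (the unclosed <p> is left as posted).
my $html = <<'HTML';
<div><p><a href="http://www.somesite.com.br/site/lojavirtual/produtos.asp?id=2507"><img alt="ESPELHO RETROVISOR - S00224 - SAFETY" src="http://www.somesite.com.br/site/lojavirtual/produtos/2507/peq.jpg" /> </a></div>
HTML

# One scraper, two rules: each rule pairs a CSS selector with the
# attribute to pull from the first matching node.
my $product = scraper {
    process 'div p a',     'href' => '@href';
    process 'div p a img', 'src'  => '@src';
};

# Passing a scalar reference scrapes an in-memory string instead of a URL.
my $result = $product->scrape(\$html);

print "href: $result->{href}\n";
print "src:  $result->{src}\n";
```

The selector only locates the node; the attribute is picked separately by the second argument to process, which is why the two values come from two rules rather than one selector.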
