PerlMonks  

How to write CSS selector to extract more than one value from html source using scrappy module?

by shivanisai (Initiate)
on May 16, 2011 at 11:55 UTC (#905050=perlquestion)
shivanisai has asked for the wisdom of the Perl Monks concerning the following question:

Consider the following HTML source:
<div><p><a href="http://www.somesite.com.br/site/lojavirtual/produtos.asp?id=2507"><img alt="ESPELHO RETROVISOR - S00224 - SAFETY" src="http://www.somesite.com.br/site/lojavirtual/produtos/2507/peq.jpg" /> </a></div>
If I write the CSS selector for this HTML source as
$scraper2->select('div p a')->data;

This extracts the {href} value of the <a> tag. But I need a single CSS selector to extract both href and <img> src values. How can I write such a selector? Or could you point me to any sites that explain how to write CSS selectors efficiently?


Replies are listed 'Best First'.
Re: How to write CSS selector to extract more than one value from html source using scrappy module?
by Corion (Pope) on May 16, 2011 at 12:03 UTC

    CSS selectors select nodes; they cannot extract attributes by themselves.

    You can try to extract the node and the child node in two passes. It seems that Scrappy uses Web::Scraper, so maybe learning about how to do things using Web::Scraper will help you.

    I would guess that the ->focus method will allow you to select a node and its child nodes, and then you can select the link together with the img tag.
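    The two-pass idea can be sketched with plain Web::Scraper rather than Scrappy's ->focus, whose behavior is only guessed at above. This is a minimal sketch, assuming Web::Scraper is installed from CPAN; the sub name link_image_pairs and the links key are made up for illustration. The selector picks each <a> node, and a callback then walks down from that node to its <img> child.

```perl
use strict;
use warnings;
use Web::Scraper;    # CPAN module, not part of core Perl

# Hypothetical helper: returns an array ref of { href => ..., src => ... }.
sub link_image_pairs {
    my ($html) = @_;

    my $extractor = scraper {
        # Pass 1: the CSS selector picks every matching <a> node.
        process 'div p a', 'links[]' => sub {
            my $link = shift;    # HTML::Element for the <a>
            # Pass 2: find the first <img> below that node, if any.
            my $img = $link->look_down( _tag => 'img' );
            return {
                href => $link->attr('href'),
                src  => $img ? $img->attr('src') : undef,
            };
        };
    };

    my $result = $extractor->scrape($html);
    return $result->{links} || [];
}

# The fragment from the question.
my $html = <<'HTML';
<div><p><a href="http://www.somesite.com.br/site/lojavirtual/produtos.asp?id=2507"><img alt="ESPELHO RETROVISOR - S00224 - SAFETY" src="http://www.somesite.com.br/site/lojavirtual/produtos/2507/peq.jpg" /> </a></div>
HTML

for my $pair ( @{ link_image_pairs($html) } ) {
    print "href: $pair->{href}\n";
    print "src:  ", ( defined $pair->{src} ? $pair->{src} : '(none)' ), "\n";
}
```

    Keeping href and src in one hash per link means each image stays paired with its own link when the page contains several products.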

Re: How to write CSS selector to extract more than one value from html source using scrappy module?
by Anonymous Monk on May 16, 2011 at 12:05 UTC
    But I need a single CSS selector to extract both href

    No, you absolutely do not need a single CSS selector.

      Based on the Scrappy synopsis, you might use:
      $scraper->crawl(
          'http://www.example.com/page',
          '/page' => {
              'div p a'   => sub { print $_[1]->{href}, "\n" },
              'div p img' => sub { print $_[1]->{src},  "\n" },
          },
      );
      The selectors are applied in turn, one after the other, which is not that useful.

      Scrappy::Scraper::Parser further convinces me Scrappy has too much Pee.

      Pure Web::Scraper looks simpler to manage.
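      For comparison, here is a rough sketch of the pure Web::Scraper version, assuming the module is installed from CPAN. The sub name href_and_src and the result keys are invented for this example. Each rule pairs a selector, which picks the node, with an '@attr' spec, which picks the attribute, so two short rules replace the single selector the question asks for.

```perl
use strict;
use warnings;
use Web::Scraper;    # CPAN module, not part of core Perl

# Hypothetical helper: returns a hash ref with 'href' and 'src' keys.
sub href_and_src {
    my ($html) = @_;

    my $extractor = scraper {
        # The selector chooses the node; '@name' reads that attribute.
        process 'div p a',   'href' => '@href';
        process 'div p img', 'src'  => '@src';
    };

    # No base URI is passed, so attribute values come back as plain strings.
    return $extractor->scrape($html);
}

# The fragment from the question.
my $html = <<'HTML';
<div><p><a href="http://www.somesite.com.br/site/lojavirtual/produtos.asp?id=2507"><img alt="ESPELHO RETROVISOR - S00224 - SAFETY" src="http://www.somesite.com.br/site/lojavirtual/produtos/2507/peq.jpg" /> </a></div>
HTML

my $result = href_and_src($html);
print "href: $result->{href}\n";
print "src:  $result->{src}\n";
```

      Without the [] suffix on a key, only the first matching node is kept, which is enough for a fragment with one link and one image.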

Node Type: perlquestion [id://905050]
Approved by Corion
Front-paged by toolic