extracting sub elements from DOM by class

by Discipulus (Canon)
on Mar 26, 2021 at 14:30 UTC (#11130387)

Discipulus has asked for the wisdom of the Perl Monks concerning the following question:

Hello folks,

I must admit: I don't understand the DOM, and scraping websites is a pain if you don't know it.

I'm using Mojo::DOM to parse a document, but I also want to select and grab some sub-elements by specifying their class attribute.

The following code is the best I was able to produce, but I want to know, for example, whether a phone number comes from class="fa fa fa-phone" or class="fa fa fa-mobile-phone" (in fact Francesco Petrarca has no mobile number), and I also want to grab the URL of the avatar image.

I hope my code and data are not too big to read.

    use strict;
    use warnings;
    use Mojo::DOM;

    my $data = join '', <DATA>;
    my $dom  = Mojo::DOM->new( $data );

    foreach my $memb ( $dom->find('[id="members-list"] li')->each ){
        print "\n########\n";
        my $writers_list = $memb
            ->find('*')
            ->map( 'text' )
            ->grep( qr/\S/ )
            ->join("\n")
            ;
        print $writers_list;
    }

    __DATA__
    <ul id="members-list" class="item-list" role="main">
    <li>
      <div class="item-avatar">
        <a href="https://intra.example.com/coworkers/dantealighieri/"><img src="SRCURL/></a>
        <span class="member-role">Sottoscrittore</span>
      </div>
      <!-- .item-avatar -->
      <div class="item">
        <div class="item-title">
          <a href="https://intra.example.com/coworkers/dantealighieri/" class="heading"><h3>Dante Alighieri</h3></a>
        </div>
        <div class="item-meta"><span class="activity">active 6 days ago, 19 hours ago</span></div>
        <div class="woffice-xprofile-list">
          <span><i class="fa fa fa-phone"></i>011111111</span>
          <span><i class="fa fa fa-mobile-phone"></i>333333333</span>
          <span><i class="fa fa fa-envelope-o"></i>dante.alighieri@example.com</span>
          <span><i class="fa fa fa-check"></i>Poets and Writers</span>
        </div>
      </div>
      <div class="action"></div>
      <div class="clear"></div>
    </li>
    <li>
      <div class="item-avatar">
        <a href="https://intra.example.com/coworkers/francescopetrarca/"><img src="SRCURL/></a>
        <span class="member-role">Sottoscrittore</span>
      </div>
      <!-- .item-avatar -->
      <div class="item">
        <div class="item-title">
          <a href="https://intra.example.com/coworkers/francescoptetrarca/" class="heading"><h3>Francesco Petrarca</h3></a>
        </div>
        <div class="item-meta"><span class="activity">active 7 days ago, 22 hours ago</span></div>
        <div class="woffice-xprofile-list">
          <span><i class="fa fa fa-phone"></i>02222222</span>
          <span><i class="fa fa fa-mobile-phone"></i></span>
          <span><i class="fa fa fa-envelope-o"></i>francesco.petrarca@example.com</span>
          <span><i class="fa fa fa-check"></i>Poets and Writers</span>
        </div>
      </div>
      <div class="action"></div>
      <div class="clear"></div>
    </li>
    </ul>

In my mind I'd like to populate a hash like:

    my %members = (
        'Dante Alighieri' => {
            'avatar_url'            => 'URL',
            'fa fa fa-phone'        => '011111111',
            'fa fa fa-mobile-phone' => '333333333',
            'fa fa fa-envelope-o'   => 'dante.alighieri@example.com',
            'fa fa fa-check'        => 'Poets and Writers',
        },
        ...
    );

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re: extracting sub elements from DOM by class
by haukex (Archbishop) on Mar 26, 2021 at 21:14 UTC
    I must admit: I don't understand the DOM, and scraping websites is a pain if you don't know it.

    The basics of the DOM are actually not all too difficult - it's basically a tree structure with nodes of different types. They're often represented as objects with a base "node" class that supports methods like "what are the children of this node", and the different node types are implemented as subclasses of this node class (XML::LibXML works this way; Mojo::DOM AFAIK doesn't, but these are just implementation details). The two most common node types are "element" nodes, which represent <element>s (including their attributes), and text nodes, which represent any text in between elements. There are also "comment" nodes that represent <!-- comments -->, etc.

    In my experience, one of the things that most commonly confuses people is that this structure is very formal and rigid: the answer to a question like "what is the text content of <p>Hello, <b>cool</b> World!</p>?" is not as obvious as one might think. This <p> element has three children: the text "Hello, ", the element <b>, and the text " World!". To get all the text content, you have to walk down the tree and include the text child node "cool" of the <b> element too. Most libraries have functions that do this for you though.
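    To make the tree walk concrete, here is a minimal sketch in plain Perl. The hash-based node layout and the helper names own_text and all_text are made up for this illustration; they are not how Mojo::DOM or XML::LibXML store nodes internally.

```perl
use strict;
use warnings;

# Hypothetical mini-DOM: an element is { tag => ..., children => [...] },
# a text node is { text => ... }. This models <p>Hello, <b>cool</b> World!</p>.
my $p = {
    tag      => 'p',
    children => [
        { text => 'Hello, ' },
        { tag => 'b', children => [ { text => 'cool' } ] },
        { text => ' World!' },
    ],
};

# Text held directly by an element: only its own text-node children.
sub own_text {
    my ($node) = @_;
    return join '', map { $_->{text} }
                    grep { exists $_->{text} } @{ $node->{children} };
}

# Full text content: walk the tree depth-first and concatenate every text node.
sub all_text {
    my ($node) = @_;
    return $node->{text} if exists $node->{text};
    return join '', map { all_text($_) } @{ $node->{children} };
}

print own_text($p), "\n";   # the <b> content is missing here
print all_text($p), "\n";   # Hello, cool World!
```

    The first call skips "cool" because that text belongs to the <b> child, not to the <p> itself; the recursive version picks it up.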

    Anyway, one nice thing about Mojo::DOM is that it supports CSS selectors. These are related to the DOM of course, but they simplify finding things in the DOM a lot. They're a little bit like a more flexible XPath. See Mojo::DOM::CSS: ids can be selected via #idname and classes via .classname, with automatic handling of multiple classes. For example, your class="fa fa fa-mobile-phone" can be selected via .fa-mobile-phone or perhaps .fa.fa-mobile-phone, though interestingly I don't see a mention of the latter in the docs (it's in the W3C specs though).
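    A small sketch of those selectors in action, assuming a trimmed-down fragment of the HTML from the question (the shortened markup is only for illustration):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened sample; the real page has more wrappers around these spans.
my $html = <<'HTML';
<ul id="members-list">
  <li><span><i class="fa fa fa-phone"></i>011111111</span>
      <span><i class="fa fa fa-mobile-phone"></i>333333333</span></li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# #idname selects by id, .classname matches one class out of the list
my $mobile_icon = $dom->at('#members-list .fa-mobile-phone');

# chained classes must all be present on the same element
my $same_icon = $dom->at('.fa.fa-mobile-phone');

# the number lives in the enclosing <span>, not in the empty <i>
my $mobile = $mobile_icon->parent->all_text;

# every icon under the list carries the "fa" class
my $count = $dom->find('#members-list .fa')->size;

print "mobile: $mobile\n";
print "icons:  $count\n";
```

    Note that .fa-phone does not match the fa-mobile-phone icon: class selectors compare whole class names, not substrings.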

    Your HTML appears to be structured as a class="item-list" with <div class="item">s containing the data, so that's what I'd start with. What I think is quite strange is <span><i class="fa fa fa-phone"></i>011111111</span>, it's unclear to me why the class="fa fa fa-phone" isn't on the <span> that actually contains the data but is instead on the empty <i> in front of it. But oh well, we can deal with that too. (Update: Oh, they're Font Awesome icons.)

        use Mojo::Base -strict, -signatures;
        use Mojo::DOM;
        use Mojo::Util qw/trim dumper/;

        my $dom = Mojo::DOM->new( do { local $/; <DATA> } );

        my %members;
        $dom->find('#members-list .item')->map(sub {
            # assume only one .item-title (use ->find instead of ->at otherwise)
            my $name = trim( $_->at('.item-title')->all_text );
            $_->find('.woffice-xprofile-list .fa')->map(sub {
                my $class = $_->attr('class');
                # go up one node from the <i> to the <span>
                my $content = $_->parent->all_text;
                # assume no duplicates
                $members{$name}{$class} = $content;
            });
        });
        print dumper(\%members);
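
    The question also asked for the avatar URL, which the snippet above does not collect. One possible variation, iterating over the <li> elements so that the avatar and the details stay together. This is only a sketch: it assumes the <img src> attribute is well-formed (the data as posted is missing a closing quote there) and uses a shortened sample.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw/trim/;

# Shortened sample with a well-formed <img src>; SRCURL is the thread's placeholder.
my $html = <<'HTML';
<ul id="members-list">
<li>
  <div class="item-avatar"><a href="#"><img src="SRCURL"></a></div>
  <div class="item">
    <div class="item-title"><a class="heading"><h3>Dante Alighieri</h3></a></div>
    <div class="woffice-xprofile-list">
      <span><i class="fa fa fa-phone"></i>011111111</span>
      <span><i class="fa fa fa-mobile-phone"></i>333333333</span>
    </div>
  </div>
</li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);
my %members;

for my $li ( $dom->find('#members-list > li')->each ) {
    # skip entries without a name instead of dying on undef
    my $title = $li->at('.item-title') or next;
    my $name  = trim( $title->all_text );

    # the avatar is a sibling of .item, so look it up from the <li>
    my $img = $li->at('.item-avatar img');
    $members{$name}{avatar_url} = $img->attr('src') if $img;

    # key each value by the full class attribute of its icon
    for my $icon ( $li->find('.woffice-xprofile-list .fa')->each ) {
        $members{$name}{ $icon->attr('class') } = trim( $icon->parent->all_text );
    }
}
```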
Re: extracting sub elements from DOM by class
by tangent (Parson) on Mar 26, 2021 at 18:20 UTC
    I'm not familiar with Mojo::DOM, but this is how I would do it using HTML::TreeBuilder::XPath. If you don't know them already, you will need to learn some XPath expressions, but that is a useful thing to know.

    Note that <i class="fa fa fa-phone"></i> does not actually contain the phone number - it is a FontAwesome element that displays an icon - so you need to go up one level to get the content: $phone_icon->parent->as_text. Same goes for all the other FontAwesome elements.

    You may need to add checks for elements that are missing in some members. PS: I had to fix the image source in your example.

        use Data::Dumper;
        use HTML::TreeBuilder::XPath;

        my $data = join '', <DATA>;
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($data);
        $tree->eof;

        my %members;
        my @items = $tree->findnodes('//ul[@id="members-list"]/li');
        for my $item (@items) {
            my ($member_link) = $item->findnodes('div/div[@class="item-title"]/a');
            my $member = $member_link->as_text;
            my ($avatar_img) = $item->findnodes('div[@class="item-avatar"]/a/img');
            my $avatar = $avatar_img->attr('src');
            my ($phone_icon) = $item->findnodes('div//i[@class="fa fa fa-phone"]');
            my $phone = $phone_icon->parent->as_text;
            my ($mobile_icon) = $item->findnodes('div//i[@class="fa fa fa-mobile-phone"]');
            my $mobile = $mobile_icon->parent->as_text;
            $members{$member} = {
                'avatar_url'            => $avatar,
                'fa fa fa-phone'        => $phone,
                'fa fa fa-mobile-phone' => $mobile,
            };
        }
        print Dumper(\%members);

        __DATA__
        <ul id="members-list" class="item-list" role="main">
        <li>
          <div class="item-avatar">
            <a href="https://intra.example.com/coworkers/dantealighieri/"><img src="SRCURL"></a>
            <span class="member-role">Sottoscrittore</span>
          </div>
          <!-- .item-avatar -->
          <div class="item">
            <div class="item-title">
              <a href="https://intra.example.com/coworkers/dantealighieri/" class="heading"><h3>Dante Alighieri</h3></a>
            </div>
            <div class="item-meta"><span class="activity">active 6 days ago, 19 hours ago</span></div>
            <div class="woffice-xprofile-list">
              <span><i class="fa fa fa-phone"></i>011111111</span>
              <span><i class="fa fa fa-mobile-phone"></i>333333333</span>
              <span><i class="fa fa fa-envelope-o"></i>dante.alighieri@example.com</span>
              <span><i class="fa fa fa-check"></i>Poets and Writers</span>
            </div>
          </div>
          <div class="action"></div>
          <div class="clear"></div>
        </li>
        <li>
          <div class="item-avatar">
            <a href="https://intra.example.com/coworkers/francescopetrarca/"><img src="SRCURL"></a>
            <span class="member-role">Sottoscrittore</span>
          </div>
          <!-- .item-avatar -->
          <div class="item">
            <div class="item-title">
              <a href="https://intra.example.com/coworkers/francescoptetrarca/" class="heading"><h3>Francesco Petrarca</h3></a>
            </div>
            <div class="item-meta"><span class="activity">active 7 days ago, 22 hours ago</span></div>
            <div class="woffice-xprofile-list">
              <span><i class="fa fa fa-phone"></i>02222222</span>
              <span><i class="fa fa fa-mobile-phone"></i></span>
              <span><i class="fa fa fa-envelope-o"></i>francesco.petrarca@example.com</span>
              <span><i class="fa fa fa-check"></i>Poets and Writers</span>
            </div>
          </div>
          <div class="action"></div>
          <div class="clear"></div>
        </li>
        </ul>

    Output:

        $VAR1 = {
                  'Dante Alighieri' => {
                                         'avatar_url' => 'SRCURL',
                                         'fa fa fa-phone' => '011111111',
                                         'fa fa fa-mobile-phone' => '333333333'
                                       },
                  'Francesco Petrarca' => {
                                            'avatar_url' => 'SRCURL',
                                            'fa fa fa-phone' => '02222222',
                                            'fa fa fa-mobile-phone' => ''
                                          }
                };
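
    The checks for missing elements mentioned above could look roughly like this. It is a sketch only: parent_text is a hypothetical helper, and the sample is shortened so that the avatar and the mobile icon are deliberately absent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened sample: no .item-avatar block and no mobile icon for this member.
my $html = <<'HTML';
<ul id="members-list">
<li>
  <div class="item">
    <div class="item-title"><a class="heading"><h3>Francesco Petrarca</h3></a></div>
    <div class="woffice-xprofile-list">
      <span><i class="fa fa fa-phone"></i>02222222</span>
    </div>
  </div>
</li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Hypothetical helper: text of the parent of the first match, or undef if none.
sub parent_text {
    my ($node, $xpath) = @_;
    my ($hit) = $node->findnodes($xpath);
    return defined $hit ? $hit->parent->as_text : undef;
}

my %members;
for my $item ( $tree->findnodes('//ul[@id="members-list"]/li') ) {
    my ($link) = $item->findnodes('div/div[@class="item-title"]/a');
    next unless defined $link;    # no name, nothing to key the entry on

    my ($img) = $item->findnodes('div[@class="item-avatar"]/a/img');
    $members{ $link->as_text } = {
        'avatar_url'            => defined $img ? $img->attr('src') : undef,
        'fa fa fa-phone'        => parent_text( $item, 'div//i[@class="fa fa fa-phone"]' ),
        'fa fa fa-mobile-phone' => parent_text( $item, 'div//i[@class="fa fa fa-mobile-phone"]' ),
    };
}
$tree->delete;    # free the tree once the data has been copied out
```

    Missing pieces end up as undef in the hash rather than causing a "Can't call method on an undefined value" error.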
Re: extracting sub elements from DOM by class
by perlfan (Vicar) on Mar 27, 2021 at 15:58 UTC
    Mucking around with Web::Scraper provided an excellent environment for me to "get it", since I mostly learn by doing (and lots of head-desking due to PEBKAC). You may want to at least give that a shot, even if it is to get another perspective.
