Mojo::DOM parsing question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks who are always smarter than me. I'm using Mojo::DOM in a script to parse several HTML files. I can return most of the text that I need, but there is one piece of HTML that I can't seem to get. Here's the HTML that I'm trying to pull:

<div class="abstract-content selected" id="enc-abstract">
<p>Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two
major types of bacterial metabolites in the colon...
</p>

And here is the code that I've been trying to use. The commented-out items are my many attempts to get it to work, based on some of my older code and other code out there.

 #<div class="abstract-content selected" id="enc-abstract">
            
              # my $abstr = $dom1->find('p[strong.class^="sub-title"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
              
              #div[enc-abstract]
              
              #'div["abstract-content selected" id^="enc-abstract"]'
              
              #'div[class^="officerOuter officerInner"]
              
              #    $r2   = $dom2->find( '[class="1YTREF"]'  ) -> map( sub{ $_->text } );
              
              #my $abstr = $dom1->find('div[class^="abstract-content selected"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
              
              
             # $dom->find('div.openTime')
             #             ->map(sub{$_->children->each})
             #             ->map(sub{$_->text})
             #             ->each;
              
              
             
            # my $abstr = $dom1->at('abstract-content selected')
            #                   ->find('p')
            #                  ->map( sub{ $_->text } );
                              
              my $abstr = $dom1->find('div[class^="abstract-content selected"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
                
                              
            #my $abstr = $dom1->at('#abstract-content')->find('p')->each(sub {
                            #      $_; # this is the current element
                            #  });
                            
              
              
        #         my $abstr = $dom1->find('.abstract-content selected')
        #         ->map(sub{$_->children->each})
        #        ->map( sub{ $_->text } );
        #         #->map( sub{ s|\n| |gr } );
        #         print "Abstract is: $abstr\n\n";

I believe it's the paragraph in the HTML after the class that is messing things up. I read the Mojo docs over and over and tried a few things with child nodes but failed miserably. Any help would be greatly appreciated. It's probably very simple and I was just overthinking it, up late and tired.


Replies are listed 'Best First'.
Re: Mojo::DOM parsing question by marto (Cardinal) on Jul 17, 2020 at 07:49 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';
    use Mojo::DOM;

    my $html = '<div class="abstract-content selected" id="enc-abstract"> <p>Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two major types of bacterial metabolites in the colon... </p>';

    my $dom = Mojo::DOM->new( $html );

    foreach my $abstract ( $dom->find('div.abstract-content > p')->each ){
        say $abstract->text;
    }

Prints:

    Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two major types of bacterial metabolites in the colon...

If this isn't what you want/expect, please post (or link to) an HTML file you are working with, and clarify your requirements.
Re^2: Mojo::DOM parsing question by Anonymous Monk on Jul 17, 2020 at 07:56 UTC
Thanks very much Marto! Just so I understand: does the `> p` specify the text between the `<p>` and `</p>` that comes after the 'div.abstract-content'? Thanks!! 2020-07-21 Athanasius added code tags.
Re^3: Mojo::DOM parsing question by marto (Cardinal) on Jul 17, 2020 at 08:02 UTC
This is just a CSS selector: `$dom->find('div.abstract-content > p')->each`. Here we search for every `<p>` element that is a direct child of a `div` with class `abstract-content`. MDN has a nice section on CSS Selectors, and there are plenty of other resources too. To quote its entry on the child combinator: "The child combinator (>) is placed between two CSS selectors. It matches only those elements matched by the second selector that are the direct children of elements matched by the first." MDN is linked from the basics section of the overall Mojo documentation, https://mojolicious.org/perldoc. If you have any other questions let me know.
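To make the difference concrete, here is a minimal sketch that compares the child combinator with the plain descendant combinator (a space). The markup is made up for illustration; the nested `<section>` is an assumption and is not in the original poster's HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical markup: one <p> directly inside the div, one nested deeper.
my $html = <<'HTML';
<div class="abstract-content selected" id="enc-abstract">
  <p>Direct child paragraph</p>
  <section><p>Nested paragraph</p></section>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Child combinator: only <p> elements that are direct children of the div.
my @children = $dom->find('div.abstract-content > p')->map('text')->each;

# Descendant combinator (a space): every <p> anywhere inside the div.
my @descendants = $dom->find('div.abstract-content p')->map('text')->each;

# The id selector reaches the same div without relying on the class names.
my @by_id = $dom->find('#enc-abstract > p')->map('text')->each;

say scalar(@children),    ' direct child match(es): ', join ' | ', @children;
say scalar(@descendants), ' descendant match(es): ',   join ' | ', @descendants;
say scalar(@by_id),       ' match(es) via id: ',       join ' | ', @by_id;
```

Note that a class selector such as `.abstract-content` matches any element whose class list contains that name, so there is no need to spell out the whole `class` attribute value.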
Re^2: Mojo::DOM parsing question by Anonymous Monk on Jul 17, 2020 at 08:17 UTC
I seem to be getting stuck trying to define my variable to print based on your help. I'm using: `my $abstr = $dom1->find('div.abstract-content > p')->each;` but getting an error.
Re^3: Mojo::DOM parsing question by hippo (Bishop) on Jul 17, 2020 at 08:26 UTC
`each` returns a list, but you are calling it in scalar context. Use list context as marto did and you will be fine. See also: Context tutorial. As for "but getting an error": don't keep it a secret. Always provide the full text of the error message.
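A small sketch of the context difference, assuming a made-up document with two paragraphs (the `$dom1` name is borrowed from the question; the markup is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical markup with two matching paragraphs.
my $dom1 = Mojo::DOM->new(
    '<div class="abstract-content selected" id="enc-abstract">'
  . '<p>First</p><p>Second</p></div>'
);

my $collection = $dom1->find('div.abstract-content > p');

# Scalar context: each() without a callback yields the number of elements.
my $count = $collection->each;

# List context: parentheses on the left-hand side, or an array.
my ($first) = $collection->each;
my @all     = $collection->each;

say "count: $count";
say 'first: ', $first->text;
say 'all:   ', join ', ', map { $_->text } @all;
```

The only difference between the first two assignments is the parentheses around `$first`, which is what puts the right-hand side in list context.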
Re^3: Mojo::DOM parsing question by marto (Cardinal) on Jul 17, 2020 at 08:29 UTC
Is there more than one abstract per HTML page? My example uses `foreach` to print the text of each match found by the selector that fits your requirement (see find). If you know for sure that each page has exactly one abstract, you could do something like `my $abstract = $dom->at('div.abstract-content > p')->text;`. If you post the error you get, maybe I can provide more help; see also How do I post a question effectively? In your example `$abstr` will contain the number of matches.
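One caveat with the `at` form: `at` returns `undef` when nothing matches, so chaining `->text` directly will die on a page without an abstract. A minimal sketch of a guarded version follows; the `abstract_text` helper and both sample documents are hypothetical, for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: return the abstract text, or undef if the page has none.
sub abstract_text {
    my ($dom) = @_;
    my $p = $dom->at('div.abstract-content > p');   # first match or undef
    return defined $p ? $p->text : undef;
}

# Made-up sample pages: one with an abstract, one without.
my $with    = Mojo::DOM->new('<div class="abstract-content selected" id="enc-abstract"><p>Only abstract</p></div>');
my $without = Mojo::DOM->new('<div class="other"><p>No abstract here</p></div>');

my $found   = abstract_text($with);
my $missing = abstract_text($without);

say defined $found   ? "found: $found"     : 'no abstract on first page';
say defined $missing ? "found: $missing"   : 'no abstract on second page';
```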
Re: Mojo::DOM parsing question by perlfan (Vicar) on Jul 17, 2020 at 11:18 UTC
I personally like Web::Scraper, but this talk from TPC in the Cloud this year shows some compelling alternative approaches: Bruce Gray - Refactoring and Readability: Crouching Regex, Hidden Structures. Parsing HTML is not the main point of the talk, but he does give it a good treatment. IMO easily one of the top 3 talks of the conference, if not the best.
