http://www.perlmonks.org?node_id=11119442

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, who are always smarter than me. I'm using Mojo::DOM in a script to parse several HTML files. I can extract most of the text I need, but there is one piece of HTML that I can't seem to get. Here's the HTML that I'm trying to pull:

<div class="abstract-content selected" id="enc-abstract"> <p>Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two major types of bacterial metabolites in the colon... </p>

And here is the code that I've been trying to use. The commented-out lines are my many attempts to get it working, based on some of my older code and other code out there.

#<div class="abstract-content selected" id="enc-abstract">
# my $abstr = $dom1->find('p[strong.class^="sub-title"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
#div[enc-abstract]
#'div["abstract-content selected" id^="enc-abstract"]'
#'div[class^="officerOuter officerInner"]
# $r2 = $dom2->find( '[class="1YTREF"]' ) -> map( sub{ $_->text } );
#my $abstr = $dom1->find('div[class^="abstract-content selected"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
# $dom->find('div.openTime')
#     ->map(sub{$_->children->each})
#     ->map(sub{$_->text})
#     ->each;
# my $abstr = $dom1->at('abstract-content selected')
#     ->find('p')
#     ->map( sub{ $_->text } );
my $abstr = $dom1->find('div[class^="abstract-content selected"]')->map( sub{ $_->text } )->map( sub{ s|\n| |gr } );
#my $abstr = $dom1->at('#abstract-content')->find('p')->each(sub {
#     $_; # this is the current element
# });
# my $abstr = $dom1->find('.abstract-content selected')
#     ->map(sub{$_->children->each})
#     ->map( sub{ $_->text } );
#     #->map( sub{ s|\n| |gr } );
# print "Abstract is: $abstr\n\n";

I believe it's the paragraph in the HTML after the class that is messing things up. I read the Mojo docs over and over and tried a few things with child nodes, but failed miserably. Any help would be greatly appreciated. It's probably very simple and I was just overthinking it, up late and tired.

Replies are listed 'Best First'.
Re: Mojo::DOM parsing question
by marto (Cardinal) on Jul 17, 2020 at 07:49 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';
    use Mojo::DOM;

    my $html = '<div class="abstract-content selected" id="enc-abstract"> <p>Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two major types of bacterial metabolites in the colon... </p>';
    my $dom = Mojo::DOM->new( $html );

    foreach my $abstract ( $dom->find('div.abstract-content > p')->each ){
        say $abstract->text;
    }

    Prints:

    Secondary bile acids (BAs) and short chain fatty acids (SCFAs), two major types of bacterial metabolites in the colon...

    If this isn't what you want/expect please post (or link to) an HTML file you are working with, and clarify your requirements.
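A side note on why the attribute-selector attempt in the question came back empty: div[class^="abstract-content selected"] does match the div, but text only returns the text sitting directly inside the element it is called on, not the text of child elements such as <p>. A minimal sketch, assuming a trimmed-down version of the posted HTML (closing </div> added, inner whitespace removed, abstract shortened):

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Assumption: simplified stand-in for the real page.
my $dom = Mojo::DOM->new(
    '<div class="abstract-content selected" id="enc-abstract"><p>Secondary bile acids</p></div>'
);

# The selector from the question: this finds the div itself.
my $div = $dom->at('div[class^="abstract-content selected"]');

my $own = $div->text;        # text directly inside the div only
my $all = $div->all_text;    # text of the div and all its descendants

say "text:     '$own'";
say "all_text: '$all'";
```

So either select the <p> itself, as in the reply above, or call all_text on the div.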

      Thanks very much Marto! Just so I understand: does the > p specify the text between the <p> and </p> that comes after the 'div.abstract-content'? Thanks!!

      2020-07-21 Athanasius added code tags.

        This is just a CSS selector

        $dom->find('div.abstract-content > p')->each

        Here we search for each <p> tag that is a direct child of a div with the class abstract-content.

        MDN has a nice section on CSS Selectors, and plenty of other resources too. To quote its Child combinator page:

        "The child combinator (>) is placed between two CSS selectors. It matches only those elements matched by the second selector that are the direct children of elements matched by the first."

        The MDN is linked to in the basics section of the overall Mojo documentation, https://mojolicious.org/perldoc. If you have any other questions let me know.
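To make the difference concrete, here is a small sketch comparing the child combinator (>) with the descendant combinator (a plain space). The markup is made up for illustration and is not from the original page:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Made-up markup: one <p> directly inside the div, one nested deeper.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="abstract-content selected" id="enc-abstract">
  <p>direct child</p>
  <section><p>nested deeper</p></section>
</div>
HTML

# Child combinator: only <p> elements that are direct children of the div.
my @children = $dom->find('div.abstract-content > p')->map('text')->each;

# Descendant combinator: every <p> anywhere inside the div.
my @descendants = $dom->find('div.abstract-content p')->map('text')->each;

say "child combinator matched ",      scalar(@children),    " element(s)";
say "descendant combinator matched ", scalar(@descendants), " element(s)";
```

With the HTML posted in the question both selectors would find the same single <p>, since it sits directly inside the div.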

      I seem to be getting stuck trying to define my variable to print based on your help. I'm using:
      my $abstr = $dom1->find('div.abstract-content > p')->each;
      but getting an error.

        each returns a list but you are calling it in scalar context. Use list context as marto did and you will be fine. See also: Context tutorial.
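A short sketch of what context does to each when it is called without a callback. The two-paragraph markup is an assumption for illustration:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Assumption: made-up markup with two matching paragraphs.
my $dom = Mojo::DOM->new(
    '<div class="abstract-content"><p>first</p><p>second</p></div>'
);
my $results = $dom->find('div.abstract-content > p');

# List context: one Mojo::DOM object per match.
my @elements = $results->each;

# Scalar context: the number of matches, not an element.
my $count = $results->each;

# Parentheses on the left give list context; this keeps the first element.
my ($first) = $results->each;

say "elements: ", scalar(@elements);
say "count:    $count";
say "first:    ", $first->text;
```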

        but getting an error.

        Don't keep it a secret. Always provide the full text of the error message.

        Is there more than one abstract per HTML page? My example uses foreach to print the text for each match we find, using the selector which matches your requirement. See find. If you know for sure that each page has exactly one abstract, you could do something like my $abstract = $dom->at('div.abstract-content > p')->text; instead. If you post the error you get, maybe I can provide more help; see also How do I post a question effectively?. In your example, $abstr will contain the number of matches.
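One thing to watch with at: it returns undef when nothing matches, so chaining ->text straight onto it dies on a page that has no abstract. A small sketch of a guarded version, assuming a hypothetical helper named abstract_text and made-up test markup:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper (not part of Mojo::DOM): returns the abstract text,
# or an empty string when the page has no matching <p>.
sub abstract_text {
    my ($html) = @_;
    my $p = Mojo::DOM->new($html)->at('div.abstract-content > p');
    return '' unless defined $p;

    # Collapse newlines and runs of whitespace, then trim both ends.
    my $text = $p->text =~ s/\s+/ /gr;
    $text =~ s/^ | $//g;
    return $text;
}

# Assumption: made-up inputs, one with an abstract and one without.
my $with    = abstract_text(qq{<div class="abstract-content selected"><p> Secondary bile\nacids </p></div>});
my $without = abstract_text('<p>nothing here</p>');

say "with:    '$with'";
say "without: '$without'";
```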

Re: Mojo::DOM parsing question
by perlfan (Vicar) on Jul 17, 2020 at 11:18 UTC