PerlMonks  

I match a pattern in regex, yet I don't get the group I wanted to extract for some reason

by SergioQ (Beadle)
on Jan 12, 2021 at 04:35 UTC ( [id://11126776]=perlquestion )

SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

So here is my simplified code, and at the bottom is just a bit of the data in the $tagl scalar.

my $tagl = $resp->decoded_content;

if($tagl =~ /<div class=\"soda.*?>(.*?)<\/div>/ism) {
    say "inside";
    my $grp = $1;
    say $grp;
}

I know I get a match, because I do get the "inside" msg. But $1 is nothing. Am far from an expert in regex, but have used this before and I have no idea why I get the match, but not the group to extract. I made the pattern match in RegEx101.com, and in there it works. And yes, I know I will have to trim the extra whitespaces, but I can't even extract the group right now.

If anyone knows what I'm missing.....thank you in advance.

I post the data in a code format because it comes out cleaner here

</div>
</div>
<div id="taglines_content" class="header">
<div class="header">
<div class="nav">
<div class="desc">Showing all 3 taglines</div>
</div>
</div>
<div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
<div class="soda even"> Wonder. Power. Courage. </div>
<div class="soda odd"> The future of justice begins with her </div>
</div>
</div>
<div class="article" id="see_also">
<h2>See also</h2>
<p>
<span class="nobr">
<a href="/title/tt0451279/plotsummary?ref_=tttg_sa_1" class="link">Plot Summary</a>
<span class="ghost">|</span>
</span>
<span class="nobr">
<a href="/title/tt0451279/synopsis?ref_=tttg_sa_2" class="link">Synopsis</a>
<span class="ghost">|</span>
</span>
<span class="nobr">
<a href="/title/tt0451279/keywords?ref_=tttg_sa_3" class="link">Plot Keywords</a>
<span class="ghost">|</span>
</span>
<span class="nobr">
<a href="/title/tt0451279/parentalguide?ref_=tttg_sa_4" class="link">Parents Guide</a>
</span>
</p>
</div>
<script>
if ('csm' in window) {

Replies are listed 'Best First'.
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by GrandFather (Saint) on Jan 12, 2021 at 05:10 UTC

    If I run:

    use strict;
    use warnings;

    my $tagl = do {local $/; <DATA>};

    if($tagl =~ /<div class=\"soda.*?>(.*?)<\/div>/ism) {
        my $grp = $1;
        print "'$grp'\n";
    }

    __DATA__
    <div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
    <div class="soda even"> Wonder. Power. Courage. </div>
    <div class="soda odd"> The future of justice begins with her </div>

    it prints:

    ' Power. Grace. Wisdom. Wonder. '

    Maybe what you are trying to match or the regex you are using isn't what you have posted?

    Note that parsing HTML/XML using regexen is generally a really bad idea. You've been told this several times already. Just in case you've forgotten why, you might like a refresher with Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Note that parsing HTML/XML using regexen is generally a really bad idea.

      The reason that it often works (for some definition of "works") is that few dynamic sites actually build and serialize a DOM tree, instead simply inserting details into (textual) templates. Regexen can match the parts of the output that come from the template, thereby selecting the insertions and extracting the desired information.

      The resulting parsers tend to be somewhat fragile, as any change to the template can invalidate the "islands" on which the regex-based scraper relies. They can still be suitable for tools that are needed quickly and for the short term, or where the inconvenience of adapting the tool when the site changes is acceptable. The upside is that regex-based parsers are relatively easy to write from inspecting the HTML page source, without requiring knowledge of DOM structure and handling, giving them a lower "barrier to entry" for programmers unfamiliar with SGML/XML/DOM concepts.

      Maybe what you are trying to match or the regex you are using isn't what you have posted?

      I will try some of the solutions others posted, I just wanted to assure you I definitely posted from the screen output. I'm the idiot who makes stupid mistakes, so this one I did multiple times, and was careful to post the proper data. It was a real head scratcher for me. Especially since it works on RegEx101.com.

        In your original node you have posted an extract from the data, not the actual data. You also posted an extract from the code. There could be many things happening in the data you have not shown and/or the code you have not shown. Trim your data to a representative sample that still exhibits the problem for you, insert it into the smallest complete code and then post that here. See also SSCCE and How to ask better questions using Test::More and sample data. You may find in preparing these that you solve the problem - this is often a handy by-product of the exercise. :-)


        🦛
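As a rough illustration of the advice above, a minimal self-contained test case might look like the sketch below. It uses the core Test::More module and one tagline from the posted extract as sample data; the expected string is an assumption based on that extract, not on the full page the original code downloads.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Sample data: a single tagline copied from the posted extract.
my $tagl = <<'HTML';
<div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
HTML

# List-context match: $grp receives the first capture group, or undef on failure.
my ($grp) = $tagl =~ /<div class=\"soda.*?>(.*?)<\/div>/ism;

ok defined $grp, 'pattern matches the sample';
is $grp, ' Power. Grace. Wisdom. Wonder. ', 'capture group holds the tagline';
```

If both tests pass on the sample but the real script still prints an empty $1, the difference lies in data or code that was not posted, which narrows the search considerably.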

Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by haukex (Archbishop) on Jan 12, 2021 at 10:31 UTC

    Do not use regular expressions to parse HTML or XML.

    use warnings;
    use strict;
    use Mojo::DOM;

    my $dom = Mojo::DOM->new(<<'HTML');
    <div id="taglines_content" class="header">
    <div class="header">
    <div class="nav">
    <div class="desc">Showing all 3 taglines</div>
    </div>
    </div>
    <div class="soda odd">Power. Grace. Wisdom. Wonder.</div>
    <div class="soda even">Wonder. Power. Courage.</div>
    <div class="soda odd">The future of justice begins with her</div>
    </div>
    HTML

    $dom->find('.soda')->each(sub { print "$_\n" });

    __END__

    <div class="soda odd">Power. Grace. Wisdom. Wonder.</div>
    <div class="soda even">Wonder. Power. Courage.</div>
    <div class="soda odd">The future of justice begins with her</div>

      use Mojo::DOM

      That did it, and was much cleaner, thank you.

        Glad to hear it! I noticed that in your code you have my $tagl =  $resp->decoded_content;, so I assume you're using an HTTP client to get the HTML. Note that Mojolicious includes Mojo::UserAgent, which has direct integration with Mojo::DOM - I showed an example here.
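For anyone who wants only the tagline text rather than the whole element, the following is a minimal sketch, assuming Mojolicious is installed. It works on an inline copy of the sample data instead of a live download; the variable names are illustrative, and the Mojo::UserAgent variant mentioned in the comment is an untested assumption with a hypothetical $url.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);

# Inline sample standing in for $resp->decoded_content.
my $html = <<'HTML';
<div id="taglines_content" class="header">
  <div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
  <div class="soda even"> Wonder. Power. Courage. </div>
  <div class="soda odd"> The future of justice begins with her </div>
</div>
HTML

# Select every div carrying the "soda" class and keep its trimmed text.
# Assumption: with Mojo::UserAgent the same chain could start from
# Mojo::UserAgent->new->get($url)->result->dom instead of Mojo::DOM->new($html).
my @taglines = Mojo::DOM->new($html)
    ->find('div.soda')
    ->map(sub { trim $_->text })
    ->each;

print "$_\n" for @taglines;
```

Because the selector matches on the class rather than on the surrounding markup, attribute order and extra whitespace inside the tags should not affect the result.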

Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by kcott (Archbishop) on Jan 12, 2021 at 07:35 UTC

    G'day SergioQ,

    I feel I should start by echoing what others have said about not using a regex for parsing this type of data.

    You haven't shown all data (which was probably a good move if it's huge). However, I'd suspect you have something like "<div class="soda"></div>" earlier in the data; that would explain why $1 is a zero-length string (assuming that's what you meant by "is nothing").
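A small sketch of that suspicion, using made-up sample data: the empty "soda" div at the top is hypothetical and is not known to exist in the real page. It only shows that the original pattern would then report a match while capturing nothing.

```perl
use strict;
use warnings;

# Hypothetical sample: an empty "soda" div placed ahead of the real taglines.
my $html = <<'HTML';
<div class="soda"></div>
<div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
HTML

# Same pattern as in the question; the first (leftmost) match wins.
my $matched = $html =~ /<div class="soda.*?>(.*?)<\/div>/ism;
my $grp     = $matched ? $1 : undef;

printf "matched: %s, length of \$1: %d\n",
    ($matched ? 'yes' : 'no'), length($grp // '');
```

Under these assumptions the script reports a successful match with a capture of length 0, which is the same symptom as in the question.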

    I recommend that you use Regexp::Debugger to see exactly what is being matched.

    As a general rule, peppering a regex with .* or .*? is a bad move: it will often produce unexpected, or at least unanticipated, results.

    If you want to match all characters up to, and including, some terminal character, then match all the characters that aren't the terminal character followed by the terminal character. For example:

    $ perl -E '
        my $x = qq{ <div class="soda odd">\n Power. Grace. Wisdom. Wonder.\n </div>};
        say $x;
        $x =~ m{<div class="soda[^>]+>\s*(.*?)\s*</div>}ms;
        say "|$1|";
    '
     <div class="soda odd">
     Power. Grace. Wisdom. Wonder.
     </div>
    |Power. Grace. Wisdom. Wonder.|

    Again, I am not advocating using a regex to parse this type of data. Furthermore, if you do have "<div class="soda"></div>" earlier in the data, $1 will still be a zero-length string.

    — Ken

Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by jcb (Parson) on Jan 12, 2021 at 05:00 UTC

    This would probably be a good application for HTML::Parser or any of the DOM-building modules that I am sure other monks will hasten to recommend.

    However, your problem is probably that the "stretchy" groups in your pattern are not matching as you intend. I suggest (untested) m!<div class="soda[^"]*">(.*?)</div>! instead. The important difference is that this alternative constrains the initial "discard" match to not include double quotes, and therefore not to run past the opening div tag. Also note the use of ! as delimiter to avoid "leaning toothpick syndrome" in this version.

    If you are trying to catch multiple items from a single large input block, I suggest (also untested):

    while (m!<div class="soda[^"]*">(.*?)</div>!g) {
        say "matched!";
        my $grp = $1;
        say $grp;
    }

    If the text you want does not contain additional HTML, you could also replace (.*?) with ([^<]*). Generally, more constrained search patterns like these will also perform better because they will need backtracking less often.
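Putting those suggestions together, a minimal sketch might look like this. It assumes the taglines contain no nested HTML, runs against an inline copy of the sample data, and adds \s* on both sides of the group to trim whitespace; that trimming and the @taglines array are additions for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;

# Inline sample data taken from the posted extract.
my $html = <<'HTML';
<div class="soda odd"> Power. Grace. Wisdom. Wonder. </div>
<div class="soda even"> Wonder. Power. Courage. </div>
<div class="soda odd"> The future of justice begins with her </div>
HTML

my @taglines;

# [^"]* cannot run past the closing quote of the class attribute, and
# [^<]*? cannot run past the next tag, so each match stays inside one div.
while ($html =~ m!<div class="soda[^"]*">\s*([^<]*?)\s*</div>!g) {
    push @taglines, $1;
}

print "$_\n" for @taglines;
```

The /g modifier in scalar context resumes after the previous match on each iteration, so the loop visits every matching div in turn.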

    If the text you want can contain additional HTML, use HTML::Parser; it will work far better.

      Second this. Projects like this always expand to need to consider more things, and an event-driven parser is therefore always the "future-proof" strategy.
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by davido (Cardinal) on Jan 12, 2021 at 18:50 UTC

    Whoever gave you the impression that regular expressions are the tool of choice for parsing HTML documents did you a tremendous disservice. You should use a proper HTML / DOM parsing class. If your intent is to learn to wield regular expressions, sure, have some fun with it. If your intent is to get the job done, use the tool that is designed for the job. Mojo::DOM is my preferred solution for this sort of problem. But there are many others on CPAN too.


    Dave

Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by BillKSmith (Monsignor) on Jan 12, 2021 at 18:23 UTC
    I cannot duplicate your error. Are you using 'strict'? If not, replacing a numeric '1' with a lower case 'L' in "$1" would cause this result. I mention this because I misread the last character in "$tagl".
    Bill
