Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by GrandFather (Saint) on Jan 12, 2021 at 05:10 UTC
use strict;
use warnings;
my $tagl = do {local $/; <DATA>};
if($tagl =~ /<div class=\"soda.*?>(.*?)<\/div>/ism) {
my $grp = $1;
print "'$grp'\n";
}
__DATA__
<div class="soda odd">
Power. Grace. Wisdom. Wonder.
</div>
<div class="soda even">
Wonder. Power. Courage.
</div>
<div class="soda odd">
The future of justice begins with her
</div>
This prints:
'
Power. Grace. Wisdom. Wonder.
'
Maybe what you are trying to match or the regex you are using isn't what you have posted?
Note that parsing HTML/XML using regexen is generally a really bad idea. You've been told this several times already. In case you've forgotten why, you might like a refresher: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Note that parsing HTML/XML using regexen is generally a really bad idea.
The reason that it often works (for some definition of "works") is that few dynamic sites actually build and serialize a DOM tree, instead simply inserting details into (textual) templates. Regexen can match the parts of the output that come from the template, thereby selecting the insertions and extracting the desired information.
The resulting parsers tend to be somewhat fragile, as any change to the template can invalidate the "islands" on which the regex-based scraper relies. They can still be suitable for tools that are needed quickly and for the short term, or where the inconvenience of adapting the tool when the site changes is acceptable. The upside is that regex-based parsers are relatively easy to write from inspecting the HTML page source, without requiring knowledge of DOM structure and handling, giving them a lower "barrier to entry" for programmers unfamiliar with SGML/XML/DOM concepts.
In your original node you have posted an extract from the data, not the actual data. You also posted an extract from the code. There could be many things happening in the data you have not shown and/or the code you have not shown. Trim your data to a representative sample that still exhibits the problem for you, insert it into the smallest complete code and then post that here. See also SSCCE and How to ask better questions using Test::More and sample data. You may find in preparing these that you solve the problem - this is often a handy by-product of the exercise. :-)
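As a rough illustration of that advice, a self-contained test might look like the sketch below. The sample data and the pattern are assumptions made for the example (they are not the original poster's actual page or code); the point is only that data, code and expected result all live in one small runnable file.

```perl
use strict;
use warnings;
use Test::More;

# Representative sample data, trimmed to the smallest piece that matters.
my $html = <<'HTML';
<div class="soda odd">
Power. Grace. Wisdom. Wonder.
</div>
HTML

# In list context the match returns the captured group directly.
my ($got) = $html =~ m{<div class="soda[^"]*">\s*(.*?)\s*</div>}s;

is $got, 'Power. Grace. Wisdom. Wonder.', 'first tagline extracted';

done_testing;
```

If the test fails against your real data, shrinking the data until it passes usually points straight at the part of the input that the pattern does not expect.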
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by haukex (Archbishop) on Jan 12, 2021 at 10:31 UTC
use warnings;
use strict;
use Mojo::DOM;
my $dom = Mojo::DOM->new(<<'HTML');
<div id="taglines_content" class="header">
<div class="header">
<div class="nav">
<div class="desc">Showing all 3 taglines</div>
</div>
</div>
<div class="soda odd">Power. Grace. Wisdom. Wonder.</div>
<div class="soda even">Wonder. Power. Courage.</div>
<div class="soda odd">The future of justice begins with her</div>
</div>
HTML
$dom->find('.soda')->each(sub { print "$_\n" });
__END__
<div class="soda odd">Power. Grace. Wisdom. Wonder.</div>
<div class="soda even">Wonder. Power. Courage.</div>
<div class="soda odd">The future of justice begins with her</div>
Glad to hear it! I noticed that in your code you have my $tagl = $resp->decoded_content;, so I assume you're using an HTTP client to get the HTML. Note that Mojolicious includes Mojo::UserAgent, which has direct integration with Mojo::DOM - I showed an example here.
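A minimal sketch of that integration, assuming the taglines are the text of the elements with class "soda" as in the sample above. The helper name taglines and the idea of passing the URL on the command line are made up for this example; the Mojo::UserAgent and Mojo::DOM calls are the documented ones.

```perl
use strict;
use warnings;
use Mojo::UserAgent;

# Hypothetical helper: return the trimmed text of every element
# that carries the "soda" class, as an array reference.
sub taglines {
    my ($dom) = @_;
    return $dom->find('.soda')
               ->map('text')
               ->map(sub { s/^\s+|\s+$//gr })
               ->to_array;
}

# Assumed usage: the page URL is given as the first command-line argument.
if (my $url = shift @ARGV) {
    my $res = Mojo::UserAgent->new->get($url)->result;   # dies on connection errors
    die 'Request failed: ', $res->message, "\n" unless $res->is_success;

    # $res->dom parses the response body into a Mojo::DOM object.
    print "$_\n" for @{ taglines($res->dom) };
}
```

Keeping the extraction in a function that takes a DOM object means it can be exercised with a small inline HTML string, without any network access.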
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by kcott (Archbishop) on Jan 12, 2021 at 07:35 UTC
G'day SergioQ,
I feel I should start by echoing what others have said about not using a regex for parsing this type of data.
You haven't shown all data (which was probably a good move if it's huge).
However, I'd suspect you have something like "<div class="soda"></div>" earlier in the data;
that would explain why $1 is a zero-length string (assuming that's what you meant by "is nothing").
I recommend that you use Regexp::Debugger
to see exactly what is being matched.
As a general rule, peppering a regex with .* or .*? is a bad move:
it will often produce unexpected, or at least unanticipated, results.
If you want to match all characters up to, and including, some terminal character,
then match all the characters that aren't the terminal character followed by the terminal character.
For example:
$ perl -E '
my $x = qq{ <div class="soda odd">\n Power. Grace. Wisdom. Wonder.\n </div>};
say $x;
$x =~ m{<div class="soda[^>]+>\s*(.*?)\s*</div>}ms;
say "|$1|";
'
<div class="soda odd">
Power. Grace. Wisdom. Wonder.
</div>
|Power. Grace. Wisdom. Wonder.|
Again, I am not advocating using a regex to parse this type of data.
Furthermore, if you do have "<div class="soda"></div>" earlier in the data,
$1 will still be a zero-length string.
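To make that last point concrete, here is a small sketch using the same pattern as above against made-up data that starts with an empty "soda" div. The data is an assumption for illustration, not the original poster's input.

```perl
use strict;
use warnings;

# Assumed data: an empty "soda" div precedes the one we actually want.
my $data = <<'HTML';
<div class="soda"></div>
<div class="soda odd">
Power. Grace. Wisdom. Wonder.
</div>
HTML

my $re = qr{<div class="soda[^>]+>\s*(.*?)\s*</div>}s;

# A single match stops at the first div, so the capture is empty.
my ($first) = $data =~ $re;
printf "first match has length %d\n", length $first;

# With /g in list context every capture is returned; drop the empty ones.
my @non_empty = grep { length } $data =~ /$re/g;
print "|$_|\n" for @non_empty;
```

The first line of output reports a length of 0, and the only non-empty capture is the tagline from the second div.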
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by jcb (Parson) on Jan 12, 2021 at 05:00 UTC
This would probably be a good application for HTML::Parser or any of the DOM-building modules that I am sure other monks will hasten to recommend.
However, your problem is probably that the "stretchy" groups in your pattern are not matching as you intend. I suggest (untested) m!<div class="soda[^"]*">(.*?)</div>! instead. The important difference is that this alternative constrains the initial "discard" match to not include double quotes, and therefore not to run past the opening div tag. Also note the use of ! as delimiter to avoid "leaning toothpick syndrome" in this version.
If you are trying to catch multiple items from a single large input block, I suggest (also untested):
while (m!<div class="soda[^"]*">(.*?)</div>!g) {
say "matched!";
my $grp = $1;
say $grp;
}
If the text you want does not contain additional HTML, you could also replace (.*?) with ([^<]*). Generally, more constrained search patterns like these will also perform better because they will need backtracking less often.
If the text you want can contain additional HTML, use HTML::Parser; it will work far better.
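For reference, a runnable variant of the loop above might look like the following sketch. It assumes the wanted text contains no further HTML, so it uses the ([^<]*) form (made non-greedy so that surrounding whitespace can be trimmed by the \s* on either side); unlike ".", a negated character class also matches newlines, so no /s modifier is needed. The sample data is invented for the example.

```perl
use strict;
use warnings;
use feature 'say';

# Assumed sample input with the tagline text on its own line.
my $html = <<'HTML';
<div class="soda odd">
Power. Grace. Wisdom. Wonder.
</div>
<div class="soda even">
Wonder. Power. Courage.
</div>
HTML

my @taglines;

# /g in scalar context resumes after the previous match on each iteration.
while ($html =~ m!<div class="soda[^"]*">\s*([^<]*?)\s*</div>!g) {
    push @taglines, $1;
}

say for @taglines;
```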
Second this. Projects like this always expand to need to consider more things, and an event-driven parser is therefore always the "future-proof" strategy.
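A sketch of what the event-driven approach could look like with HTML::Parser (version 3 API), assuming the goal is the text inside every div whose class list includes "soda". The sample markup, the depth counter and the variable names are illustrative choices, not anything taken from the original question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumed sample input; note the inline markup inside the first div.
my $html = <<'HTML';
<div class="soda odd">Power. <b>Grace.</b> Wisdom.</div>
<div class="nav">ignored</div>
<div class="soda even">Wonder. Power. Courage.</div>
HTML

my @taglines;
my $depth = 0;    # > 0 while we are inside a "soda" div

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'div';
        if ($depth) { $depth++; return }          # nested div inside a tagline
        my @classes = split ' ', $attr->{class} // '';
        if (grep { $_ eq 'soda' } @classes) {
            $depth = 1;
            push @taglines, '';                   # start collecting a new tagline
        }
    }, 'tagname, attr' ],
    text_h => [ sub { $taglines[-1] .= $_[0] if $depth }, 'dtext' ],
    end_h  => [ sub { $depth-- if $depth && $_[0] eq 'div' }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

s/^\s+|\s+$//g for @taglines;    # trim surrounding whitespace
print "$_\n" for @taglines;
```

Because the handlers only react to tag and text events, inline markup such as the b element above is handled without any change to the extraction logic.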
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by davido (Cardinal) on Jan 12, 2021 at 18:50 UTC
Whoever gave you the impression that regular expressions are the tool of choice for parsing HTML documents did you a tremendous disservice. You should use a proper HTML / DOM parsing class. If your intent is to learn to wield regular expressions, sure, have some fun with it. If your intent is to get the job done, use the tool that is designed for the job. Mojo::DOM is my preferred solution for this sort of problem. But there are many others on CPAN too.
Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
by BillKSmith (Monsignor) on Jan 12, 2021 at 18:23 UTC
I cannot duplicate your error. Are you using 'strict'? If not, replacing a numeric '1' with a lower case 'L' in "$1" would cause this result. I mention this because I misread the last character in "$tagl".
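A small sketch of the difference, using string eval so that the same mistyped snippet can be compiled once without and once with strict. The snippet is invented for the demonstration and is not the original poster's code.

```perl
use strict;
use warnings;

# The snippet mistypes $1 (digit one) as $l (lowercase L).
my $snippet = q{ "abc" =~ /(b)/; my $grp = $l; defined $grp ? $grp : '' };

my %result;
for my $pragma ('no strict', 'use strict') {
    # Warnings are silenced inside the eval to keep the output readable.
    my $value = eval "$pragma; no warnings; $snippet";
    $result{$pragma} = defined $value ? "ran, got '$value'" : 'refused to compile';
    print "$pragma: $result{$pragma}\n";
}
```

Without strict the typo silently yields an empty value, which looks just like a capture group that matched nothing; with strict the code does not compile at all and $@ names the undeclared variable.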