PerlMonks  

Re: Matching regular expression over multiple lines

by haukex (Archbishop)
on Oct 15, 2017 at 09:53 UTC ( [id://1201393]=note )


in reply to Matching regular expression over multiple lines

Welcome to Perl and the Monastery! The best thing to do is not use regexes to try to parse HTML. Here, I'm using Mojo::DOM, which is modern and fairly easy to use:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Mojo::DOM;

    my $filename = 'C:/Users/li/data_collection/posts/165644996453.html';
    # slurp the whole file into memory
    open my $fh, '<', $filename or die $!;
    my $html = do { local $/; <$fh> };
    close $fh;
    my $dom = Mojo::DOM->new($html);
    my $text = $dom->find('footer')->last->previous->text;
    print $text,"\n"; # prints "indeed I am"

The problem you're probably having with the regex in your solution is that while (<FILE>) reads only one line at a time, so the regex never sees more than a single line. To match over multiple lines, you need to read multiple lines (or the whole file) into memory first.
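To illustrate the difference, here is a minimal sketch. The HTML snippet and the in-memory filehandles are stand-ins made up so the example runs without the original file, and the pattern is a simplified one rather than the regex from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real HTML file.
my $html = "<p>indeed I am</p>\n<footer>tags</footer>\n";

# A pattern that has to see both lines at once in order to match.
my $re = qr{<p>(.*?)</p>\s+<footer>};

# Reading line by line: each line is tested on its own.
open my $by_line, '<', \$html or die $!;
my $line_hits = 0;
while ( my $line = <$by_line> ) {
    $line_hits++ if $line =~ $re;
}
close $by_line;

# Slurping: the whole content is one string, so the pattern can span the newline.
open my $all, '<', \$html or die $!;
my $content = do { local $/; <$all> };
close $all;
my ($slurp_match) = $content =~ $re;

print "line-by-line matches: $line_hits\n";
print "slurped match: ", ( defined $slurp_match ? $slurp_match : '(none)' ), "\n";
```

The line-by-line loop never matches because `</p>` and `<footer>` are never in the same string, while the slurped string contains both.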

Update: Just to make clear what's going on in that my $text line: $dom->find('footer') returns a collection of <footer> elements (probably only one?), ->last picks the last one of those, ->previous goes back to the preceding sibling element, which is the <p>, and ->text gets the text content of that element.
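The same chain can be written one step at a time, which makes it easier to inspect each intermediate value. This is only a sketch: the markup below is made up, since the structure of the real file is an assumption, and Mojo::DOM has to be installed (it ships with Mojolicious, not with core Perl).

```perl
use strict;
use warnings;
use Mojo::DOM;    # part of the Mojolicious distribution, not core Perl

# Hypothetical markup; the real file's structure is an assumption.
my $dom = Mojo::DOM->new(
    "<article><p>first</p><p>indeed I am</p>\n<footer>tags</footer></article>"
);

my $footers = $dom->find('footer');    # Mojo::Collection of matching elements
my $footer  = $footers->last;          # last <footer> in the document
my $para    = $footer->previous;       # preceding sibling element, here a <p>
my $text    = $para->text;             # text directly inside that <p>

print $footers->size, " footer(s) found\n";
print $para->tag, ": $text\n";
```

Note that ->previous skips the whitespace text node between </p> and <footer>, because it only considers sibling elements.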

Replies are listed 'Best First'.
Re^2: Matching regular expression over multiple lines
by Maire (Scribe) on Oct 16, 2017 at 05:51 UTC
    Thank you for the welcome and the very clear explanation! This worked brilliantly for me (and will probably save me a lot of time in the future).

    Just out of curiosity, I went back and tried to solve the original problem with the regex after your tip about the "while" element only reading one line. You were absolutely right, and I should have been writing the following:

    open( FILE, "C:/Users/li/data_collection/posts/165644996453.html" ) || die "couldn't open\n";
    while ( <FILE> ) {
        $data .= $_;
    }
    if ( $data =~ m/(?<=<p>)(.*)(?=<\/p>\s+<footer>)/g ) {
        print "$1\n";
    }
    (code taken from dsb's answer in Re: Apply regex to entire file, not just individual lines ?).

    Thanks again!
      while ( <FILE> ) { $data .= $_; }

      That'll work, but it's not particularly efficient because it chops the file up line by line and then puts it back together. You could instead use the same "slurp" idiom I showed (do { local $/; <$fh> }), which reads the entire file in one go and is more efficient.
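As a rough sketch of how that could look when combined with the regex above: slurp_file is a hypothetical helper name, and the temporary file with made-up content only exists so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "couldn't open $path: $!";
    local $/;    # undefine the input record separator within this sub
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Made-up sample file standing in for the real HTML post.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "<p>indeed I am</p>\n<footer>tags</footer>\n";
close $tmp_fh;

my $data = slurp_file($tmp_name);

# Same pattern as in the post above, applied to the slurped string.
my ($para) = $data =~ m{(?<=<p>)(.*)(?=</p>\s+<footer>)};
print "$para\n" if defined $para;
```

Using a lexical filehandle and three-argument open, as here, also avoids the global FILE handle.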

      an alternative to using a regex [quoted from here]

      I just wrote about this in general here: Parsing HTML/XML with Regular Expressions

        Ah, that makes more sense! Thanks a lot.
Re^2: Matching regular expression over multiple lines
by holli (Abbot) on Oct 15, 2017 at 10:03 UTC
    I now predict a 5 levels deep nested discussion about which HTML parsing module is better/faster/more compliant. With at least 20 nodes, at least one of which will contain a benchmark and another one criticizing said benchmark.

    That's what I love about this site :-D


    holli

    You can lead your users to water, but alas, you cannot drown them.
