PerlMonks  

Re: Regexp matching on a multiline file: dealing with line breaks

by Athanasius (Archbishop)
on Dec 06, 2015 at 04:48 UTC


in reply to Regexp matching on a multiline file: dealing with line breaks

Hello BlueStarry, and welcome to the Monastery!

If the entire file will fit in memory, a variation on kennethk’s solution is to simply delete the newlines before searching:

#! perl
use strict;
use warnings;

my $target = 'kitten';
my $string = do { local $/; <DATA>; };
$string =~ s/\n//g;
my $count = () = $string =~ /\Q$target/g;
print "The target string '$target' occurs $count times in the file\n";

__DATA__
sushikitten
ilovethekit
tensushithe
kittenisthe

Output:

14:28 >perl 1474_SoPW.pl
The target string 'kitten' occurs 3 times in the file

14:28 >

However, as your input file is 5 GB, this approach is probably impractical. In that case you’re going to have to bite the bullet and implement a solution with “strange buffers”, such as a sliding window technique. Maybe have a look at Data::Iterator::SlidingWindow.
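
To make the sliding window idea a bit more concrete, here is a minimal sketch, not a tuned implementation. It reads the input in fixed-size chunks, strips the newlines from each chunk, and carries the last length($target) - 1 characters over to the next chunk, so a match that straddles a chunk boundary is still counted exactly once. The sub name count_across_lines, the 1 MiB default chunk size, and the in-memory sample data are my own assumptions for illustration, and the sketch does not use Data::Iterator::SlidingWindow. It also assumes the target cannot overlap itself, which holds for 'kitten'.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (not from any module): count occurrences of $target
# in the stream behind $fh, ignoring newlines, one chunk at a time.
sub count_across_lines {
    my ($fh, $target, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;             # 1 MiB default, an arbitrary choice
    my $keep  = length($target) - 1;     # longest tail that cannot hold a full match
    my $carry = '';
    my $count = 0;

    while (read($fh, my $chunk, $chunk_size)) {
        $chunk =~ tr/\n//d;              # delete newlines, as in the in-memory version
        my $buffer = $carry . $chunk;
        my $found  = () = $buffer =~ /\Q$target/g;
        $count += $found;

        # Keep only the tail that could start a match completed by the next chunk.
        $carry = $keep == 0              ? ''
               : length($buffer) > $keep ? substr($buffer, -$keep)
               :                           $buffer;
    }
    return $count;
}

# Sample data from the original post, read through an in-memory filehandle.
# The tiny chunk size of 4 is chosen only to force matches across boundaries.
my $data = "sushikitten\nilovethekit\ntensushithe\nkittenisthe\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_across_lines($fh, 'kitten', 4);
close $fh;

print "The target string 'kitten' occurs $count times in the file\n";
```

For a real 5 GB file you would open the file by name instead of the in-memory string and leave the chunk size at something large. Memory use stays at roughly one chunk plus the carried tail.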

Hope that helps,

Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,

Replies are listed 'Best First'.
Re^2: Regexp matching on a multiline file: dealing with line breaks
by Anonymous Monk on Dec 06, 2015 at 13:17 UTC
    Hi, can you please elaborate on this statement: my $string = do { local $/; <DATA>; };

      We want to read the whole file (in this case, the contents of the __DATA__ section at the end of the script) into the scalar variable $string. By default, reading from <DATA> in scalar context returns only the next line from the filehandle.

      So to read the whole file at once, we need to tell Perl that a “line” is the whole file. In Perl, the special variable $/ (also called $INPUT_RECORD_SEPARATOR or $RS under use English) specifies what terminates a “line,” and undef is a special value which means “read the whole file at once.” See perlvar#Variables-related-to-filehandles.

      Since $/ is a global variable, changing its value can have far-reaching consequences across a large program. It’s therefore good practice to localize any change to just the part of the code where it’s required. Hence the idiom of localizing the variable with local, which with no value assigned also sets it to undef, and limiting the scope of that change by enclosing it in a block. We could say:

      my $string;
      {
          local $/;
          $string = <DATA>;
      }

      but wrapping it up in a do block is neater and more concise.
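
To see the scoping at work, here is a small sketch that reads one line normally, slurps the remainder inside a do block, and then checks that $/ is back to its default afterwards. The sample text and the variable names are made up for illustration, and an in-memory filehandle stands in for DATA so the snippet is self-contained.

```perl
#! perl
use strict;
use warnings;

my $text = "first\nsecond\nthird\n";        # made-up sample data

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $line = <$fh>;                            # $/ is "\n" here: reads one line

my $rest = do { local $/; <$fh> };           # $/ is undef here: reads all that is left

# Once the do block has ended, local's change has been undone.
my $sep_restored = defined $/ && $/ eq "\n";

close $fh;

print "line: $line";
print "rest: $rest";
print "separator restored\n" if $sep_restored;
```

The do block returns the value of its last statement, which is why the slurped text ends up in $rest without a separate declaration.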

      Hope that helps,

      Athanasius <°(((>< contra mundum
      Iustus alius egestas vitae, eros Piratica,
