use regular expressions across multiple lines from a very large input file

by rizzy (Sexton)
on Dec 05, 2010 at 17:25 UTC
rizzy has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing millions of text files, most of which are relatively small, but some of which cause an "Out of memory!" error when using slurp, due to their size. I have been using slurp because I want to save about 200 characters before and after a keyword phrase, and both the surrounding text and the phrase itself may include newlines. It wasn't clear to me how to do this with line-by-line processing. Here's an example:

input.txt file:

Here is my text file
I want to save a bunch of
charcaters before the keywords
for example the keywords might be
the phrase: these are my keywords
I want to save a bunch of characters
after the keywords too so I have
context

The keywords may appear multiple
times in any given file and may
span across lines like so: these are
my keywords. This is one reason
I was using slurp instead of reading
in line by line
I have been slurping the file to a string and using regular expressions to find a fixed number of characters (in this example 30) before and after like so:
#!C:/Perl/bin -w
use File::Slurp;

my $filetext = read_file("input.txt");
while ( $filetext =~ m{(.{30}(these\s+are\s+my\s+keywords).{30})}gis ) {
    print "$1\n";
}
This will spit out something like this:
keywords might be the phrase: these are my keywords I want to save a bunch of cha
ay span across lines like so: these are my keywords. This is one reason I was us
Is there a more efficient way to do this (i.e., save 200 characters before and after a key phrase) than reading the entire file into memory at once? It seems like reading it in line by line will not let me easily pull characters from before and after newlines. A workaround I've been considering is to check the file size and skip the large files, which I would process separately, but I imagine there is a better way.

Re: use regular expressions across multiple lines from a very large input file
by BrowserUk (Pope) on Dec 05, 2010 at 18:19 UTC

    You need a sliding buffer--a supersearch for that term will turn up various implementations.

    Here's a simple one implemented using an array of lines:

    #! perl -slw
    use strict;

    my @lines;
    my %seen;

    while( <DATA> ) {
        push @lines, $_;
        my $buf = join '', @lines;
        if( $buf =~ /(.{30}these\s+are\s+my\s+keywords.{30})/sm ) {
            print "'$1'" unless $seen{ $1 };
            ++$seen{ $1 };
        }
        shift @lines if @lines > 5;
    }

    __END__
    Here is my text file
    I want to save a bunch of
    charcaters before the keywords
    for example the keywords might be
    the phrase: these are my keywords
    I want to save a bunch of characters
    after the keywords too so I have
    context

    The keywords may appear multiple
    times in any given file and may
    span across lines like so: these are
    my keywords. This is one reason
    I was using slurp instead of reading
    in line by line

    Which produces:

    C:\test>junk
    'eywords might be the phrase: these are my keywords I want to save a bunch of ch'
    'y span across lines like so: these are my keywords. This is one reason I was u'

    You would probably want to make the context at either end optional so you don't miss matches at the start or end of the file where there may not be enough context to match.
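
    For instance, the test above might be relaxed like this (same idea, just with the 30-character windows allowed to be shorter):

    if( $buf =~ /(.{0,30}these\s+are\s+my\s+keywords.{0,30})/sm ) {
        print "'$1'" unless $seen{ $1 };
        ++$seen{ $1 };
    }

    Since the buffer grows a line at a time, a hit near the end of the buffer can now be reported before its full trailing context has arrived, so the %seen filter (or a final sweep once the input is exhausted) matters more with this variant.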


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks. I'll look into sliding buffers. This may solve another problem I'm having with memory leaks each time I slurp.
Re: use regular expressions across multiple lines from a very large input file
by LanX (Abbot) on Dec 05, 2010 at 18:29 UTC
    Hi

    I will only sketch an algorithm and leave the programming to you.

    I think you should read and process text chunks of size n, e.g. 1024 or 4096 bytes.

    Whenever you process one chunk, you need to append the first m bytes of the next chunk, with m = 200 + l, where l is one less than the length of your keyword string, i.e. l = 20 for "these are my keywords", so m = 220.

    That way your regex will match every occurrence in which at least the first character of the keyword string still lies within the chunk.

    Of course you need to normalize the chunks and the keywords by replacing whitespace with s/\s+/ /g.

    If your regex is too complicated to be normalized, you can still do it by joining two reasonably big (!) successive chunks, but then you need either to remember the match position to exclude duplicate hits or to restrict the regex so that matches can only start within the first chunk (e.g. by checking pos).

    Cheers Rolf

    1) Now you could even use index instead of a regex.

    2) Here efficiency depends on the block size of your filesystem; see seek for how to read chunks.

    3) A chunk must be bigger than the longest possible match. Quantifiers like \s+ allow potentially unbounded matches. Are they really wanted? Either put a reasonable limit on them, like \s{1,20}, or normalize your chunks by replacing whitespace with s/\s+/ /g.
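
    Putting these pieces together, a rough sketch of the chunked approach might look like the one below. The file name, chunk size and literal keyword are placeholders; the whitespace normalization from 3) is applied, and the match position is tracked so that hits carried over in the overlap are not reported twice.

    use strict;
    use warnings;

    my $keyword = 'these are my keywords';
    my ( $before, $after ) = ( 200, 200 );

    open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";

    my $buf      = '';
    my $from_pos = 0;              # offset in $buf where unreported text begins

    while ( read $fh, my $chunk, 64 * 1024 ) {
        $buf .= $chunk;
        $buf =~ s/\s+/ /g;         # normalize whitespace so the keyword is a literal match

        # A keyword starting at or after $limit may not yet have its full
        # trailing context in the buffer, so leave it for the next pass.
        my $limit = eof( $fh )
                  ? length $buf
                  : length( $buf ) - ( length( $keyword ) - 1 + $after );
        $limit = 0 if $limit < 0;

        pos( $buf ) = $from_pos;
        while ( $buf =~ /\Q$keyword\E/g ) {
            my $start = $-[0];
            last if $start >= $limit;       # defer matches near the chunk boundary
            my $from = $start - $before;
            $from = 0 if $from < 0;
            print substr( $buf, $from,
                          ( $start - $from ) + length( $keyword ) + $after ), "\n";
        }

        # Keep only what the next pass still needs: the unreported tail plus
        # enough preceding characters to supply the "before" context.
        my $keep_from = $limit - $before;
        $keep_from = 0 if $keep_from < 0;
        $buf       = substr $buf, $keep_from;
        $from_pos  = $limit - $keep_from;
    }
    close $fh;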

        Yes, more or less.

        As far as I can see, this example doesn't handle the maximal possible length of a match, which must be smaller than one block.

        Cheers Rolf

        Great. Thanks for the pointers.
      In order to speed up the search, I dare to suggest to choose a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        > In order to speed up the search, I dare to suggest to choose a large value of n, say a value slightly less than the amount that causes the "Out of Memory" error.

        I think you mean half that size.

        Cheers Rolf

        In order to speed up the search, I dare to suggest to choose a large value of n,

        Don't assume that the bigger the read, the faster it will run; it just doesn't work out that way.

        On my systems, 64kb reads work out marginally best (YMMV):

        C:\test>junk -B=4 < 1gb.dat
        Found 6559 matches in 10.778 seconds using 4 kb reads

        C:\test>junk -B=64 < 1gb.dat
        Found 6559 matches in 10.567 seconds using 64 kb reads

        C:\test>junk -B=256 < 1gb.dat
        Found 6559 matches in 10.574 seconds using 256 kb reads

        C:\test>junk -B=1024 < 1gb.dat
        Found 6559 matches in 10.938 seconds using 1024 kb reads

        C:\test>junk -B=4096 < 1gb.dat
        Found 6559 matches in 10.995 seconds using 4096 kb reads

        C:\test>junk -B=65536 < 1gb.dat
        Found 6559 matches in 12.533 seconds using 65536 kb reads

        Code:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        our $B //= 64;
        $/ = \( $B * 1024 );
        binmode STDIN, ':raw:perlio';

        my $start = time;
        my $count = 0;

        while( <STDIN> ) {
            ++$count while m[123]g;
        }

        printf "Found %d matches in %.3f seconds using %d kb reads\n",
            $count, time() - $start, $B;

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for the suggestion, Rolf.
Re: use regular expressions across multiple lines from a very large input file
by ambrus (Abbot) on Dec 06, 2010 at 11:18 UTC

    Did you try reading in paragraph mode ($/ = "")? That should work provided that you don't have very long paragraphs and that your search phrase can't be split across paragraphs.
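
    Roughly like this (the file name and the 200-character windows are placeholders; with $/ set to the empty string, each read returns one blank-line-separated paragraph):

    use strict;
    use warnings;

    $/ = "";                       # paragraph mode: records end at blank lines
    open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";
    while ( my $para = <$fh> ) {
        while ( $para =~ /(.{0,200}these\s+are\s+my\s+keywords.{0,200})/gs ) {
            print "$1\n";
        }
    }
    close $fh;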

      I initially thought paragraphs might be the way to go, but these things are all formatted differently and some include HTML.
Re: use regular expressions across multiple lines from a very large input file
by sundialsvc4 (Monsignor) on Dec 06, 2010 at 13:33 UTC

    Also don't neglect what existing command-line tools and scripting might be able to do for you. (Even Windows, with its PowerShell, is finally glomming on to this...)

    For example: grep -r regex filespec ... already does a very large part of what you are trying to do. If you could use it simply to grab the matching phrases and “enough of the surrounding real-estate,” you could then filter what grep has sent you to whittle it down into the final answer, using Perl or otherwise.
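
    For instance (assuming GNU grep and a placeholder directory; -C prints lines of context around each hit, though grep works line by line, so a phrase broken across a newline would still need the Perl pass):

    grep -r -C 3 "these are my keywords" ./textfiles > candidates.txt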

      Thanks. The problem is I have thousands of tarred/zipped folders of files which I need to unzip one at a time, parse, and then delete. I haven't been able to convince the Unix admin to let me store all of these on the server, so I'm using my own machine, which runs Windows.
