Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: Possible to have regexes act on file directly (not in memory)

by karlgoethebier (Priest)
on May 02, 2014 at 17:33 UTC ( #1084825=note: print w/ replies, xml ) Need Help??


in reply to Possible to have regexes act on file directly (not in memory)

"... split the file into chunks and test each chunk..."

If the OP got this data:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ips +um dolor sit amet.

...and splits it into two chunks:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd guber
gren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

...how should he ever match gubergren?

Perhaps i missed something. Anyway, best regards, Karl

«The Crux of the Biscuit is the Apostrophe»


Comment on Re: Possible to have regexes act on file directly (not in memory)
Select or Download Code
Re^2: Possible to have regexes act on file directly (not in memory)
by Laurent_R (Parson) on May 02, 2014 at 17:42 UTC
    You need to have a sliding windows on top of it: when reading a new chunk, you keep a number of bytes from the previous chunk at least as large as the maximal length of the pattern match. So you would process:
    Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd guber
    keep, for example, the lita kasd guber from this chunk and append the next chunk to get:
    lita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit + amet.
    And now you have your match on the second chunk.
      "You need to have a sliding window on top of it..."

      Yes sure. Anyway, i wonder how to implement this (with one terabyte of Lorem ipsum) ;-(

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        I have basically explained how to do it: keeping part of the previous chunk and appending the new chunk to it. The real difficulty is whether it is possible to determine the length of longest possible match for the regex (which determines how much to keep from one chunk to another). For some regexes, it is very easy, for others, it is very difficult or even impossible. The OP does not give enough information on that.

        Then, there is the question of the size of the input. On my server, processing a 10 GB (line-based) file with a relatively simple regex might take 5 to 10 minutes. It would probably be a bit faster if not line-based, reading chunks of say 1 MB. With a TB of data, it would take quite a bit of time, but that might still be relatively manageable. But that's assuming a simple regex with no need to backtrack. With a regex implying a lot of backtracking, it might very easily be completely unmanageable.

        The size of the data doesn't matter, it's the maximum size of the regex which counts. Suppose you have a maximum regex size of n, then your chunk size should be n (or greater). The algorithm is then:

        1. load 1 chunk into RAM
        2. load another chunk into RAM
        3. concatenate the 2 chunks for testing
        4. Discard the older chunk
        5. Goto 2

        HTH.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1084825]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (8)
As of 2014-12-18 04:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (41 votes), past polls