Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Possible to have regexes act on file directly (not in memory)

by Nocturnus (Sexton)
on May 02, 2014 at 16:50 UTC ( #1084809=perlquestion: print w/ replies, xml ) Need Help??
Nocturnus has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have some files which I would like to parse somehow using regular expressions. The files could be several gigabytes or even up to a terabyte in size, so loading the files to memory is not an option.

The files are not guaranteed to have line breaks, and it is not known in advance if the search pattern exists in a certain file, where it is and how often it occurs.

Hence the question: Is it possible to let a regular expression act directly on a file *without* reading any part of the file to memory?

Sorry if this has been answered already; of course, I have tried to do my homework, but probably used the wrong search keywords (I just found the usual posts or tutorials where people suggest to read the whole file into a scalar variable and let the regular expression act on that variable, or to split the file in chunks and test each chunk etc.).

Thank you very much for any thoughts,

Nocturnus

Comment on Possible to have regexes act on file directly (not in memory)
Re: Possible to have regexes act on file directly (not in memory)
by hippo (Curate) on May 02, 2014 at 17:06 UTC

    You already have one answer: split the file into chunks and test each chunk. Can you tell us why you have discarded that approach?

      How would you deal with regular expressions of variable length that cross 2 chunks? What about 3 chunks?
        Did anyone read this thread before posting?

        Or even tried following the links provided, leading to answers already given 13 years ago?

        To say it plain, this theoretical idea of regexes creating unlimited matches is bullshit academic masturbation.

        If the match is indeed too large to hold two chunks in memory (which have to be individually bigger than the maximal match), in what way do you expect to be able to process this match???

        Truth is it's far harder to construct problems which can't be solved with sliding windows, then just to solve the real world tasks.

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        update

        PS: If a little boy asks for a birthday cake bigger than any building on earth, do you really honestly start discussing where to find suitable candles?

      • expanded title
Re: Possible to have regexes act on file directly (not in memory)
by MidLifeXis (Prior) on May 02, 2014 at 17:11 UTC

    Is your regexp bounded in size, or is something along the line of .* acceptable as a regexp? The answer can change the approach used when considering backtracking.

    And just a small nit: you are going to have to read chunks of the file into memory to be able to act on them. At some point in time, the program has to access data from the file.

    --MidLifeXis

Re: Possible to have regexes act on file directly (not in memory)
by LanX (Canon) on May 02, 2014 at 17:12 UTC
    > Is it possible to let a regular expression act directly on a file without reading any part of the file to memory?

    no, at least if you don't hack a Perl using the filesystem instead of RAM and hence running many magnitudes slower.

    see also Re: Memory Leak when slurping files in a loop showing how a sliding window works.

    You'll need a sliding window at least twice as big as the maximal expected match, (which should never be a problem.)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Possible to have regexes act on file directly (not in memory)
by Tanktalus (Canon) on May 02, 2014 at 17:13 UTC

    I haven't tried this myself, but the first thought in my head is whether there's any way to get perl to mmap a file. Check CPAN for modules that allow you to mmap a file to memory - you will still read the file into memory in chunks, but it could be much more transparent.

    Of course, there's the other side of the coin: you need better structured data :)

      This is :mmap described in the Custom Layers section of PerlIO?

      Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        I have always been intimidated by perl's IO layer concept. The documentation is sparse at best. Perhaps it is time for another look.
        Bill

        From a cursory glance at the documentation, no. That's still treating that memory map as a filehandle. You need to treat it as a string or scalar. A quick check on cpan gives Sys::Mmap as a reasonable candidate based on its documentation.

Re: Possible to have regexes act on file directly (not in memory)
by karlgoethebier (Curate) on May 02, 2014 at 17:22 UTC
    "...up to a terabyte in size...not guaranteed to have line breaks..."

    Mmh, i really wonder what strange kind of data this is.

    Any example available?

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re: Possible to have regexes act on file directly (not in memory)
by karlgoethebier (Curate) on May 02, 2014 at 17:33 UTC
    "... split the file into chunks and test each chunk..."

    If the OP got this data:

    Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ips +um dolor sit amet.

    ...and splits it into two chunks:

    Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd guber
    gren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

    ...how should he ever match gubergren?

    Perhaps i missed something. Anyway, best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      You need to have a sliding windows on top of it: when reading a new chunk, you keep a number of bytes from the previous chunk at least as large as the maximal length of the pattern match. So you would process:
      Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonu +my eirmod tempor invidunt ut labore et dolore magna aliquyam erat, se +d diam voluptua. At vero eos et accusam et justo duo dolores et ea re +bum. Stet clita kasd guber
      keep, for example, the lita kasd guber from this chunk and append the next chunk to get:
      lita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit + amet.
      And now you have your match on the second chunk.
        "You need to have a sliding window on top of it..."

        Yes sure. Anyway, i wonder how to implement this (with one terabyte of Lorem ipsum) ;-(

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

Re: Possible to have regexes act on file directly (not in memory)
by Anonymous Monk on May 02, 2014 at 19:36 UTC
    Don't know what "parse a file" means but how about (Un*x command-line) ... egrep -e ...
Re: Possible to have regexes act on file directly (not in memory)
by davido (Archbishop) on May 02, 2014 at 19:53 UTC

    Does your input file have any concept of logical records? Or how about this: Would your regex have any literal anchor that can be known ahead of time?

    It could be that "newlines" are an infrequent occurrence. But maybe there is some other possible concrete point of reference that could be used as a record separator so that you're reading in smaller chunks that have some logical relationship to one another.

    If that turns out to not be the case, then I suppose your best solution will be the one others have mentioned; determine what the largest possible "match" could be, set your chunk size to that size, and read a starter chunk. Then read a second chunk, concatenate them, do a pattern match, discard the first chunk, read a third, concatenate, match, repeat.


    Dave

      If that turns out to not be the case, then I suppose your best solution will be the one others have mentioned; determine what the largest possible "match" could be, set your chunk size to that size, and read a starter chunk. Then read a second chunk, concatenate them, do a pattern match, discard the first chunk, read a third, concatenate, match, repeat.

      I agree with the general approach, but not with the details. There is no reason to choose a chunk size that is equal to the largest possible match, the chunk size can be much larger.

      Suppose the max length of a possible match is 10 characters (or bytes, or whatever). You certainly don't want to read your file by chunks of 10 characters. That would be fairly inefficient.

      Depending on your system, it might be more efficient to read chunks of, say, 1 MB. The only thing you need to do is to keep the last 10 characters of the previous chunk and to "prepend" it to the next chunk before proceeding. Or, in other words, to append the next MB of data to the last 10 characters of the previous chunk. And run your regex again on that.

Re: Possible to have regexes act on file directly (not in memory)
by karlgoethebier (Curate) on May 03, 2014 at 08:48 UTC

    I'm not the OP - but thank you very much to all for sharing their knowledge and the great feedback in this very interesting thread.

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Sexton) on May 03, 2014 at 09:42 UTC

    First of all, I would like to thank everybody! I am really overwhelmed by the number and the quality of the replies so far, and I am feeling great respect towards everybody who took the time.

    Having said this, I would like to make things clearer:

    First of all, I am interested in this problem at theoretical level as well as at practical level. My interest at theoretical level arises from the fact that I do not generate these files myself, i.e. I am not in control of how they are formatted or how they are structured logically. I just have to answer the question if there are certain patterns in these files, and the answer must be absolutely reliable.

    Thus, I first would need to know if the problem could be solved theoretically, ignoring runtime problems. This means that breaking the files into chunks is not an option because I am forbidden to assume anything regarding their contents or structure (except that they are text files which are encoded in UTF-8, and that there are some special char encoding rules). Notably, I must not assume that a possible match is limited to a certain length.

    Unfortunately, I am forbidden to give details about the actual files or how they are generated. However, I can give an example which has nothing to do with the actual situation, but imposes the same problem at theoretical level. Suppose you have a file which has the following structure:

    - Text data of arbitrary length and structure, not containing the characters < or >, followed by
    - Some starting tag which consists of the character <, followed by text data of arbitrary length which doesn't contain the characters < or >, followed by the character >, followed by
    - Text data of arbitrary length and structure, not containing the characters < or >, followed by
    - Some end tag which is structured like the start tag, followed by
    - A repetition of the four preceding items

    Suppose arbitrary length really could mean 100 kB or several hundreds GB, and suppose the job would be to answer the question if there are paired tags / end tags and to extract the inner content of these. Please note that I personally probably would solve such an easy case by making some mini-parser (state machine) without using regular expressions.

    But actually, the patterns which must be searched are much more complicated, and again, the question if this could be solved theoretically by letting a regex act directly on a file still is interesting and important to me, not caring about performance and runtime for now.

    Thus, I would give a special thanks to davido who proposed to use File::Map, and I would be grateful if some more experts could give their opinions regarding the pros and cons of this possible solution.

    Thank you very much again,

    Nocturnus

      ... I am forbidden to assume anything regarding their contents or structure ...

      This means you may not assume that in a file/string 1 TB in length, the length of a match is less than the length of the string. Or in code, writing something that could do this:

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = '<1e12 chars, none of which is a left or right angle bracket>'; ;; print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms; " match, captured '1e12 chars, none of which is a left or right angle br +acket'

      My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.

      However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.

      If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends. Good luck in your endeavor, and I would be interested to learn your ultimate experience.

      Update:   "... a DFA operates on a single character at a time without backtracking."   That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern  m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's  (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.

        Well, thank you very much for making me read some articles about the theory of FA :-).

        While the patterns I have to search for are more complex than in the example I have mentioned above, I think they are still easy enough to be implemented in form of a linearly scanning state machine, so I think that's the way I will go. If the patterns get more complex, I will extend or rewrite my parser, hoping that I don't end up having to write my own regex parser. Just kidding ...

      Thus, I first would need to know if the problem could be solved theoretically, ignoring runtime problems.

      Theoretically, it's quite simple. A regex is compiled to a state machine; you can read a chunk of the file, run the regex against it, and if there is no match, simply read the next chunk of the file, and feed it to the same state machine without resetting it first.

      The practical problem is that Perl doesn't expose an API that allows you to continue matching with the old state of the regex engine (as well as some optimizations that look at the length of the string, which you'd have to turn off).

      So one possible approach is to write your own regex compiler (or reuse one) and teach the matcher to do incremental matches.

Re: Possible to have regexes act on file directly (not in memory)
by RichardK (Priest) on May 03, 2014 at 12:54 UTC

    Is it possible to let a regular expression act directly on a file *without* reading any part of the file to memory?

    No, a cpu can only see data in memory, so to process a file stored on disk you have to read some of it into memory.

    However, you could write a streaming parser that reads the file one byte at a time, backtracking might be somewhat costly but it will depend on the sort of patterns you want to match. Have a look at streaming XML parsers for ideas how you might go about this.

      Besides that a CPU can only see data in memory, there are useful things like buffering layers etc. Of course, I am aware of that - sorry for not being precise enough.

      What I meant was if I could have regular expressions act directly on a file without having to explicitly load parts of that file to a variable in memory, or more precisely, without having to load parts that are dependent on the expected size of the match or such things. Such dependencies would mean that I generally would have to load the complete file to memory because I don't know anything about the size of possible matches in advance.

      Regarding the streaming XML parsers: Several years ago, I have tried some of them for another project. They all have been a fine example for the very same problem:

      They were streaming only in the sense that they read the source file line-by-line (but what if the source file does not contain any line breaks which is perfectly acceptable according to the XML standard?) or that they broke the file in chunks at syntactical markers, e.g. tags (but what if there were 100 GB text before the next marker?). I didn't see any streaming XML parser which didn't rely on such mechanisms. I admit that I have tested only a few of these parsers, so I might have missed the ultimate one.

Re: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Sexton) on May 04, 2014 at 08:44 UTC

    For anybody who is interested, my final decision is now to implement my own pattern scanner (based on a state machine) which operates directly on the source files without using Perl's regular expression engine. I am not yet sure if I should explicitly operate on chunks or if I should rely on the O/S's buffering layer (which should enable me to read single characters from a file without performance penalty, shouldn't it?).

    This approach also has the advantage that I could easily port the application to another programming language (compiler based, perhaps with assembly optimizations). Using an own pattern scanner will make this much more easier.

    After having read davido's proposition of using File::Map, I initially hoped that this would be the solution. In the meantime, I have studied the documentation for this module, and if I got it right, it will read the parts of the file which actually get used into memory. Although this will happen lazily (which is a fine technique), it disqualifies the solution for that specific problem: There are enough situations where I could end up with the complete file being in memory which is exactly the thing to be avoided.

    Thanks again to everybody who helped!

    Nocturnus

Re: Possible to have regexes act on file directly (not in memory)
by Laurent_R (Parson) on May 04, 2014 at 09:08 UTC
    OK, if I understand correctly, you can't make assumptions about the size of what the full regex would possibly capture. But can you at least make some assumptions about the size of the start tags and end tags? If you can do that, then reading the file by chunks is a perfectly workable solution.

    You basically need 2 patterns that will match the start and the end tags (or more patterns if there can be several sorts of start and end tags). Then you implement in your code a state machine or a mini-parser that looks for the start tag; when you've found one, you capture everything that comes from the file until you reach the end tag, and start allover again if this is what you need. For managing the chunk boundaries, you just need a sliding window (as already discussed) that is as large as the maximal length of the start or end tags.

    Edit: @ Nocturnus: because I spent some time reading the various comments, your last message just above was not on the page I was reading when I wrote this message (in other words, I loaded the page before you posted this last message). Therefore, my message is not an answer to your very last message, but rather to the previous ones.

      When mentioning the start tags and end tags, I was just giving an example to illustrate the problem at theoretical level. In this example, I wouldn't be allowed to make assumptions even on the size of the tags.

      In reality, anyways, the patterns are different from what I mentioned in the example. They are more complicated, but I will undoubtedly be able to write a state machine based mini-parser to solve the problem. This approach will make porting the respective application to other programming languages much easier; the disadvantage is that I will have to change the software if I am given other or further patterns to search for (right now, I don't have the time and interest for writing my own parser for some general search pattern language, so I will hard-code a search algorithm for each pattern into my mini-parser).

      Being able to run regexes directly on the source files would just have been a way which is by far more comfortable, faster (in the sense of "When is the software ready?") and general and which would allow for new search patterns without changing the software.

      Thank you very much,

      Nocturnus

        > They are more complicated, but I will undoubtedly be able to write a state machine based mini-parser to solve the problem.

        You are aware of pos, right?

        > In reality, anyways, the patterns are different from what I mentioned in the example.

        If you can't produce a "real" example its better to let this thread die now. Plz don't keep people speculating.

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        update

        see also XY Problem

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1084809]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (10)
As of 2014-08-27 11:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (237 votes), past polls