Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: Possible to have regexes act on file directly (not in memory)

by Nocturnus (Beadle)
on May 03, 2014 at 09:42 UTC ( [id://1084878]=note: print w/replies, xml ) Need Help??


in reply to Possible to have regexes act on file directly (not in memory)

First of all, I would like to thank everybody! I am really overwhelmed by the number and the quality of the replies so far, and I am feeling great respect towards everybody who took the time.

Having said this, I would like to make things clearer:

First of all, I am interested in this problem at theoretical level as well as at practical level. My interest at theoretical level arises from the fact that I do not generate these files myself, i.e. I am not in control of how they are formatted or how they are structured logically. I just have to answer the question if there are certain patterns in these files, and the answer must be absolutely reliable.

Thus, I first would need to know if the problem could be solved theoretically, ignoring runtime problems. This means that breaking the files into chunks is not an option because I am forbidden to assume anything regarding their contents or structure (except that they are text files which are encoded in UTF-8, and that there are some special char encoding rules). Notably, I must not assume that a possible match is limited to a certain length.

Unfortunately, I am forbidden to give details about the actual files or how they are generated. However, I can give an example which has nothing to do with the actual situation, but imposes the same problem at theoretical level. Suppose you have a file which has the following structure:

- Text data of arbitrary length and structure, not containing the characters < or >, followed by
- Some starting tag which consists of the character <, followed by text data of arbitrary length which doesn't contain the characters < or >, followed by the character >, followed by
- Text data of arbitrary length and structure, not containing the characters < or >, followed by
- Some end tag which is structured like the start tag, followed by
- A repetition of the four preceding items

Suppose arbitrary length really could mean 100 kB or several hundreds GB, and suppose the job would be to answer the question if there are paired tags / end tags and to extract the inner content of these. Please note that I personally probably would solve such an easy case by making some mini-parser (state machine) without using regular expressions.

But actually, the patterns which must be searched are much more complicated, and again, the question if this could be solved theoretically by letting a regex act directly on a file still is interesting and important to me, not caring about performance and runtime for now.

Thus, I would give a special thanks to davido who proposed to use File::Map, and I would be grateful if some more experts could give their opinions regarding the pros and cons of this possible solution.

Thank you very much again,

Nocturnus

Replies are listed 'Best First'.
Re^2: Possible to have regexes act on file directly (not in memory)
by AnomalousMonk (Archbishop) on May 03, 2014 at 11:20 UTC
    ... I am forbidden to assume anything regarding their contents or structure ...

    This means you may not assume that in a file/string 1 TB in length, the length of a match is less than the length of the string. Or in code, writing something that could do this:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $s = '<1e12 chars, none of which is a left or right angle bracket>'; ;; print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms; " match, captured '1e12 chars, none of which is a left or right angle br +acket'

    My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.

    However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.

    If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends. Good luck in your endeavor, and I would be interested to learn your ultimate experience.

    Update:   "... a DFA operates on a single character at a time without backtracking."   That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern  m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's  (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.

      Well, thank you very much for making me read some articles about the theory of FA :-).

      While the patterns I have to search for are more complex than in the example I have mentioned above, I think they are still easy enough to be implemented in form of a linearly scanning state machine, so I think that's the way I will go. If the patterns get more complex, I will extend or rewrite my parser, hoping that I don't end up having to write my own regex parser. Just kidding ...

Re^2: Possible to have regexes act on file directly (not in memory)
by moritz (Cardinal) on May 04, 2014 at 09:23 UTC
    Thus, I first would need to know if the problem could be solved theoretically, ignoring runtime problems.

    Theoretically, it's quite simple. A regex is compiled to a state machine; you can read a chunk of the file, run the regex against it, and if there is no match, simply read the next chunk of the file, and feed it to the same state machine without resetting it first.

    The practical problem is that Perl doesn't expose an API that allows you to continue matching with the old state of the regex engine (as well as some optimizations that look at the length of the string, which you'd have to turn off).

    So one possible approach is to write your own regex compiler (or reuse one) and teach the matcher to do incremental matches.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1084878]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2024-03-19 07:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found