PerlMonks  

Re: Possible to have regexes act on file directly (not in memory)

by RichardK (Parson)
on May 03, 2014 at 12:54 UTC


in reply to Possible to have regexes act on file directly (not in memory)

Is it possible to let a regular expression act directly on a file *without* reading any part of the file to memory?

No. A CPU can only operate on data in memory, so to process a file stored on disk you have to read at least some of it into memory.

However, you could write a streaming parser that reads the file one byte at a time. Backtracking might be somewhat costly, but that will depend on the sort of patterns you want to match. Have a look at streaming XML parsers for ideas on how you might go about this.
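To make the byte-at-a-time idea concrete, here is a minimal sketch in Python (chosen only for illustration; the same structure carries over to Perl). It assumes the pattern is simple enough to be turned into a state machine by hand, here the made-up example `ab+c`, so no backtracking is needed and memory use does not grow with the length of a match. The function name `stream_find_abplusc` is hypothetical, and this is not a general regex engine.

```python
import io

def stream_find_abplusc(stream):
    """Yield (start, end) byte offsets of each match of ab+c in a binary stream.

    State 0: nothing matched yet
    State 1: just saw 'a'
    State 2: saw 'a' followed by one or more 'b'
    """
    state = 0
    start = 0   # offset of the 'a' that begins the current candidate
    offset = 0  # offset of the byte being examined
    while True:
        byte = stream.read(1)  # exactly one byte is pulled in per step
        if not byte:
            break
        if byte == b"a":
            # An 'a' always starts a fresh candidate, whatever came before.
            state, start = 1, offset
        elif byte == b"b" and state in (1, 2):
            state = 2
        elif byte == b"c" and state == 2:
            yield (start, offset + 1)
            state = 0
        else:
            state = 0
        offset += 1

matches = list(stream_find_abplusc(io.BytesIO(b"xxabbbc--ac--aabc")))
print(matches)
```

Only three small variables are kept between bytes, so a match spanning gigabytes costs no more memory than a short one. Patterns that need backreferences or captured text cannot be handled this way without buffering at least the captured part.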

Re^2: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Beadle) on May 04, 2014 at 07:47 UTC

    Apart from the fact that a CPU can only see data in memory, there are also useful things like buffering layers. Of course I am aware of that; sorry for not being precise enough.

    What I meant was whether I could have regular expressions act directly on a file without having to explicitly load parts of that file into a variable in memory, or more precisely, without having to load parts whose size depends on the expected size of the match. Such a dependency would mean that I generally have to load the complete file into memory, because I don't know anything about the size of possible matches in advance.

    Regarding the streaming XML parsers: several years ago, I tried some of them for another project. They were all fine examples of the very same problem:

    They were streaming only in the sense that they read the source file line by line (but what if the source file does not contain any line breaks, which is perfectly acceptable according to the XML standard?) or that they broke the file into chunks at syntactical markers, e.g. tags (but what if there were 100 GB of text before the next marker?). I didn't see any streaming XML parser that didn't rely on such mechanisms. I admit that I have tested only a few of these parsers, so I might have missed the ultimate one.
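One way to get close to "no explicit loading into a variable" is to memory-map the file and hand the mapping to the regex engine, letting the OS page data in and out on demand. The sketch below uses Python's standard `mmap` and `re` modules purely as an illustration; Perl has modules for memory-mapping files as well. The helper name `find_in_file` and the demo file contents are made up for the example. Note the assumptions: the mapping must fit into the process's address space, and a very long match may still make the engine touch a lot of pages.

```python
import mmap
import os
import re
import tempfile

def find_in_file(path, pattern):
    """Return the (start, end) offsets of the first match of a bytes pattern, or None."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; nothing is copied into a Python bytes object.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = re.search(pattern, mm)
            # Take the offsets while the mapping is still open.
            return m.span() if m else None

# Demo: a file with one million filler bytes before the interesting tag.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"<root>" + b"x" * 1_000_000 + b"<item>42</item></root>")
    demo_path = tmp.name

span = find_in_file(demo_path, rb"<item>(\d+)</item>")
os.remove(demo_path)
print(span)
```

The size of a possible match never has to be known in advance here, because no chunk boundaries are chosen by the program. The price is that the data still passes through memory, just under the OS's control rather than the script's.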
