PerlMonks
Re: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Beadle)
on May 03, 2014 at 09:42 UTC
First of all, I would like to thank everybody! I am really overwhelmed by the number and the quality of the replies so far, and I feel great respect for everybody who took the time.

Having said this, I would like to make things clearer:

I am interested in this problem at a theoretical level as well as at a practical level. My interest at the theoretical level arises from the fact that I do not generate these files myself, i.e. I am not in control of how they are formatted or how they are structured logically. I just have to answer the question of whether certain patterns occur in these files, and the answer must be absolutely reliable.

Thus, I first need to know whether the problem can be solved theoretically, ignoring runtime problems. This means that breaking the files into chunks is not an option, because I am forbidden to assume anything regarding their contents or structure (except that they are text files encoded in UTF-8, and that there are some special character encoding rules). Notably, I must not assume that a possible match is limited to a certain length.

Unfortunately, I am forbidden to give details about the actual files or how they are generated. However, I can give an example which has nothing to do with the actual situation, but poses the same problem at a theoretical level. Suppose you have a file which has the following structure:
- Text data of arbitrary length and structure, not containing the characters < or >, followed by

Suppose arbitrary length really could mean 100 kB or several hundred GB, and suppose the job is to determine whether there are paired tags / end tags and to extract the inner content of these.

Please note that I personally would probably solve such an easy case by writing a mini-parser (state machine) without using regular expressions. But the patterns which actually must be searched for are much more complicated, and again, the question of whether this could be solved theoretically by letting a regex act directly on a file is still interesting and important to me, without caring about performance and runtime for now.

Thus, I would like to give special thanks to davido, who proposed using File::Map, and I would be grateful if some more experts could give their opinions regarding the pros and cons of this possible solution.

Thank you very much again,

Nocturnus
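
For illustration only, the mini-parser (state machine) mentioned above for the simplified tag example could be sketched roughly as follows. This is a minimal sketch under assumptions that are not part of the post: the tag syntax `<name>` / `</name>` is hypothetical, pairs are assumed to be flat (not nested), and the function name `scan_tag_pairs` is made up. It reads the file in fixed-size blocks and records only byte offsets of the inner content, so memory use does not grow with the length of the text between tags (only tag names are buffered). Scanning raw bytes is safe for UTF-8 input because the bytes for `<` and `>` never occur inside a multi-byte sequence.

```perl
use strict;
use warnings;

# Scan a file block by block and return a list of [name, start, length]
# entries, one per matched tag pair. "start" and "length" are byte offsets
# describing the inner content between the open tag and its end tag.
# ASSUMPTION (hypothetical format): tags look like <name> ... </name>,
# pairs are flat (no nesting), and text never contains '<' or '>'.
sub scan_tag_pairs {
    my ($path, $block_size) = @_;
    $block_size ||= 64 * 1024;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @pairs;
    my $in_tag = 0;     # true while we are between '<' and '>'
    my $tag    = '';    # tag text collected so far (may span blocks)
    my $open;           # name of the currently open tag, if any
    my $start;          # byte offset where its inner content begins
    my $tag_pos;        # byte offset of the '<' of the current tag
    my $pos    = 0;     # byte offset of the start of the current block
    my $buf;

    while (my $n = read($fh, $buf, $block_size)) {
        my $i = 0;
        while ($i < $n) {
            if ($in_tag) {
                my $end = index($buf, '>', $i);
                if ($end < 0) {              # tag continues in next block
                    $tag .= substr($buf, $i);
                    last;
                }
                $tag .= substr($buf, $i, $end - $i);
                $in_tag = 0;
                $i = $end + 1;
                if (defined $open && $tag eq "/$open") {
                    # Matching end tag: inner content ends at its '<'.
                    push @pairs, [$open, $start, $tag_pos - $start];
                    undef $open;
                }
                elsif (!defined $open && $tag !~ m{^/}) {
                    # New open tag: inner content starts after its '>'.
                    ($open, $start) = ($tag, $pos + $i);
                }
            }
            else {
                my $lt = index($buf, '<', $i);
                last if $lt < 0;             # plain text until block end
                ($in_tag, $tag, $tag_pos) = (1, '', $pos + $lt);
                $i = $lt + 1;
            }
        }
        $pos += $n;
    }
    close $fh;
    return @pairs;
}
```

With the returned offsets, the inner content of a pair could then be streamed out with `seek` and `read` instead of being held in memory. This does not answer the question of applying an arbitrary regex to a file; it only covers the simplified example.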