Re^3: Possible to have regexes act on file directly (not in memory)

by karlgoethebier (Curate)
on May 02, 2014 at 18:00 UTC ( #1084829=note )


in reply to Re^2: Possible to have regexes act on file directly (not in memory)
in thread Possible to have regexes act on file directly (not in memory)

"You need to have a sliding window on top of it..."

Yes, sure. Anyway, I wonder how to implement this (with one terabyte of Lorem ipsum) ;-(

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»


Re^4: Possible to have regexes act on file directly (not in memory)
by Laurent_R (Parson) on May 02, 2014 at 18:56 UTC
    I have basically explained how to do it: keep part of the previous chunk and append the new chunk to it. The real difficulty is determining the length of the longest possible match for the regex, which dictates how much to keep from one chunk to the next. For some regexes this is very easy; for others it is very difficult or even impossible. The OP does not give enough information on that.

    Then there is the question of the size of the input. On my server, processing a 10 GB (line-based) file with a relatively simple regex might take 5 to 10 minutes. It would probably be a bit faster if it were not line-based but read in chunks of, say, 1 MB. With a TB of data it would take quite a bit of time, but that might still be relatively manageable. All of this assumes a simple regex with no need to backtrack, though; with a regex that requires a lot of backtracking, it might very easily become completely unmanageable.
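
    For illustration, here is a minimal sketch of this chunked approach, assuming the longest possible match is known. The file name, pattern and sizes are made-up placeholders; it prints matches as it finds them:

        use strict;
        use warnings;

        my $file       = 'big_file.dat';        # placeholder name
        my $re         = qr/foo\w{0,100}bar/;   # hypothetical pattern
        my $max_match  = 106;                   # longest possible match for $re
        my $chunk_size = 1024 * 1024;           # read 1 MB at a time

        open my $fh, '<:raw', $file or die "open $file: $!";

        my ($buf, $eof) = ('', 0);
        until ($eof) {
            my $n = read $fh, my $chunk, $chunk_size;
            die "read: $!" unless defined $n;
            $eof = $n == 0;
            $buf .= $chunk;

            # Report only matches starting before the retained tail; a match
            # starting inside the tail is seen again, complete, next round.
            my $tail_start = $eof ? length $buf : length($buf) - ($max_match - 1);
            $tail_start = 0 if $tail_start < 0;
            while ($buf =~ /$re/g) {
                last if !$eof && $-[0] >= $tail_start;
                print "match: $&\n";
            }
            $buf = substr $buf, $tail_start unless $eof;
        }
        close $fh;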

      > For some regexes, it is very easy, for others, it is very difficult or even impossible.

      It's often much simpler than you might think, because you can decompose regexes into smaller and easier parts.

      perl -e'use re 'debug';qr/x{100}.*y{100}/'
      Compiling REx "x{100}.*y{100}"
      Final program:
         1: CURLY {100,100} (5)
         3:   EXACT <x> (0)
         5: STAR (7)
         6:   REG_ANY (0)
         7: CURLY {100,100} (11)
         9:   EXACT <y> (0)
        11: END (0)
      anchored "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"... at 0
      floating "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"... at 100..2147483647 (checking floating)
      minlen 200
      Freeing REx: "x{100}.*y{100}"

      In this case you start looking for 'x' x 100 in a sliding window of size > 200, beginning at the start of the file. Then you search backwards from the end, in sliding windows, for 'y' x 100.

      This way even greedy matches can (mostly) be handled, and the total match might even cover terabytes.
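
      A rough sketch of this two-pass idea in Perl (file name and window size are made up; it assumes /s semantics, i.e. that .* may cross newlines, and only decides whether a match exists):

        use strict;
        use warnings;

        my $file   = 'big_file.dat';   # placeholder
        my $window = 4096;             # sliding window, comfortably > 200

        open my $fh, '<:raw', $file or die "open: $!";
        my $size = -s $fh;

        # Forward pass: file offset just past the first x{100}, if any.
        my ($first_x_end, $buf, $pos) = (undef, '', 0);
        while (!defined $first_x_end) {
            my $n = read $fh, my $chunk, $window;
            die "read: $!" unless defined $n;
            last if $n == 0;
            $buf .= $chunk;
            if ($buf =~ /x{100}/) {
                $first_x_end = $pos + $+[0];
            } elsif (length($buf) > 99) {
                $pos += length($buf) - 99;     # keep 99 bytes of overlap
                $buf = substr $buf, -99;
            }
        }

        # Backward pass: file offset of the start of the last y{100}, if any.
        my ($last_y_start, $tail) = (undef, '');
        my $off = $size;                       # offset where $tail begins
        while (!defined $last_y_start && $off > 0) {
            my $step = $off >= $window ? $window : $off;
            seek $fh, $off - $step, 0 or die "seek: $!";
            read($fh, my $chunk, $step) // die "read: $!";
            $off -= $step;
            $tail = $chunk . $tail;
            while ($tail =~ /y{100}(?!y)/g) {  # last 100 y's of a maximal run
                $last_y_start = $off + $-[0];  # keep the rightmost one seen
            }
            # keep only the front, where a run could still cross the boundary
            $tail = substr $tail, 0, 100
                if !defined $last_y_start && length($tail) > 100;
        }
        close $fh;

        if (defined $first_x_end && defined $last_y_start
            && $last_y_start >= $first_x_end) {
            print "match possible: x-run ends at $first_x_end, ",
                  "last y-run starts at $last_y_start\n";
        }
        else {
            print "no match\n";
        }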

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        Thank you for your very interesting point.

        If I have understood you correctly (I think I have), this is IMHO no longer a straight regex pattern match; it is really a very basic parser using several regexes. Or it can be viewed as building your own very basic pseudo-regex engine in pure Perl, using Perl's built-in regex engine to match the various components of the original pattern.

        Yes, I believe this can indeed work in a number of cases, but it can quickly become hairy for somewhat complicated patterns with more than one * or + quantifier, alternations, captures, look-around assertions, etc. I have actually built, a few times, very primitive parsers using several regexes to progressively parse data that I could not figure out how to analyze with a single regex. This works, but it means that if the pattern changes, you not only need to change the pattern, you probably also need to change the code around it.

        But all this is really speculation as long as the OP does not tell us what kind of regexes she or he wants to use.

Re^4: Possible to have regexes act on file directly (not in memory)
by hippo (Curate) on May 02, 2014 at 18:59 UTC

    The size of the data doesn't matter, it's the maximum size of the regex which counts. Suppose the maximum match size is n; then your chunk size should be n (or greater). The algorithm is then (a minimal Perl sketch follows):

    1. Load 1 chunk into RAM
    2. Load another chunk into RAM
    3. Concatenate the 2 chunks for testing
    4. Discard the older chunk
    5. Goto 2

    HTH.
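
    For instance, a minimal Perl sketch of these five steps (file name and pattern are placeholders; it only tests whether a match exists, since reporting every match would also require de-duplicating matches that fall entirely within the shared chunk):

        use strict;
        use warnings;

        my $n  = 64 * 1024;            # chunk size, >= maximum match size
        my $re = qr/some_pattern/;     # hypothetical pattern

        open my $fh, '<:raw', 'big_file.dat' or die "open: $!";

        read $fh, my $old, $n;                     # 1. load 1 chunk into RAM
        my $found = $old =~ $re;                   #    (covers one-chunk files)
        while (!$found && read $fh, my $new, $n) { # 2. load another chunk
            $found = ($old . $new) =~ $re;         # 3. concatenate the 2 chunks
            $old = $new;                           # 4. discard the older chunk
        }                                          # 5. goto 2
        print $found ? "match found\n" : "no match found\n";
        close $fh;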

      The size of the data doesn't matter, it's the maximum size of the regex which counts.

      Surely, in functional or algorithmic terms, the size of the maximum possible match is the important thing. But the size of the data matters very much in terms of feasibility: sometimes it just can't be done, because it would take way too long.

      About two years ago, I had a program in a PL/SQL-like language for extracting data inconsistencies that would have taken about 60 days to complete (and that was after heavy optimization; the original version would have taken 180 days). In context, 3 or 4 days would have been acceptable, not 60: the idea was to correct the inconsistencies, and you simply can't make DB corrections based on data extracted 60 days earlier.

      You might be interested to know that I solved the problem thanks to Perl. I removed most of the very complicated business logic (at least the part that was taking ages to execute) from the PL/SQL-like program, extracted raw files, and reprocessed those files with Perl. The program now runs in about 12 hours. The main difference is that Perl has very efficient hashes enabling very fast look-ups, whereas the PL/SQL-like language did not, forcing a linear search through relatively large arrays billions of times.

      BTW, this success contributed quite a bit to convincing my colleagues to use Perl. When I arrived in my department, nobody was using Perl for anything more than a few one-liners here and there in shell scripts; all of my colleagues now use Perl almost daily. Even our client has been convinced: I only need to propose rewriting this or that program in Perl to improve performance (and of course I do that only when I have good reason to think we will really get improved results), and they allocate the budget almost without further ado.
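
      As an aside, the hash point can be shown with a tiny sketch (the data is made up): building a hash once turns billions of linear array scans into constant-time look-ups.

        use strict;
        use warnings;

        # Made-up data: check many keys against a large reference list.
        my @reference = map { "id$_" } 1 .. 1_000_000;
        my @queries   = map { "id" . int rand 2_000_000 } 1 .. 1_000;

        # Linear search, O(n) per query -- roughly what the PL/SQL-like
        # language forced (do not run this on large data):
        # my @hits = grep { my $q = $_; grep { $_ eq $q } @reference } @queries;

        # Hash look-up: build %seen once, then O(1) per query.
        my %seen = map { $_ => 1 } @reference;
        my @hits = grep { $seen{$_} } @queries;

        printf "%d of %d queries found\n", scalar @hits, scalar @queries;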

      OK, this was somewhat off-topic, but my point was: if the data is really big (and I work daily with gigabytes or dozens of gigabytes of data), the size of the input can really make the difference between things that are feasible and things that are not.
