PerlMonks  

Re: Parsing 12GB Entourage database in pieces...

by betterworld (Curate)
on Aug 28, 2008 at 19:52 UTC ( [id://707579]=note )


in reply to Parsing 12GB Entourage database in pieces...

There might be a completely different solution. Using Sys::Mmap, you can map the entire file into a single string and then brute-force through that string with something like while ($string =~ m/.../g) {...} (as you hinted in your post).

However, there are some caveats:

  • Not all operating systems support mmap;
  • you'd need a 64-bit system to fit those 12GB into your program's address space;
  • I don't know how regular expressions perform with such a huge string. I've just tried it on a mmapped 200MB file (most of which consisted of \0) and it did quite well, but that's much smaller than your file.
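A minimal sketch of the mmap approach, assuming Sys::Mmap is installed from CPAN (it is not a core module). The helper name mmap_match_offsets and the marker pattern in the usage line are placeholders for illustration, and I have not run this against anything close to 12GB.

```perl
use strict;
use warnings;
use Sys::Mmap;    # CPAN module; exports mmap, munmap, PROT_READ, MAP_SHARED

# Map $path read-only into a scalar and return the start offset of every
# match of $re. The file is not read into memory up front; the operating
# system pages it in as the regex engine walks the string.
sub mmap_match_offsets {
    my ($path, $re) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";

    # A length of 0 asks Sys::Mmap to map the whole file.
    mmap(my $data, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap $path: $!";

    my @offsets;
    while ($data =~ /$re/g) {
        push @offsets, $-[0];    # offset where this match starts
    }

    munmap($data) or die "munmap: $!";
    close $fh;
    return @offsets;
}

# Command-line use: perl script.pl FILE
# The marker below is only an example pattern.
if (@ARGV) {
    print "$_\n" for mmap_match_offsets($ARGV[0], qr/\0\0MSrc/);
}
```

Collecting only offsets keeps memory use small; the matched text can be pulled out with substr on the mapped string while it is still mapped.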

Replies are listed 'Best First'.
Re^2: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 28, 2008 at 20:06 UTC

    I don't know how regular expressions perform with such a huge string.

    Some uses of "*" are equivalent to "{0,32767}", so you might have problems.

    >perl -Mre=debug -we"qr/^(.)(\1*)\z/"
    ...
       9: CURLYX[1] {0,32767}(14)
    ...

    Be sure to prevent backtracking using an atomic group, (?>...), or (in 5.10.0+) a possessive quantifier such as *+.

    Update: "/\0\0MSrc.{16}((?>[^\0]*))(?=\0)/s" looks safe.
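A minimal sketch of that pattern in use. The sample buffer and the helper name extract_fields are made up for illustration, since the thread does not show the real record layout.

```perl
use strict;
use warnings;

# Return every NUL-terminated field that follows the "\0\0MSrc" marker
# plus 16 further bytes. The atomic group (?>...) stops the engine from
# giving characters back once [^\0]* has consumed them.
# On 5.10.0+ the capture could also be written as ([^\0]*+).
sub extract_fields {
    my ($data_ref) = @_;
    my @fields;
    while ($$data_ref =~ /\0\0MSrc.{16}((?>[^\0]*))(?=\0)/sg) {
        push @fields, $1;
    }
    return @fields;
}

# Hypothetical sample data: two records with 16 filler bytes each.
my $sample = "junk" . "\0\0MSrc" . ("A" x 16) . "first"  . "\0"
           . "more" . "\0\0MSrc" . ("B" x 16) . "second" . "\0";

print "$_\n" for extract_fields(\$sample);    # prints "first", then "second"
```

Passing a reference avoids copying the buffer, which matters if the string is a mapped multi-gigabyte file.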

      Interesting...

      perl5.8.8 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2'
      # (segfaults)
      perl5.10.0 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2'
      Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
      32767

      Well, at least it seems to warn when this limitation affects the result...

        I believe p5p is working on a patch to change that warning to a die.
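In the meantime the warning can be made fatal locally. A minimal sketch, assuming the message belongs to the 'regexp' warnings category as perldiag lists it; run_length is a hypothetical helper name. Whether the 40_000-character case hits the limit depends on the perl version, so no output is shown for it.

```perl
use strict;
use warnings;

# Match a run of one repeated character at the start of $s.
# Returns (length, undef) on a clean match, or (undef, message) if perl
# raised a regexp warning, which FATAL turns into an exception.
sub run_length {
    my ($s) = @_;
    my $len = eval {
        use warnings FATAL => 'regexp';
        $s =~ /^(.)(\1*)/s ? 1 + length $2 : 0;
    };
    return defined $len ? ($len, undef) : (undef, $@);
}

my ($len, $err) = run_length("x" x 40_000);
print defined $len
    ? "matched $len characters\n"
    : "regex limit hit: $err";
```

This way a truncated match surfaces as an error instead of a wrong length that is easy to miss among other output.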
