PerlMonks  

regex on gigabyte string

by focusonz (Initiate)
on Jan 26, 2013 at 16:12 UTC ( #1015509=perlquestion )
focusonz has asked for the wisdom of the Perl Monks concerning the following question:

my @celltags = ($bigstring =~ /(<c.*?\/c>)/g); surrenders (returns zero length array) on a 5GB xml string using 5.16.2 perl 64bit! Works OK on 1GB string!

Any ruminations on such an eclectic matter?

Re: regex on gigabyte string
by Athanasius (Monsignor) on Jan 26, 2013 at 16:30 UTC

    Hello focusonz, and welcome to the Monastery!

    Do you need @celltags to be fully populated before proceeding? If not, it might be worth trying:

    while ($bigstring =~ m{(<c.*?/c>)}g) {
        # Process $1...
    }

    Just a thought (untested). Hope it helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Tried that but took interminably long execution time even on the 1GB string.

      The my @celltags = ($bigstring =~ /(<c.*?\/c>)/g); executed quickly on 1GB strings.

Re: regex on gigabyte string
by RichardK (Priest) on Jan 26, 2013 at 16:50 UTC

    You might find it easier to use XML::Twig which is designed to process huge xml files.

      Thanks. Maybe I will test it next time around.

      I settled on a substr()-based algorithm to parse $bigstring. It is fast, and I am now processing upwards of 30GB in total, in 5GB chunks.
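
The poster's actual substr() code is not shown in the thread. As a rough sketch of the idea only, the hypothetical helper below (each_cell_tag is an invented name) walks the string with index() and substr(), so the regex engine is never involved. It assumes a cell span runs from "<c" to the next "/c>". Unlike the original regex, it does not stop at newlines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): call $callback on every
# "<c ... /c>" span in the string referenced by $str_ref, using only
# index() and substr().
sub each_cell_tag {
    my ($str_ref, $callback) = @_;
    my $pos   = 0;
    my $count = 0;
    while (1) {
        my $start = index($$str_ref, '<c', $pos);
        last if $start < 0;                 # no more opening tags
        my $end = index($$str_ref, '/c>', $start + 2);
        last if $end < 0;                   # unterminated tag: stop
        $end += 3;                          # include the closing "/c>"
        $callback->(substr($$str_ref, $start, $end - $start));
        $count++;
        $pos = $end;                        # resume after this span
    }
    return $count;
}
```

A call might look like each_cell_tag(\$bigstring, sub { push @celltags, $_[0] }). Passing a reference avoids copying the multi-gigabyte string.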

Re: regex on gigabyte string
by LanX (Canon) on Jan 26, 2013 at 16:54 UTC
    As others said, better use a real XML module!

    > Any ruminations on such an eclectic matter?

    Regarding 5 GB strings, I wouldn't be surprised if the string can't be held entirely in memory and Perl's attempts to swap it out lead to problems.

    Cheers Rolf

      I am running 64-bit perl 5.16.2 on a 16GB i7 machine. I have given it a cursory test up to the 16GB maximum and found no other problems.

      Could it be that the regex engine in x64 perl still has a 4GB limit? substr() works on a $bigstring of up to 16GB.

Re: regex on gigabyte string
by dave_the_m (Parson) on Jan 26, 2013 at 17:46 UTC
    The regex engine is largely 32-bit; in particular, the indices of captures within a string are stored as 32-bit signed values, so they can't do captures more than 2GB into a string.

    Dave.
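
To make the arithmetic concrete: a 32-bit signed value tops out at 2**31 - 1 = 2147483647. The sketch below is a hypothetical guard (regex_safe_length and MAX_REGEX_OFFSET are invented names) that checks a string length against that bound before a regex is attempted. It reflects only what is reported in this thread for perl 5.16 and is not a statement about other versions.

```perl
use strict;
use warnings;

# Largest value a 32-bit signed integer can hold.
use constant MAX_REGEX_OFFSET => 2**31 - 1;

# Hypothetical guard (not a core function): true if every offset in a
# string of the given length fits in a 32-bit signed integer.
sub regex_safe_length {
    my ($length) = @_;
    return $length <= MAX_REGEX_OFFSET;
}
```

One might call regex_safe_length(length $bigstring) and fall back to chunked processing when it returns false.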

      Worse than that, I've seen tools explicitly dump something like (...)*? in a regex as something very close to (...){0,32766}?, because repetition only supported 15 bits, not 32, at least in some cases (but maybe that isn't true of modern versions of Perl). But it also seemed like those tools didn't always do such. So I'm not sure how often that limitation applies.

      But it is easy to find the breaking point for this particular regex:

      $ perl -del
      DB<1> x 0+( () = join('','<c>','x'x(1<<30),'</c>') =~ m{<c.*?/c>}g )
      0  1
      DB<2> x 0+( () = join('','<c>','x'x(1<<31),'</c>') =~ m{<c.*?/c>}g )
      0  0
      DB<3> x 0+( () = join('','<c>','x'x((1<<31)-8),'</c>') =~ m{<c.*?/c>}g )
      0  1
      DB<4> x 0+( () = join('','<c>','x'x((1<<31)-7),'</c>') =~ m{<c.*?/c>}g )
      0  0

      So (my version of) Perl can't deal with a capture string of more than 2**31-1 characters. And:

      $ perl -del
      DB<2> x 0+( () = join('',('<c>','x'x((1<<30)-10),'</c>')x2) =~ m{<c.*?/c>}g )
      0  2
      DB<1> x 0+( () = join('',('<c>','x'x((1<<30)-10),'</c>')x3) =~ m{<c.*?/c>}g )
      0  0

      Surprisingly, it fails to even find the first match if there is a match beyond the 2**31-1 character position? Even trying to iterate to that point doesn't really help (perhaps .*? backtracks?):

      $ perl -del
      DB<1> $x = join('',('<c>','x'x((1<<30)-10),'</c>')x2); while( $x =~ m{<c.*?/c>}g ) { print pos($x), $/ }
      1073741821
      2147483642
      DB<1> $x = join('',('<c>','x'x((1<<30)-10),'</c>')x3); while( $x =~ m{<c.*?/c>}g ) { print pos($x), $/ }
      DB<2>

      So one needs to deal with the string in reasonably-sized chunks. Which makes me wonder which XML-parsing modules manage to get that right. Their test suites should include a tag with a 4GB attribute value (with an escaped character at the end). :)

      - tye        
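
One way to work in reasonably-sized chunks is sketched below. This is only an illustration and not code from any poster. The hypothetical helper scan_cells_in_chunks reads a filehandle in fixed-size pieces, runs the thread's regex on a small buffer, and carries any partial tag over to the next read. The name, the callback interface and the chunk size are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read $fh in $chunk_size
# pieces and call $callback on each "<c ... /c>" span, so the regex
# only ever sees a small buffer.
sub scan_cells_in_chunks {
    my ($fh, $chunk_size, $callback) = @_;
    my $buffer = '';
    my $count  = 0;
    while (1) {
        my $got = read($fh, my $chunk, $chunk_size);
        die "read failed: $!" unless defined $got;
        last if $got == 0;                  # end of input
        $buffer .= $chunk;

        my $consumed = 0;
        while ($buffer =~ m{(<c.*?/c>)}g) {
            $callback->($1);
            $count++;
            $consumed = pos($buffer);       # end of last complete match
        }
        substr($buffer, 0, $consumed, '');  # drop what was processed

        # Keep only what could still begin a match: text from the next
        # "<c" onward, or a single trailing character that may be "<".
        my $open = index($buffer, '<c');
        if ($open > 0) {
            substr($buffer, 0, $open, '');
        }
        elsif ($open < 0 && length($buffer) > 1) {
            $buffer = substr($buffer, -1);
        }
    }
    return $count;
}
```

Because the pattern is lazy, a span found in the partial buffer is the same one that would be found in the whole string, so splitting the input does not change which spans are reported. A span that never closes will still make the buffer grow until end of input.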

      There you go! That is the shortcoming.

      keywords: PERL 64bit(x64) and 32bit(x86) MAXIMUM STRING SIZE REGEX ENGINE 2GB per dave_the_m and focusonz test

      keywords: PERL MAXIMUM STRING SIZE 4GB 32bit(x86) ALSO IS WINDOWS X86 LIMIT

      keywords: PERL MAXIMUM STRING SIZE 16GB 64bit(x64) ALSO IS Windows 7 Home Premium X64 LIMIT

      See http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx#physical_memory_limits_windows_7
Re: regex on gigabyte string
by BrowserUk (Pope) on Jan 26, 2013 at 18:27 UTC

    Whilst loading strings > 4GB is no problem on a 64-bit Perl (assuming you have the memory), unfortunately, there are still many places in the core where such huge strings are simply not supported.

    Two examples:

    1. substr doesn't accept offsets > 2GB
    2. Regexes don't operate on strings > 2GB.

    It's a pain in the lower lumbar region, but probably won't change any time soon.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Whoa back!

      I am using a construct of if( substr($bigstring, $begtagidx, 5) eq "<c r=" ) Where $begtagidx is out to 4 billion and have not seen problem.

      But the data verification process is not yet terminated so I will have to get back to you cloistered people on this.

      thanks for the pearls of scripture !
        > Where $begtagidx is out to 4 billion and have not seen problem.

        Okay. It seems that limitation has been lifted with 5.16 (I still use 5.10.1 as my primary Perl, where the limit still applies):

        say $];;
        5.016001
        $s = 'fred'; $s x= 1024**3;;
        print substr( $s, -4 );;
        fred

        But the 2GB limit on regex still persists in 5.16:

        [19:51:25.70] C:\test>\perl64-16\bin\perl \perl64\bin\p1.pl
        [0] Perl> say $];;
        5.016001
        [0] Perl> $s = 'fred'; $s x= 1024**3;;
        [0] Perl> ++$n while $s =~ /fred/g; say $n;;
        Use of uninitialized value $n in say at (eval 9) line 1, <STDIN> line 3.
        [0] Perl> $s = 'fr'; $s x= 1024**3;;
        [0] Perl> ++$n while $s =~ /fr/g; say $n;;
        Use of uninitialized value $n in say at (eval 11) line 1, <STDIN> line 5.
        [0] Perl> $s = 'fr'; $s x= 1020**3;;
        [0] Perl> ++$n while $s =~ /fr/g; say $n;;
        1061208000
        [0] Perl>

