
Re^2: regex on gigabyte string (31 bits)

by tye (Sage)
on Jan 26, 2013 at 19:11 UTC ( #1015527=note )

in reply to Re: regex on gigabyte string
in thread regex on gigabyte string

Worse than that, I've seen tools explicitly dump something like (...)*? in a regex as something very close to (...){0,32766}?, because repetition counts only supported 15 bits, not 32, at least in some cases (but maybe that isn't true of modern versions of Perl). But it also seemed like those tools didn't always do so. So I'm not sure how often that limitation applies.

But it is easy to find the breaking point for this particular regex:

$ perl -del
  DB<1> x 0+( () = join('','<c>','x'x(1<<30),'</c>') =~ m{<c.*?/c>}g )
0  1
  DB<2> x 0+( () = join('','<c>','x'x(1<<31),'</c>') =~ m{<c.*?/c>}g )
0  0
  DB<3> x 0+( () = join('','<c>','x'x((1<<31)-8),'</c>') =~ m{<c.*?/c>}g )
0  1
  DB<4> x 0+( () = join('','<c>','x'x((1<<31)-7),'</c>') =~ m{<c.*?/c>}g )
0  0

So (my version of) Perl can't deal with a matched string of more than 2**31-1 characters. And:

$ perl -del
  DB<2> x 0+( () = join('',('<c>','x'x((1<<30)-10),'</c>')x2) =~ m{<c.*?/c>}g )
0  2
  DB<1> x 0+( () = join('',('<c>','x'x((1<<30)-10),'</c>')x3) =~ m{<c.*?/c>}g )
0  0

Surprisingly, it fails to find even the first match if there is a match beyond character position 2**31-1. Even trying to iterate up to that point doesn't really help (perhaps .*? backtracks?):

$ perl -del
  DB<1> $x = join('',('<c>','x'x((1<<30)-10),'</c>')x2); while( $x =~ m{<c.*?/c>}g ) { print pos($x), $/ }
1073741821
2147483642
  DB<1> $x = join('',('<c>','x'x((1<<30)-10),'</c>')x3); while( $x =~ m{<c.*?/c>}g ) { print pos($x), $/ }

  DB<2>
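The sizes used in these sessions can be checked without allocating any gigabyte strings. The sketch below only does the arithmetic; element_length is a hypothetical helper name, and it assumes a 64-bit perl so that 1<<31 doesn't overflow.

```perl
use strict;
use warnings;

# Length of '<c>' . ('x' x $n) . '</c>': 3 + $n + 4 characters.
sub element_length {
    my ($n) = @_;
    return 3 + $n + 4;
}

my $limit = (1 << 31) - 1;    # 2**31-1, assuming a 64-bit perl

# First session: a single element.
for my $n ((1 << 31) - 8, (1 << 31) - 7) {
    my $len = element_length($n);
    printf "%d x's -> %d chars (%s the limit)\n",
        $n, $len, $len <= $limit ? 'within' : 'past';
}

# Later sessions: 2 or 3 elements of (1<<30)-10 x's each.
my $each = element_length((1 << 30) - 10);
for my $copies (2, 3) {
    my $total = $copies * $each;
    printf "%d copies -> %d chars (%s the limit)\n",
        $copies, $total, $total <= $limit ? 'within' : 'past';
}
```

The two-copy total, 2147483642, is the second pos() value printed above; the three-copy string is the first one to run past 2**31-1.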

So one needs to deal with the string in reasonably-sized chunks. Which makes me wonder which XML-parsing modules manage to get that right. Their test suites should include a tag with a 4GB attribute value (with an escaped character at the end). :)
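One way to do the chunking, as a rough sketch rather than a tested fix for the gigabyte case: feed the string to the regex a window at a time and carry the unmatched tail into the next window. count_matches_chunked is a hypothetical helper, not something from a module. It assumes a pattern like <c.*?/c> whose match doesn't change when more text is appended after it, and it assumes no single element comes anywhere near the 2**31-1 limit itself.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of $re in the string referenced by
# $sref, never handing the regex engine more than the carried-over tail
# plus one window of $chunk characters.
sub count_matches_chunked {
    my ($sref, $re, $chunk) = @_;
    my $len    = length $$sref;
    my $offset = 0;     # start of the part of the string not yet read
    my $buf    = '';    # unmatched tail carried over from earlier windows
    my $count  = 0;

    while ($offset < $len) {
        $buf .= substr($$sref, $offset, $chunk);
        $offset += $chunk;

        my $last_end = 0;
        while ($buf =~ /$re/g) {
            $count++;
            $last_end = pos($buf);
        }
        # Drop everything up to the end of the last complete match, so a
        # match straddling the window boundary is retried with more text.
        substr($buf, 0, $last_end, '');
    }
    return $count;
}

# Small demonstration: 5 elements of 17 characters, read 7 at a time.
my $s = join '', ('<c>', 'x' x 10, '</c>') x 5;
print count_matches_chunked(\$s, qr{<c.*?/c>}, 7), "\n";    # prints 5
```

The carried tail is rescanned on every window, so an element spanning many windows costs quadratic time, and text that never matches accumulates in the buffer. A real implementation would want to trim the tail back to the last possible match start.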

- tye        
