
Re: Tokenising a 10MB file trashes a 2GB machine

by Anonymous Monk
on Jul 16, 2008 at 09:36 UTC


in reply to Tokenising a 10MB file trashes a 2GB machine

Add use re 'debug'; and see whether it sheds any light. This replicates your results:
perl -Mre=debug -e"$f = q~a b c~ x 1;$g = [split m{\p{IsSpace}}ms, $f];"
perl -e"die 1E7"
perl -Mre=debug -e"$f = q~a b c~ x 1E7;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2

Re^2: Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 10:39 UTC
    perl -Mre=debug -e"$f = q~a b c~ x 1E4;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
    The sheer size of the debugger output makes it impossible to run this with the 1E7 multiplier. I still do not know how to interpret the output, but maybe someone here does.
    Multiplier   Size of debugger output
    1            14 KiB
    10           21 KiB
    100          137 KiB
    1E3          5.7 MiB
    1E4          507 MiB
    1E5          50 GiB

    Therefore I predict that the debugger output would be (at least) about 5 TiB for 1E6. The size comes from the fact that every match is preceded by a printout of the complete string still to be matched, which gets shorter by one token each time the regexp matches. The total output therefore grows roughly with the square of the multiplier. It also means the sizes above halve if we use e.g. q{1234 } instead of q{a b c}, because each repetition then contains one separator rather than two.
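The growth described above can be sketched as a small calculation. This is only a rough model, not a measurement: the sub name estimate_debug_bytes and its parameters (length, matches, copies, overhead) are hypothetical, and the number of copies of the remaining string printed per match is an assumption you would have to adjust to your perl version.

```perl
use strict;
use warnings;

# Rough model (an assumption, not measured): before each of the
# successful matches the debugger prints some copies of the string that
# is still left to split, plus a fixed amount of other text.
sub estimate_debug_bytes {
    my (%arg) = @_;
    my ($length, $matches) = @arg{qw(length matches)};
    my $copies   = exists $arg{copies}   ? $arg{copies}   : 1;
    my $overhead = exists $arg{overhead} ? $arg{overhead} : 0;

    # The remaining length shrinks linearly from $length down to
    # $length / $matches, so the printed copies add up to
    # $length * ($matches + 1) / 2 bytes per copy.
    return $copies * $length * ($matches + 1) / 2 + $matches * $overhead;
}

# q~a b c~ x N is 5 * N bytes long and contains 2 * N spaces.
for my $n (1e3, 1e4, 1e5, 1e6) {
    my $bytes = estimate_debug_bytes(length => 5 * $n, matches => 2 * $n);
    printf "%-8g %.3g bytes\n", $n, $bytes;
}
```

Because both the string length and the number of matches scale with the multiplier, each tenfold increase of the multiplier raises the estimate by roughly a factor of one hundred, which is the pattern visible in the larger rows of the table.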

    In between these printouts there is always the same output:

    Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 1234 "
    Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
     49969 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
     49970 <1234 > <1234 12> | 13: END
    Match successful!
    Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 "
    Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
     49974 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
     49975 <1234 > <1234 12> | 13: END
    Match successful!
    Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 "
    Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
     49979 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
     49980 <1234 > <1234 12> | 13: END
    Match successful!
    Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 "

    (This is taken from near the end of the debugger output, to keep the data sections small.)

    So unfortunately I do not see anything in this output that hints at the additional memory consumption, except possibly the "savestack=6", but I guess that is the same on every other perl interpreter. I'll try compiling Perl conservatively with an old GCC and a generic CPU architecture (maybe the new GCC does some wasteful alignment for the Core2 architecture).
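Before rebuilding, it may help to record what the current perl binary was compiled with, so the two builds can be compared afterwards. A minimal sketch using the core Config module follows; the selection of keys is just a suggestion, and a key such as gccversion can be empty if perl was not built with GCC.

```perl
use strict;
use warnings;
use Config;

# Print the compiler-related settings this perl binary was built with.
for my $key (qw(archname cc gccversion optimize ccflags)) {
    my $value = defined $Config{$key} ? $Config{$key} : '(not set)';
    printf "%-12s %s\n", $key, $value;
}
```

The same values are available from the command line, e.g. perl -V:optimize or perl -V:ccflags.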

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU
