Re^5: Sorting a (very) large file (better*2)

by samtregar (Abbot)
on Nov 30, 2007 at 20:19 UTC [id://654195]


in reply to Re^4: Sorting a (very) large file (better*2)
in thread Sorting a (very) large file

God damn it, you're going to make me actually try it, aren't you?

UPDATE: What are you talking about? My solution wasn't an ST!

-sam


Replies are listed 'Best First'.
Re^6: Sorting a (very) large file (better*2)
by samtregar (Abbot) on Nov 30, 2007 at 20:32 UTC
    Damn, that didn't work at all. After creating a suitably sized file with:

       perl -le 'for (0 .. shift()) { $num = int rand(100_000); print qq{12-25-2005\t12:30 PM\t$num\tC:\\someplace\\somefile.txt}; }' 8500000 > file.txt

    I found that just loading the file into memory took up 1.7GB on my system! I tried a few variations without getting anything reasonable. I haven't really done a lot of Perl dev on this box - it's a Fedora 8 machine with the Fedora-built v5.8.8. Could be there's something wrong with it.

    -sam

      this script...

         $| = 1;                         # autoflush, so progress prints show up immediately
         my @data;
         $#data = 8500000;               # pre-extend the array to its final size...
         $#data = -1;                    # ...then empty it, keeping the allocation
         while (<>) { push @data, $_ }   # pushes now never trigger a grow-and-copy
         print "sorting...\n";
         use Sort::Key qw(ukeysort_inplace);
         ukeysort_inplace { (split(/\t/))[2] } @data;
         print "sorted\n";
         sleep 100;                      # keep the process alive so memory use can be inspected

      ... uses 800MB for loading the data and then 50MB more for sorting!
        Good call - I should have thought to pre-extend the array!

        -sam

        This made me laugh out loud. Without pre-extending, the main difference is a (likely) hole in the heap of size 2**(n-1), left just after the array was last doubled to make it big enough to hold the entire file contents, plus the slop at the end of the array between the number of lines and 2**n (the final size of the array). Concretely, assuming that doubling, 8,500,000 lines push the array to 2**24 = 16,777,216 slots, leaving roughly 8.3 million unused slots of slop plus an abandoned 2**23-slot block from the previous doubling. In other words, two "dead" spans of virtual memory that will (at least mostly) remain untouched during the sorting, and so will cause no page faulting and won't slow down the sorting at all (and will only moderately slow down the loading of the file contents into memory).

        So pre-extending shows that the lines being sorted can fit within 800MB of memory and so can be directly sorted without much if any page faulting. It cracks me up that samtregar threw up his hands in defeat thinking there's no point in trying to sort what won't fit in memory when the only extra page faulting caused by the process's virtual size exceeding physical memory had already taken place as part of filling the array. The rest of the sorting would only require pages totaling about 800MB.

        I also found it amusing that Sort::Key is doing pretty much exactly what I did except replacing my 3 lines of Perl code with a big pile of complex XS code.
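        (For reference, that pure-Perl shape is just an index sort over a small array of extracted numeric keys; a minimal sketch, assuming the lines are sitting in @data:)

            my @size = map { 0 + (split /\t/)[2] } @data;   # small array of numeric keys
            my @idx  = sort { $size[$a] <=> $size[$b] } 0 .. $#data;
            @data    = @data[@idx];                         # reorder the lines once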

        And then there is the problem that pre-extending (to the exact size) requires the file to be read twice, once just to count the lines; you can't discount the cost of that second pass unless you're overlooking how much slower disk is than memory.
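        Roughly that two-pass shape (a sketch; 'file.txt' is a hypothetical name, not anyone's posted code):

            open my $fh, '<', 'file.txt' or die "open: $!";
            my $lines = 0;
            $lines++ while <$fh>;              # pass 1: read the whole file just to count lines
            seek $fh, 0, 0 or die "seek: $!";  # rewind (this also clears the EOF condition)
            my @data;
            $#data = $lines - 1;               # pre-extend to the exact final size...
            $#data = -1;                       # ...then empty it, keeping the allocation
            push @data, $_ while <$fh>;        # pass 2: the actual load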

        Of course, condar may need to allocate a bit more paging space. I can't tell how much of his running out of virtual memory space was due to memory wasted by the ST or just due to having way too little paging space configured.

        - tye        

Re^6: Sorting a (very) large file (comparison)
by tye (Sage) on Nov 30, 2007 at 20:37 UTC

    Stress out much?

    You offered, instead of an ST, something that is slower but uses less memory. I prefer alternatives to the ST that tend to be about twice as fast (last time I bothered to time some) while also using a lot less memory.

    You mentioned the memory cost of temporary copies of each line. I don't think we should overlook the tons of anonymous arrays, either; they also inflict a significant memory cost in an ST.
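    For concreteness, the two shapes look something like this (a rough sketch using the thread's tab-delimited lines in @lines; the packed-key sort, often called a GRT, is the kind of alternative I mean):

        # A Schwartzian Transform builds one anonymous array per line:
        my @sorted = map  { $_->[1] }
                     sort { $a->[0] <=> $b->[0] }
                     map  { [ 0 + (split /\t/)[2], $_ ] } @lines;

        # A packed-key sort (GRT) prepends a fixed-width binary key instead,
        # lets the default string sort do the work, then strips the key;
        # no anonymous arrays at all:
        my @sorted2 = map  { substr $_, 4 }                        # drop the 4-byte key
                      sort                                         # plain string sort
                      map  { pack('N', (split /\t/)[2]) . $_ } @lines;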

    - tye        

      Nah, I'm cool. But we both lost on this one - as far as I can tell you can't even load the @lines array of a 400MB file in 1GB of RAM. Unless my Perl is busted, which is possible but seems a bit unlikely.

      -sam

        you can't even load the @lines array of a 400MB file in 1GB of RAM

        I believe there is this thing called "virtual memory"... In my original reply I had also noted:

        And I'd check how much paging space is configured. It sounds like that could be increased.

        Virtual memory can be very useful. Especially if the thing you sort fits in physical memory, so that you only run through paging the larger data a few times, rather than having sort page through everything something like O(log(N)) times.

        Hence my creation of a quite small array that is easy to sort. Actually, this makes me think of a minor change that might have a non-trivial impact on the size of the array that I sort:

        my @size= map 0 + ( split /\t/, $_ )[2], @in;
                # ^^^ store numbers, not strings
        my @idx= sort { $size[$a] <=> $size[$b] } 0 .. $#in;
        @in= @in[@idx];

        You could also play games to reduce the number of times you run through @in (each time causing paging), but I think the gains from that would be relatively minor in comparison while the complexity required is relatively great. Except, perhaps, for the last iteration:

        for( @idx ) { print $in[$_], $/; }

        (Though, I half expect Perl to do @in= @in[@idx]; in-place.)

        - tye        
