Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Schwartzian Transform and memory allocation.

by ThingFish (Beadle)
on Jun 19, 2002 at 16:00 UTC ( #175743=perlquestion: print w/ replies, xml ) Need Help??
ThingFish has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple script that reads a small file (115k, 759 lines, 40 tab delimited columns per line) into an array and uses the Schwartzian Transform to sort on any given column. When I use this transform my process peaks out at approximately 5160000K, 40K over my 5MB of accessible address space (default on a chrooted ZEUS server, so Im told). The address space can only be bumped up another 2MB, so if my data source gets much larger, which it will, I'll be up the creek. Im surprised that I need 5MB to sort this little file, but at any rate I need another solution. Any ideas?
open FILE, "<$data_file"; @file_data = <FILE>; close FILE; @file_data = map { $_->[0] } sort { $b->[0] cmp $a->[0] } map { [ $_, (split(/\t/)) ] } @file_data; print qq(<TABLE BORDER="1">); for(@file_data){ print qq(<TR>); for(split(/\t/)){print qq(<TD>$_</TD>)}; print qq(</TR>); } print qq(</TABLE>);

Comment on Schwartzian Transform and memory allocation.
Download Code
Re: Schwartzian Transform and memory allocation.
by marvell (Pilgrim) on Jun 19, 2002 at 16:07 UTC
    Will skipping the whole initial @file_data thing and putting the <FILE> in the transform make a difference?

    --
    Steve Marvell

Re: Schwartzian Transform and memory allocation.
by caedes (Pilgrim) on Jun 19, 2002 at 16:13 UTC
    This sounds like a case for using a SQL database.
Re: Schwartzian Transform and memory allocation.
by runrig (Abbot) on Jun 19, 2002 at 16:16 UTC
    Instead of creating a temporary array reference, you can instead prepend the sort key to the data itself, sort that, then take the sort key off at the end, I think that's the Guttman/Rossler transform, which is a variation on the 'Schwartzian transform'. BTW, your code doesn't appear to be sorting on any particular column, it appears to just be doing a plain sort on the data.
(jeffa) Re: Schwartzian Transform and memory allocation.
by jeffa (Chancellor) on Jun 19, 2002 at 16:16 UTC
Re: Schwartzian Transform and memory allocation.
by thelenm (Vicar) on Jun 19, 2002 at 16:18 UTC
    It looks like you might not be sorting what you think you're sorting... you're sorting based on the entire line. Do you mean to sort on a specific column? If so, then change
    sort { $b->[0] cmp $a->[0] }
    to
    sort { $b->[$i] cmp $a->[$i] }
    where $i is the 1-based column number that you want to sort on.

    I don't think that will keep it from being a memory hog, though...

    -- Mike

    --
    just,my${.02}

Re: Schwartzian Transform and memory allocation.
by ThingFish (Beadle) on Jun 19, 2002 at 16:24 UTC
    In response to runrig's comment that the code posted "appears to just be doing a plain sort on the data", sorting on [0] was a little misleading. I do need to sort on other columns, my example just used the first. As a side note, a plain old sort would not bump my memory limit (at this file size anyway), but I need a more complex sort, hence my use of the Schwartz (Spaceballs, anyone?).
Space/time tradeoff, but you don't have space...
by RMGir (Prior) on Jun 19, 2002 at 16:26 UTC
    The ST is a space/time optimization, trading using extra space in order to save time.

    Since you DON'T have the extra space, why not just eat the extra time and do the split in your comparison?

    It might be worth benchmarking; it's going to be slower than ST or GRT would be, sure, but maybe it's tolerable for the data set sizes you're talking about. Sorting 759 items isn't THAT bad...

    $ perl -e'for(1..759){push @x,rand}; @x=sort {$count++; $a <=> $b} @x; + print "$count\n"' 6783

    --
    Mike
Re: Schwartzian Transform and memory allocation.
by Abigail-II (Bishop) on Jun 19, 2002 at 16:28 UTC
    I am confused. Why are you using a Schwartzian Transform? Your first map puts $_ as the first element of the anon array, it's the first element that's being used in the sort, and that's also that remains after the last map. So, effectively, you are splitting, and then just discarding the results.

    As for the memory usage, you should realize that during the sort, you have 759 * 42 scalars, and an additional 759 arrays. Each scalar has some overhead (a couple of dozens of bytes), arrays have even more. It doesn't look much, but since you have to multiply, it does add up. It might easily take 1.5 Mb. And then there's the perl binary itself. I don't know how much it takes on your server, but it can easily consume a few Mb itself (this varies from system to system).

    Abigail

Re: Schwartzian Transform and memory allocation.
by ThingFish (Beadle) on Jun 19, 2002 at 16:30 UTC
    More clarification here.. I do actually use a variable as an indice, as thelenm pointed out. The column to be sorted on is passed via query string and is plugged into the sort. Ex:
    @file_data = map { $_->[0] } sort { $a->[$variable] cmp $b->[$variable] } map { [ $_, (split(/\|/)) ] } @file_data;
      In that case, you could easily save some memory:
      @file_data = map { $_ -> [0]} sort { $a -> [1] cmp $b -> [1]} map {[$_ => (split /\|/) [$variable - 1]]} @file_data;
      No need to use a 759 41 element arrays if you can do it with just 2 elements!

      Abigail

        Modifying my original code as per Abigail-II's suggestion dropped memory usage an entire meg. Abigail-II++.
      Why split once for the Schwartzian Transform then again for the output?
      open FILE, "<$data_file"; my @file_data = sort { $a->[$variable-1] cmp $b->[$variable-1] }, map [ split(/\|/) ], <FILE>; close FILE; print qq(<TABLE BORDER="1">); for my $row (@file_data){ print ( qq(<TR>), map(qq(<TD>$_</TD>), @$row), qq(</TR>) ); } print qq(</TABLE>);
      ____________
      Makeshifts last the longest.
        I think the

        sort { $a->[$variable-1] cmp $b->[$variable-1] },

        line should be

        sort { $a->[$variable] cmp $b->[$variable] },

        The -1 was added by abigail to skip over the $_ that was tacked on at the beginning of the anon array. You're no longer doing that.

        --

        flounder

Re: Schwartzian Transform and memory allocation.
by shotgunefx (Parson) on Jun 19, 2002 at 16:37 UTC
    If you are really worried about future size and don't want to go to a full blown RDBM, why not use DB_File to create a temp database in /tmp and use a custom BTREE for the sort.

    -Lee

    "To be civilized is to deny one's nature."
Re: Schwartzian Transform and memory allocation.
by Matts (Deacon) on Jun 19, 2002 at 21:06 UTC
Re: Schwartzian Transform and memory allocation.
by grantm (Parson) on Jun 19, 2002 at 21:55 UTC
    If you translated your data to XML, you could sort it with XML::Filter::Sort (which I uploaded to CPAN a few days ago). One feature of this module is that you can specify the amount of memory (not including interpreter overheads) that the sort can use. Any overflow will be handled by using temporary sort/merge files.

    I'm not going to pretend it would be fast though - probably several orders of magnitude slower than what you're doing now.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://175743]
Approved by FoxtrotUniform
Front-paged by SparkeyG
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (16)
As of 2014-07-25 20:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (175 votes), past polls