http://www.perlmonks.org?node_id=813714


in reply to sorting very large text files

Isn't this the kind of job where a database system like PostgreSQL, MySQL and the sort will perform better?
Just wondering.

Replies are listed 'Best First'.
Re^2: sorting very large text files
by BrowserUk (Patriarch) on Dec 21, 2009 at 15:02 UTC

    No! It'll take longer to just import the data into the RDBMS than your local system sort utility will take to complete.

    And after that, you've still got to index the data, perform the sorting query and output the data sorted.

    Allow for anything from 5 to 10 times as long to achieve the final goal of sorted data in a single file on disk. Maybe more.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.

      Sort will obviously always be faster than using a database (here, PostgreSQL) for loading and querying for the same result, but I want to show that you exaggerate the overhead of loading+quering into an output file. Indexing would be redundant, but I bet even with an index it wouldn't take that much time.

      I constructed a similar file, and sorted by the dna column in the two ways under consideration, (sort versus database slurp+query).

      Result: System sort: 59 minutes. Postgres load (6m) + select (59m): 66 minutes
      $ ls -lh dna2.txt -rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt 1588 mm_ref_chrY.fa 2902494 R CTTGACTTTTATGTGTCACTGTGTTCAGTT +CCTGGT 939 mm_ref_chrY.fa 2902495 F CCAGGAACTGACCACAGTGACACATAAGAG +TCAAAT 742 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC +AAATAG 1097 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC +AAATAG 100 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA +GTAGTG 1184 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA +GTAGTG 286 mm_ref_chrY.fa 2902505 F ACCACAGTGCCACATAAAAGTCAAGTAGGG +AATCCT 235 mm_ref_chrY.fa 2902513 R ACCAGGACAGGATCCCCTACTTGACTTTTA +TGTGGC 1744 mm_ref_chrY.fa 2902516 F ACATAAAAGTCAAGTAGTGGACCCTGTCCT +GGTCTG 1029 mm_ref_chrY.fa 2902519 F TAAAAGTCAAGTAGGGGATCCTGTCCTGGT +CTGGCA # # bash + sort version: # $ time sort -k5 dna2.txt > dna2.sortk5.out real 59m48.641s # # Postgres version: # # # loading the data: # $ time < dna2.txt \ psql -d test -c " drop table if exists dna_test; create table dna_test( y integer, chromosome text, genomic_location integer, direction text, seq text); copy dna_test from stdin csv delimiter E'\t'; " ; real 6m20.430s echo " select to_char(count(*), '999G999G999') as rowcount from dna_test" | psql -d test rowcount -------------- 181,261,572 (1 row) # # querying a resultset into a sorted file: # $ time echo " copy (select * from dna_test order by seq) to stdout" \ | psql -1qtAd test > dna_test.order_by_seq.out real 59m12.569s

      So:

           unix sort: real    59m48.641s
      table ORDER BY: real    59m12.569s  
      the latter preceded by   6m20.430s  overhead for loading table data
      
      

      So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...

      It almost amounts to slander :)

      Update 1

      Of course, I couldn't resist to trying with an index as well. and sure enough it's useless / not used:

      $ time echo " create index dna_test_seq_idx on dna_test (seq) " | psql -d test CREATE INDEX real 63m20.451s

      63m to create the index - that makes sense.

      But of course, Pg will not use it:

      $ time echo " explain analyze select * from dna_test order by seq " | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt real 57m44.054s $ cat dna_test.order_by_seq.explain_analyze.txt Sort (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual +time=1834228.291..3428773.879 rows=181261572 loops=1) Sort Key: seq Sort Method: external merge Disk: 12933032kB -> Seq Scan on dna_test (cost=0.00..3872406.68 rows=181261568 widt +h=62) (actual time=17.966..65802.621 rows=181261572 loops=1) Total runtime: 3463580.243 ms

      update 2: removed a useless use of cat, lest I receive another uuc-award...

        It almost amounts to slander :)

        Nice one! A DB answer with real code and data. A rare animal.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.