Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

Re^3: sorting very large text files (slander)

by erix (Prior)
on Dec 21, 2009 at 20:39 UTC ( [id://813757]=note: print w/replies, xml ) Need Help??

in reply to Re^2: sorting very large text files
in thread sorting very large text files

BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.

Sort will obviously always be faster than using a database (here, PostgreSQL) for loading and querying for the same result, but I want to show that you exaggerate the overhead of loading+quering into an output file. Indexing would be redundant, but I bet even with an index it wouldn't take that much time.

I constructed a similar file, and sorted by the dna column in the two ways under consideration, (sort versus database slurp+query).

Result: System sort: 59 minutes. Postgres load (6m) + select (59m): 66 minutes
$ ls -lh dna2.txt -rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt 1588 mm_ref_chrY.fa 2902494 R CTTGACTTTTATGTGTCACTGTGTTCAGTT +CCTGGT 939 mm_ref_chrY.fa 2902495 F CCAGGAACTGACCACAGTGACACATAAGAG +TCAAAT 742 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC +AAATAG 1097 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC +AAATAG 100 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA +GTAGTG 1184 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA +GTAGTG 286 mm_ref_chrY.fa 2902505 F ACCACAGTGCCACATAAAAGTCAAGTAGGG +AATCCT 235 mm_ref_chrY.fa 2902513 R ACCAGGACAGGATCCCCTACTTGACTTTTA +TGTGGC 1744 mm_ref_chrY.fa 2902516 F ACATAAAAGTCAAGTAGTGGACCCTGTCCT +GGTCTG 1029 mm_ref_chrY.fa 2902519 F TAAAAGTCAAGTAGGGGATCCTGTCCTGGT +CTGGCA # # bash + sort version: # $ time sort -k5 dna2.txt > dna2.sortk5.out real 59m48.641s # # Postgres version: # # # loading the data: # $ time < dna2.txt \ psql -d test -c " drop table if exists dna_test; create table dna_test( y integer, chromosome text, genomic_location integer, direction text, seq text); copy dna_test from stdin csv delimiter E'\t'; " ; real 6m20.430s echo " select to_char(count(*), '999G999G999') as rowcount from dna_test" | psql -d test rowcount -------------- 181,261,572 (1 row) # # querying a resultset into a sorted file: # $ time echo " copy (select * from dna_test order by seq) to stdout" \ | psql -1qtAd test > dna_test.order_by_seq.out real 59m12.569s


     unix sort: real    59m48.641s
table ORDER BY: real    59m12.569s  
the latter preceded by   6m20.430s  overhead for loading table data

So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...

It almost amounts to slander :)

Update 1

Of course, I couldn't resist to trying with an index as well. and sure enough it's useless / not used:

$ time echo " create index dna_test_seq_idx on dna_test (seq) " | psql -d test CREATE INDEX real 63m20.451s

63m to create the index - that makes sense.

But of course, Pg will not use it:

$ time echo " explain analyze select * from dna_test order by seq " | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt real 57m44.054s $ cat dna_test.order_by_seq.explain_analyze.txt Sort (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual +time=1834228.291..3428773.879 rows=181261572 loops=1) Sort Key: seq Sort Method: external merge Disk: 12933032kB -> Seq Scan on dna_test (cost=0.00..3872406.68 rows=181261568 widt +h=62) (actual time=17.966..65802.621 rows=181261572 loops=1) Total runtime: 3463580.243 ms

update 2: removed a useless use of cat, lest I receive another uuc-award...

Replies are listed 'Best First'.
Re^4: sorting very large text files (slander)
by BrowserUk (Patriarch) on Dec 21, 2009 at 20:56 UTC
    It almost amounts to slander :)

    Nice one! A DB answer with real code and data. A rare animal.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://813757]
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-06-22 23:29 GMT
Find Nodes?
    Voting Booth?

    No recent polls found

    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.