http://www.perlmonks.org?node_id=887014


in reply to search a large text file

I put together an example in case you want to use PostgreSQL:

The file I used is available here:

ftp://ftp.ncbi.nih.gov/genbank/livelists

It's similar to yours, but it has three columns.

I unzipped it and loaded it into Postgres, in a table t; there are more than 223 million rows.

$ ls -lh GbAccList.0206.2011
-rw-rw-r-- 1 aardvark aardvark 4.6G Feb 8 17:21 GbAccList.0206.2011

$ head -n 3 GbAccList.0206.2011
AACY024124353,1,129566152
AACY024124495,1,129566175
AACY024124494,1,129566176

$ time < GbAccList.0206.2011 psql -qc "
    create table t (c text, i1 integer, i2 integer);
    copy t from stdin csv delimiter E',';"
real    3m47.448s

$ time echo "
    create index t_i2_idx on t (i2);
    analyze t;" | psql -q
real    5m50.291s
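If you'd rather drive the same load from Perl instead of the shell, DBD::Pg exposes the COPY protocol directly. A minimal sketch; the dbname and connection details are placeholders, adjust for your setup:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connection details are placeholders; adjust for your setup
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

$dbh->do('create table t (c text, i1 integer, i2 integer)');
$dbh->do("copy t from stdin csv delimiter E','");

# stream the file through the COPY protocol, line by line
open my $fh, '<', 'GbAccList.0206.2011' or die "open: $!";
while (my $line = <$fh>) {
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend;
close $fh;

$dbh->do('create index t_i2_idx on t (i2)');
$dbh->do('analyze t');

COPY streams the rows in bulk, which is far faster than issuing 223 million individual INSERTs; building the index after the load (as above) is also quicker than maintaining it during the load.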

Searches are now around a tenth of a millisecond:

# 5 'random' searches like:
echo "explain analyze select * from t where i2 = $gi;" | psql

Just showing the timings of five searches:

Index Cond: (i2 = 2017697)   Total runtime: 0.157 ms
Index Cond: (i2 = 6895719)   Total runtime: 0.109 ms
Index Cond: (i2 = 3193323)   Total runtime: 0.119 ms
Index Cond: (i2 = 8319666)   Total runtime: 0.091 ms
Index Cond: (i2 = 1573171)   Total runtime: 0.119 ms
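Doing the lookups from Perl is just as easy with a prepared statement, along these lines (again, connection details are placeholders; the gi values are just the ones from the timings above):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connection details are placeholders; adjust for your setup
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

# prepare once, execute many times; the index on i2 does the work
my $sth = $dbh->prepare('select c, i1, i2 from t where i2 = ?');

for my $gi (2017697, 6895719, 3193323, 8319666, 1573171) {
    $sth->execute($gi);
    while (my $row = $sth->fetchrow_arrayref) {
        print join(',', @$row), "\n";
    }
}

$dbh->disconnect;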

Of course, performance depends on the hardware used.

(a similar problem/solution here: Re^3: sorting very large text files (slander))

Re^2: search a large text file
by BrowserUk (Patriarch) on Feb 08, 2011 at 17:35 UTC

    Nice one again++ :)