Re: search a large text file

by erix (Vicar)
on Feb 08, 2011 at 17:27 UTC


in reply to search a large text file

I put together an example in case you want to use PostgreSQL.

The file I used is available here:

ftp://ftp.ncbi.nih.gov/genbank/livelists

It's similar to yours, but it has three columns.

I unzipped it and loaded it into PostgreSQL, in a table t; there are more than 223 million rows.

$ ls -lh GbAccList.0206.2011
-rw-rw-r-- 1 aardvark aardvark 4.6G Feb 8 17:21 GbAccList.0206.2011

$ head -n 3 GbAccList.0206.2011
AACY024124353,1,129566152
AACY024124495,1,129566175
AACY024124494,1,129566176

$ time < GbAccList.0206.2011 psql -qc "
    create table t (c text, i1 integer, i2 integer);
    copy t from stdin csv delimiter E',';"
real 3m47.448s

$ time echo "
    create index t_i2_idx on t (i2);
    analyze t;" | psql -q
real 5m50.291s

Searches are now around a tenth of a millisecond:

# 5 'random' searches like:
echo "explain analyze select * from t where i2 = $gi;" | psql

Just showing the timings of five searches:

Index Cond: (i2 = 2017697)
Total runtime: 0.157 ms
Index Cond: (i2 = 6895719)
Total runtime: 0.109 ms
Index Cond: (i2 = 3193323)
Total runtime: 0.119 ms
Index Cond: (i2 = 8319666)
Total runtime: 0.091 ms
Index Cond: (i2 = 1573171)
Total runtime: 0.119 ms
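
For anyone who wants to repeat the timing run, here is a rough sketch of how the searches could be scripted. It is not the exact script I used: build_queries is a hypothetical helper name, and it assumes the values passed in are plain integers, since they are pasted into the SQL as-is.

```shell
#!/bin/sh
# Hypothetical helper: print one "explain analyze" statement per GI value
# given on the command line. It only builds SQL text; nothing is executed here.
# Assumption: every argument is a plain integer (no quoting or escaping is done).
build_queries() {
    for gi in "$@"; do
        printf 'explain analyze select * from t where i2 = %s;\n' "$gi"
    done
}

# Show the statements for the first two GI values from the timings above.
build_queries 2017697 6895719
```

To actually run the searches, pipe the statements into psql and keep only the two lines of interest, assuming psql connects to the database that holds table t:

build_queries 2017697 6895719 | psql | grep -E 'Index Cond|Total runtime'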

Of course, performance depends on the hardware used.

(a similar problem/solution here: Re^3: sorting very large text files (slander))


Re^2: search a large text file
by BrowserUk (Pope) on Feb 08, 2011 at 17:35 UTC

    Nice one again++ :)
