in reply to
Better way to work with large TSV files?
I'm finding that most databases are not happy with these sizes. I use java on the server so I don't know the perl equivalant, but what works for fast access (web application) is to leave the tsv alone, then build binary index files for the queried fields.
Each column gets a subdirectory, and each value gets a file which is a list of 64 bit numbers into the original tsv for the corresponding record for that value.
If your filesystem can handle it (ext3?) then this works for lots of unique values, and even range searches, and of course you can sort the filenames to get the results back in a certain order.
Multi column queries are handled by using the intersection(AND) or union(OR) of a list of pointers. Putting a little effort into figuring out which column/value is the smallest for a starting point helps with AND.
Once you have your final list of pointers, you can use randomaccessfile and fetch the corresponding records quickly and add them to the response.
I know this sounds like building from scratch, but search engines use a similiar technique. And I have spent far less time doing it the right way than softening my head on various dbms's and related nuances. And it is very memory friendly and fast. This works well for query applications, and I rebuild 7 column indexes on a 20gig tsv file and it is good to go, so putting up an updated tsv is fairly trivial too.