PerlMonks  

I'm finding that most databases are not happy with files of this size. I use Java on the server, so I don't know the Perl equivalent, but what works for fast access in a web application is to leave the TSV alone and build binary index files for the queried fields. Each column gets a subdirectory, and each distinct value gets a file holding a list of 64-bit offsets into the original TSV, one for each record that has that value. If your filesystem can handle that many files (ext3?), this works for lots of unique values and even for range searches, and of course you can sort the filenames to get the results back in a particular order.

Multi-column queries are handled by taking the intersection (AND) or union (OR) of the pointer lists. For AND, it helps to put a little effort into figuring out which column/value has the smallest list and starting from there. Once you have the final list of pointers, you can use RandomAccessFile to seek to the corresponding records quickly and add them to the response.

I know this sounds like building from scratch, but search engines use a similar technique, and I have spent far less time doing it the right way than softening my head on various DBMSs and their nuances. It is also very memory friendly and fast. This works well for query applications: I rebuild 7 column indexes on a 20 GB TSV file and it is good to go, so putting up an updated TSV is fairly trivial too.
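To make the idea concrete, here is a minimal sketch in Python (not the Java I actually run, and not Perl either). Everything in it is an assumption for illustration: the function names `build_index`, `read_offsets` and `query_and` are made up, the TSV is assumed to have a header row, and values are hex-encoded into filenames so that any value is a legal name while byte order is preserved for sorting. For brevity it collects all offsets in memory before writing, which a real build over a 20 GB file would want to avoid, and very long values would exceed filename length limits.

```python
import os
import struct
from collections import defaultdict


def build_index(tsv_path, index_dir, columns):
    """Write one file per (column, value) holding 64-bit record offsets."""
    postings = {c: defaultdict(list) for c in columns}
    with open(tsv_path, "rb") as tsv:
        # Assumption: the first line is a header naming the columns.
        header = tsv.readline().rstrip(b"\r\n").split(b"\t")
        col_pos = {c: header.index(c.encode()) for c in columns}
        while True:
            offset = tsv.tell()  # byte position where this record starts
            line = tsv.readline()
            if not line:
                break
            fields = line.rstrip(b"\r\n").split(b"\t")
            for c in columns:
                postings[c][fields[col_pos[c]]].append(offset)
    for c in columns:
        col_dir = os.path.join(index_dir, c)
        os.makedirs(col_dir, exist_ok=True)
        for value, offsets in postings[c].items():
            # "v" prefix keeps the name non-empty for empty values.
            name = "v" + value.hex()
            with open(os.path.join(col_dir, name), "wb") as f:
                f.write(struct.pack(f">{len(offsets)}q", *offsets))


def read_offsets(index_dir, column, value):
    """Return the list of record offsets stored for column == value."""
    path = os.path.join(index_dir, column, "v" + value.encode().hex())
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(f">{len(data) // 8}q", data))


def query_and(tsv_path, index_dir, criteria):
    """AND query. criteria maps column -> value and must not be empty."""
    lists = sorted(
        (read_offsets(index_dir, c, v) for c, v in criteria.items()),
        key=len,
    )
    result = set(lists[0])  # start from the smallest pointer list
    for other in lists[1:]:
        if not result:
            break  # nothing can match any more
        result &= set(other)
    records = []
    with open(tsv_path, "rb") as tsv:
        for offset in sorted(result):
            tsv.seek(offset)  # random access straight to the record
            line = tsv.readline().rstrip(b"\r\n")
            records.append(line.decode().split("\t"))
    return records
```

An OR query would be the same with a set union instead of the intersection, and a range search would list the column's directory, sort the filenames, and merge the files that fall inside the range.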

In reply to Re: Better way to work with large TSV files? by Anonymous Monk
in thread Better way to work with large TSV files? by radiantmatrix
