PerlMonks  

Store a huge amount of data on disk

by Sewi (Friar)
on Oct 18, 2011 at 15:31 UTC ( [id://932175] )

Sewi has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I need to store a huge amount of data having a fixed structure:

  • Each item has a unique (alphanumeric, 7-bit ASCII) id
  • A fixed number of "meta" information fields contain numbers or text data up to 100 bytes (worst case, usually <30 bytes)
  • meta information won't change once the item has been created
  • Each item has two text parts, usually 2-16 KB in size, sometimes a few MB, but sizes up to 2 GB have to be supported
  • The text parts are delivered in blocks up to a predefined size limit (currently about 16 MB, though it could be changed to anything down to ~1 KB if storage requires it); a typical block is currently about 1900 bytes
  • The final text part size is unknown, as is the number of blocks
  • The blocks may not arrive in sequential order, but each carries a sequence number starting from zero per item, and every sequence number is used
  • Up to 10 million items should be stored at the same time, maybe more in the future
  • About 90% of the items may be deleted some weeks after they were created
  • Some of the remaining items are deleted later; a few are kept forever
  • Each item must be accessible quickly by unique item id
  • Deletion of items may be really slow
  • I considered using MongoDB, but it becomes slow at 15+ million items and has a 16 MB limit per item. MySQL can't handle this amount either. I'd like to store the data in files, but avoid one file per item, as that many files are hard for filesystems to handle.

    I considered tie and GDBM_File, which is rock solid for reading: I could store many items in one file, delete them, and append/insert text blocks as they arrive. But GDBM is problematic when more than one process writes to the same file, and I can't guarantee that no two processes will ever write to the same file while new text blocks are arriving for different messages.
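    For reference, a minimal sketch of the tie/GDBM_File approach I was considering (the file path and the key naming scheme are just placeholders, not an existing schema):

      use strict;
      use warnings;
      use GDBM_File;

      # Single-writer sketch; concurrent writers are exactly the open concern above.
      tie my %store, 'GDBM_File', '/data/items.gdbm', &GDBM_WRCREAT, 0640
          or die "Cannot tie GDBM file: $!";

      my $id = 'a1b2c3d4e5f6a7b8';    # placeholder item id

      # Meta data is written once when the item is created.
      $store{"$id.meta"} = join "\t", 'field1', 'field2';

      # Text blocks are stored under their sequence number as they arrive,
      # not necessarily in order.
      $store{"$id.text1.0"} = 'first block of text part 1';
      $store{"$id.text1.2"} = 'block 2 may arrive before block 1';

      untie %store;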

    Any suggestions?

    Replies are listed 'Best First'.
    Re: Store a huge amount of data on disk
    by erix (Prior) on Oct 18, 2011 at 15:43 UTC

      You omit a crucial piece of information: what is the typical query? What (how much) does it retrieve and how fast does it need to be? ('quickly' is not very informative...)

      I don't see anything in the specification that rules out the most obvious solution, PostgreSQL.

      Update: text values in postgres do have a limit of 1 GB (see the manual).

        Typical query is "one item by id"; no queries other than "by id" are required.

        The deletion cronjob may crawl through all objects to find deletion candidates.

        Do you think Postgres would handle that amount of data? I used it recently for an analysis of a few million shorter records (MySQL profiling data :-) ) and felt like it got slower when importing/dealing with a large number of rows. I'll try...

          Whether it is fast enough depends, I think, as much on the disks on your system as on the software that you'll use to write to them.

          From what you mentioned I suppose the total size to be something like 300 GB? It's probably useful/necessary (for Postgres, or any other RDBMS) to have some criterion (date, perhaps) by which to partition.

          (FWIW, a 40 GB table that we use intensively, accessed by unique id, gives access times of less than 100 ms. The system has 32 GB of RAM and an 8-disk RAID 10 array.)

          Btw, PostgreSQL *does* have a limit for text column values (1 GB, where you need 2 GB), but I suppose that could be avoided by splitting the value or something like that.
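          For the "one item by id" access itself, plain DBI with DBD::Pg is all you need; a minimal sketch (the items table and its column names here are just placeholders):

            use strict;
            use warnings;
            use DBI;

            # Example schema: items(id, meta, text1, text2) -- names are illustrative only.
            my $dbh = DBI->connect('dbi:Pg:dbname=items', 'user', 'secret',
                { RaiseError => 1, AutoCommit => 1 });

            my $sth = $dbh->prepare(
                'SELECT meta, text1, text2 FROM items WHERE id = ?');

            # Fetch one item by its unique id; returns undef if it is not there.
            sub fetch_item {
                my ($id) = @_;
                $sth->execute($id);
                return $sth->fetchrow_hashref;
            }

            my $item = fetch_item('8fbe7eb8c04c744406cca0aeb67e4f7f');
            print $item->{meta}, "\n" if $item;

            $dbh->disconnect;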

    Re: Store a huge amount of data on disk
    by BrowserUk (Patriarch) on Oct 18, 2011 at 15:37 UTC
      Each item has a unique (alphanumeric, 7-bit ASCII) id

      How long? (I.e., what range?)

        About 16 to 32 bytes; any limit >= 16 bytes would be OK and could still be applied.

        I should be able to switch this to a 64-bit integer if required, but I prefer the current alphanumeric ids.

          Sounds like you're indexing your data by a hex-encoded digest?

          Given that you have 3 variable and possibly huge chunks -- which most RDBMSs handle by writing to the filesystem anyway -- associated with each index key, and your selection criteria are both fixed and simple, I'd use the filesystem.

          Subdivide the key into chunks that make individual directories contain at most a reasonable number of entries and then store the 3 sections in files at the deepest level.

          By splitting a 32-byte hex digest into 4-char chunks, no directory has more than 65,536 (16^4) entries, and in practice directories below the first level or two will hold far fewer. The file-system cache will cache the lower levels and the upper levels will be both fast to read from disk and quick to search. Especially if your file-system hashes its directory entries.

          I'd write the individual chunks of the two text parts in separate files unless they will always be loaded as a single entity, in which case it might be slightly faster to concatenate them.

          Overall, given a digest of 8fbe7eb8c04c744406cca0aeb67e4f7f, I'd lay the directory structure out like this:

          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/meta.txt
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text1.000
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text1.001
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text1.002
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text1....
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text2.000
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text2.001
          /data/8fbe/7eb8/c04c/7444/06cc/a0ae/b67e/4f7f/text2....
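          A minimal Perl sketch of that layout (the /data root, the text<part>.<seq> naming and the helper names are only for illustration):

            use strict;
            use warnings;
            use File::Path qw(make_path);

            # Split a 32-char hex digest into 4-char chunks and join them into a path.
            sub item_dir {
                my ($digest) = @_;
                my @chunks = $digest =~ /(.{4})/g;
                return join '/', '/data', @chunks;
            }

            # Store one incoming block of text part $part (1 or 2) with
            # zero-based sequence number $seq in its own file.
            sub store_block {
                my ($digest, $part, $seq, $data) = @_;
                my $dir = item_dir($digest);
                make_path($dir) unless -d $dir;
                my $file = sprintf '%s/text%d.%03d', $dir, $part, $seq;
                open my $fh, '>', $file or die "Cannot write $file: $!";
                binmode $fh;
                print {$fh} $data;
                close $fh or die "Cannot close $file: $!";
            }

            store_block('8fbe7eb8c04c744406cca0aeb67e4f7f', 1, 0, 'first block');

          Deleting an item is then just File::Path::remove_tree on its directory, which can happen as slowly as you like.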

