Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

rjahrman has asked for the wisdom of the Perl Monks concerning the following question:

Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by davido (Cardinal) on Jul 24, 2004 at 05:13 UTC
I would use a database like SQLite instead of a multi-megabyte flat file punished with random access. You cannot insert data into the middle of a flat file. You can allocate a gigantic file and pre-subdivide it into fixed-length records of sufficient size that you'll never fill one up completely, but that's tricky and not scalable. You could use Tie::File to treat the file as an array, but doing massive amounts of mid-array inserts is very slow with a tied array, because again, it's really just working on a flat file behind the scenes. This really is a problem best dealt with via a database. I'm not positive SQLite is the best one for the job, but it is pretty easy to install, self-contained, and stores all of its data in one file. Dave
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by rjahrman (Scribe) on Jul 24, 2004 at 05:42 UTC
"You cannot insert data into the middle of a flat file." Is this an actual limitation, or are you saying that this is a bad idea? "I would use a database like SQLite" My concern is how the database would do this. Wouldn't it be doing the exact same thing? Also, since the only way to append to a BLOB that I've seen is to do an "update . . . set this_blob = concat(this_blob,new_int)", wouldn't that be even less efficient?	[reply]
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by davido (Cardinal) on Jul 24, 2004 at 05:48 UTC
Literally, you cannot INSERT (in other words, grow a file by adding something to the middle). You can only append to files, or overwrite what's in the middle. Disk operating systems don't grow files from the middle. So the commonly used solution is to read the file one line at a time, writing out to a new file one line at a time... when you get to the part where you want to insert, write out the new data, and then continue writing the remainder of the old data to the new file. When finished, replace the old file with the new one. This process is slow for big files with lots of 'inserts'. This is where databases make sense. Dave
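A minimal sketch of the copy-and-replace procedure described above, assuming a line-oriented text file. The helper name insert_line() and its arguments are hypothetical, not from the thread, and the rename step assumes a filesystem where renaming over an existing file is allowed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Insert $new_text before line number $at (1-based) of $path by copying the
# whole file to a temporary file and then renaming it over the original.
# insert_line() is a hypothetical helper name used only for illustration.
sub insert_line {
    my ($path, $at, $new_text) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";
    # Keep the temp file in the same directory so rename() stays on one filesystem.
    my ($out, $tmp) = tempfile(DIR => dirname($path));

    my $inserted = 0;
    while (my $line = <$in>) {
        if ($. == $at) {            # $. is the line number just read from $in
            print {$out} $new_text;
            $inserted = 1;
        }
        print {$out} $line;
    }
    # If $at is past the last line, the insert degenerates to an append.
    print {$out} $new_text unless $inserted;

    close $in;
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, $path or die "Cannot replace $path: $!";
    return;
}
```

Every call rewrites the entire file, which is why this gets slow for big files with many inserts.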
Re^4: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by rjahrman (Scribe) on Jul 24, 2004 at 06:01 UTC
Re^5: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by davido (Cardinal) on Jul 24, 2004 at 06:15 UTC
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by tilly (Archbishop) on Jul 24, 2004 at 16:10 UTC
If you want to know how a database could tackle a problem like this of mapping IDs to arbitrary information, read this article on BTrees. Then do as perrin said and use BerkeleyDB. That solves this problem in a highly optimized way, in C. If the dataset is large enough that it won't fit in RAM, then you probably want to ask it to build you a BTree rather than a hash. A hash is better if the data all fits in RAM.
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by perrin (Chancellor) on Jul 24, 2004 at 06:24 UTC
I would just use BerkeleyDB. It solves this exact problem and can handle terabyte-size databases.
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by Zaxo (Archbishop) on Jul 24, 2004 at 05:18 UTC
You haven't said how the files are named. I'd first try distributing them among subdirectories - about three deep. The bookkeeping for a single file with lots of references to offsets within seems excessive. One million files in a single directory does, too. After Compline, Zaxo
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by rjahrman (Scribe) on Jul 24, 2004 at 05:43 UTC
That's exactly what I was going to do . . . but enough of the files are <1KB to make clustering a concern.
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by matija (Priest) on Jul 24, 2004 at 06:33 UTC
Let's keep things in perspective: a million files of 4KB each is 4GB of space. Today, that costs about $4. Tomorrow, it will cost less. So unless you're talking about a legacy system, space is not that much of a concern. Speed and convenience of access should be, though. Managing a million files is going to be a major pain in the neck. Putting them into a directory tree with about three levels will reduce the server load, but it will still be a major complication in your program. I agree with the other people who said you should use a database. Yes, in a way a database is doing what you planned to do. However, the people who programmed that database spent a lot of time and used many sophisticated algorithms to get the database to do this stuff efficiently. Realistically speaking, if you use a database, this is maybe a couple of hours' work (mostly spent in learning the basics of SQL :-). If you decide on programming the whole thing yourself, you will spend days to weeks getting it right. If the cost of 4GB of space was a concern, what is the cost of several weeks of your work? What is the cost of the work of whoever has to maintain that system after you're done implementing it? Use a database.
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by rjahrman (Scribe) on Jul 24, 2004 at 07:36 UTC
The 1M was a number that I threw out. Realistically, that number will be much, much higher. While SQL is usually great, it is simply not what I need for the final product in this situation. What I need to do with the data is very specific and isn't what database engines are optimized for. Just to clear that up. :) I will most likely build the database in SQL, and then convert it to a flat-file. Thanks for the help!
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by BrowserUk (Patriarch) on Jul 24, 2004 at 15:47 UTC
Using a database (whether RDBMS or other) won't help you either save disk space or improve performance. If you write your binary data to it as BLOBs of some type, where each BLOB represents one file, then each BLOB will, in most DBs, be stored in one of two ways. Either as a separate file within the host filing system: a million files, a million cluster round-ups, no savings. Or as a fixed-size chunk (the maximum size for the type of BLOB) within a larger file, which effectively makes the cluster size whatever the maximum size is for the largest file you expect to store. If you store your numbers as individual rows in a table per file, you will have a million tables, which as often as not translates to a million files in the host filing system. But worse, being able to retrieve those numbers by position will require a second field in each row to record the position within the file, thus at least doubling the space requirement. More if you actually make that position field an index to speed access. Building your own index is equally unlikely to help. It takes at least a 4-byte integer to index a 4-byte integer, plus some way of indicating which file each belongs to. With a million files, that's at least 20 bits per number. And you still have to store the data. I would use a single file with a fixed-size chunk allocated to each file and store this in a compressing filesystem. (Or a sparse filesystem if you have one available.) I just wrote 1_000_000 x 4096-byte records, each containing a random number (0--1023) of random integers. The notionally 3.81 GB (4,096,000,000 bytes) file actually occupies 2.42 GB of disc space. So even though potentially half of every 'file' is empty, the compression compensates. It runs somewhat more slowly than an uncompressed file, for both the initial creation (I preallocated contiguous space) and random access, but not by much thanks to filesystem buffering. In any case, it will be considerably quicker than access via an RDBMS.
Even if your files can vary widely in used size, nulling the whole file before you start will allow the compression mechanism to reduce the 'wasted' space to a minimum. A 10 GB file containing only nulls requires less than 40MB to store. The best bit is that using a single file saves a million directory entries in the filesystem, and having to juggle a million filehandles with associated system buffers and data structures in RAM. A nice saving. You will have to remember the 'append point' for each of the files, but that is just a million 4/8-byte numbers. A single file of 4/8 MB. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
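The single-file layout with one fixed-size chunk per logical file and a table of append points could look roughly like the sketch below. The 4096-byte chunk size comes from the post; the helper names append_int() and read_ints(), the big-endian 'N' packing, and keeping the append points in an in-memory array are assumptions for illustration.

```perl
use strict;
use warnings;

use constant SLOT_BYTES => 4096;   # fixed chunk per logical 'file' (figure from the post)

# Append one 32-bit integer to logical file $fileno inside the big file.
# $used is an array ref holding the bytes used per slot (the 'append points').
# append_int() and read_ints() are hypothetical helper names.
sub append_int {
    my ($fh, $used, $fileno, $value) = @_;
    my $n = $used->[$fileno] // 0;
    die "slot $fileno is full\n" if $n + 4 > SLOT_BYTES;
    seek $fh, $fileno * SLOT_BYTES + $n, 0 or die "seek failed: $!";
    print {$fh} pack('N', $value) or die "write failed: $!";
    $used->[$fileno] = $n + 4;
    return;
}

# Return all integers stored so far for logical file $fileno.
sub read_ints {
    my ($fh, $used, $fileno) = @_;
    my $n = $used->[$fileno] // 0;
    return () unless $n;
    seek $fh, $fileno * SLOT_BYTES, 0 or die "seek failed: $!";
    my $got = read $fh, my $buf, $n;
    die "short read\n" unless defined $got && $got == $n;
    return unpack 'N*', $buf;
}
```

The handle is expected to be opened read/write in binary mode. Whether the unused part of each slot costs real disk space depends on the filesystem (compression or sparse-file support), as the post notes.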
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by tilly (Archbishop) on Jul 24, 2004 at 16:32 UTC
My expectation is that most databases would use a well-known datastructure (such as a BTree) to store this kind of data. Which avoids a million directory entries, and also allows for variable length data. I admit that an RDBMS might do this wrong. But I'd expect most of them to get it right first try. Certainly BerkeleyDB will. As for the "file with big holes" approach, only some filesystems implement that. Furthermore depending on how Perl was compiled and what OS you're on, you may have a fixed 2 GB limit on file sizes. With real data, that is a barrier that you're probably not going to hit. With your approach, the file's size will always be a worst case. (And if your assumption on the size of a record is violated, you'll be in trouble - you've recreated the problem of the second situation that you complained about in point 1.) I'd also be curious to see the relative performance with real data between, say, BerkeleyDB and "big file with holes". I could see it coming out either way. However I'd prefer BerkeleyDB because I'm more confident that it will work on any platform, because it is more flexible (you aren't limited to numerical offsets) and because it doesn't have the record-size limitation.
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by bgreenlee (Friar) on Jul 24, 2004 at 17:36 UTC
A 2GB filesize limit is definitely a problem with the big file approach. Two possible ways to avoid this if you still want to go this way. The obvious one: split the big file up into n files. This would also make the "growing" operation less expensive - if some subfiles aren't growing very much at all, you could actually decrease the size allocated to them at the same time you do the grow operation. Actually, if you wanted to get really spiffy, you could have it automatically split the big file in half when it hits some threshold...then split any sub-big files as they hit the threshold, etc... BerkeleyDB is definitely sounding easier...but I still think this would be a lot of fun to write! (Might be a good Meditation topic...there are times when you might want to just DIY because it would be fun and/or a good learning experience.) Brad
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by BrowserUk (Patriarch) on Jul 25, 2004 at 11:22 UTC
"My expectation is that most databases would use a well-known datastructure (such as a BTree) to store this kind of data. Which avoids a million directory entries, and also allows for variable length data. I admit that an RDBMS might do this wrong. But I'd expect most of them to get it right first try. Certainly BerkeleyDB will."

Using DB_File:

512,000,000 numbers appended randomly to one of 1,000,000 records indexed by `pack 'N', $fileno`:
    Actual data stored (1000000 * 512 * 4) : 1.90 GB
    Total filesize on disk                 : 4.70 GB
    Total runtime (projected based on 1%)  : 47 hours

512,000,000 numbers written one per record, indexed by `pack 'NN', $fileno, $position` (0 .. 999,999 / 0 .. 512 (ave)):
    Actual data stored (1000000 * 512 * 4) : 1.90 GB
    Total filesize on disk                 : 17.00 GB (Estimate)
    Total runtime (projected based on 1%)  : 80 hours* (default settings)
    Total runtime (projected based on 1%)  : 36 hours* ( cachesize => 100_000_000 )

(*) Projections based on 1% probably grossly under-estimate total runtime, as it was observed that even at these low levels of fill, each new .1% required longer than the previous. Further, I left the latter test running while I slept. It had reached 29.1% prior to leaving it. 5 hours later it had reached 31.7%. I suspect that it might never complete. Essentially, this bears out exactly what I predicted at Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help).
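For readers unfamiliar with the two key layouts in that benchmark, here is a rough sketch of their shape. A plain in-memory hash stands in for the hash tied to DB_File, which is an assumption made so the snippet runs without Berkeley DB installed; it says nothing about on-disk size or timings, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# A plain hash stands in for a hash tied to DB_File here (an assumption for
# illustration only); the key and value layouts are the point.
my (%by_file, %by_item, %count);

# Layout 1: one record per file; each new number is appended to that record.
# On a tied hash, .= fetches the whole record and stores it back again.
sub append_to_record {
    my ($fileno, $value) = @_;
    $by_file{ pack 'N', $fileno } .= pack 'N', $value;
    return;
}

# Layout 2: one record per number, keyed by (fileno, position).
sub store_item {
    my ($fileno, $value) = @_;
    my $pos = $count{$fileno}++;
    $by_item{ pack 'NN', $fileno, $pos } = pack 'N', $value;
    return;
}

for my $value (100, 200, 300) {
    append_to_record(7, $value);
    store_item(7, $value);
}

# Both layouts give back the same numbers in the same order.
my @layout1 = unpack 'N*', $by_file{ pack 'N', 7 };
my @layout2 = map { unpack 'N', $by_item{ pack 'NN', 7, $_ } } 0 .. 2;
print "@layout1\n@layout2\n";
```

Layout 2 stores an 8-byte key for every 4-byte value, which is consistent with the much larger file size reported for it.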
Re^4: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by Your Mother (Archbishop) on Jul 27, 2004 at 22:09 UTC
Re^5: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by BrowserUk (Patriarch) on Jul 27, 2004 at 22:44 UTC
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by BrowserUk (Patriarch) on Jul 24, 2004 at 17:35 UTC
Care to offer some code for comparison?
Re^4: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by tilly (Archbishop) on Jul 24, 2004 at 20:19 UTC
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by demerphq (Chancellor) on Jul 25, 2004 at 23:01 UTC
I'm confused, why wouldn't you just use a single table, with file_num, item_num and num_val as the data? Presuming that we can use four bytes per field we have 12 bytes per record. Thus 1 million records is ~12MB, assuming 100 records per file, we are looking at 120 MB no? My point here is that unless I'm missing something (which I suspect I am), neither of the ways you describe is how I would solve this problem with an RDBMS engine. BLOBs are a bad idea as they almost always allocate a full page (one cluster iirc) regardless of how big the BLOB is. And using millions of tables just seems bizarre as the overheads of managing the tables will be ridiculous. I suspect, but don't know for sure, that Sybase would be very unhappy with a DB with a million tables in it, but I know for sure that it is quite happy to have tables with 120 million records in them. --- demerphq First they ignore you, then they laugh at you, then they fight you, then you win. -- Gandhi
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by BrowserUk (Patriarch) on Jul 26, 2004 at 00:22 UTC
As described by the OP, there are 1,000,000(+) binary files containing (a variable number of) 4-byte integers, often less than 1kb, and usually less than 4kb. Assuming an average of 2kb/512 integers per file, that gives 2,048 * 1,000,000 = 1.9 GB. The aim was to save 'wasted disc space' due to cluster-size round-up. Any DB scheme that uses a single table and 2x 4-byte integer indices per number will require (minimum) 12 * 512 * 1,000,000 = 5.7 GB. The extra space is required because the two indices, fileno & itemno (position), are implicit in the original scheme, but must be explicit in the 'one table/one number per tuple' scheme. The other alternative I posed was to store each file (1..1024 4-byte integers) from the filesystem scheme as a LONGBLOB, thereby packing 1 file per tuple in the single table. Often BLOBs are stored as fixed-length records, each occupying the maximum record size allowed regardless of the length actually stored. Even when they are stored as LONGVARBINARY (4-byte length + length bytes), they are not stored in the main table file, but in a separate file with a 4-byte placeholder/pointer into the ancillary file. That's at least 12-bytes/file (fileno, pointer, length) * 1,000,000 extra bytes that need to be stored on disc somewhere. Any savings made through avoiding cluster round-up by packing the variable-length records into a single file are mostly lost here and in the main table file. In addition, as the OP pointed out, this scheme requires that each 'file' record be queried, appended to, and then re-written for each number added. A costly process relative to appending to the end of a named file. It's often forgotten that ultimately data stored in a database ends up in the filesystem (in most cases). Of course, in a corporate environment, that disc space may belong to someone else's budget and is therefore not a concern :) But if the aim is to save disc space (which may or may not be a legitimate concern--we don't know the OP's situation. Embedded systems?), then a DB won't help.
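The space comparison in that post can be checked with a few lines of arithmetic. The figures (1,000,000 files, an assumed average of 512 integers per file, 4 bytes per value, 12 bytes per row) are the ones from the post; treating 1 GB as 1024**3 bytes is an assumption that matches the quoted totals.

```perl
use strict;
use warnings;

my $files         = 1_000_000;
my $ints_per_file = 512;            # assumed average from the post (2 KB per file)
my $gib           = 1024 ** 3;      # the post's "GB" figures match 2**30 bytes

# Original scheme: only the 4-byte values; file and position are implicit.
my $raw   = $files * $ints_per_file * 4;
# One row per number: 4-byte fileno + 4-byte itemno + 4-byte value.
my $table = $files * $ints_per_file * 12;

printf "values only        : %.2f GB\n", $raw   / $gib;
printf "one row per number : %.2f GB\n", $table / $gib;
```

This counts only the raw field bytes; any per-row or per-page overhead a real database adds would come on top.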
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help) by mpeppler (Vicar) on Jul 26, 2004 at 10:58 UTC
Sybase could handle a million tables, but, as you say, the overhead (in syscolumns and sysobjects) would be tremendous. BLOBS would be a bad idea from the space management perspective, and would probably be a bit slow as well due to being stored on a different page chain. If you are using Sybase 12.5 or later and you know that the binary data will be less than a set amount (say 4k or so) then you could use a 4k or 8k page size on the server, and use a VARBINARY(4000) (for example) to store the binary data. This would be quite fast as it is stored on the main page for the row, and wouldn't waste any space. Michael
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by bgreenlee (Friar) on Jul 24, 2004 at 06:34 UTC
I would probably find an appropriate database to use as well, but if you want to DIY, one thing you could try is to create, as Dave suggested, one large sparse file with space pre-allocated for each "subfile", and then grow that file periodically (by creating a new, larger sparse file and copying the data from the old file into that one) as your subfiles start to fill up. If your subfiles grow at the same rate, you can uniformly increase the main file size (e.g. double it); otherwise, you'll need to come up with a reasonable algorithm for determining how much to grow the file when a subfile fills up (e.g. you probably don't just want to increase the size of just that one subfile, otherwise you'll be doing this expensive operation more than you'd like; you might at the same time increase the size of any other subfiles that are over a certain threshold full). Anyway, sounds like a fun project to hack around with. Good luck. Brad
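A rough sketch of the uniform "grow" step, assuming every subfile occupies a fixed-size slot and all slots are enlarged to the same new size. The helper name grow_slots() and its argument list are hypothetical; the non-uniform growth policy mentioned above is not covered.

```perl
use strict;
use warnings;

# Copy every fixed-size slot from $old_fh into $new_fh, where each slot is
# $new_size bytes instead of $old_size. grow_slots() is a hypothetical name.
sub grow_slots {
    my ($old_fh, $new_fh, $slots, $old_size, $new_size) = @_;
    die "new slot size must not be smaller\n" if $new_size < $old_size;
    for my $i (0 .. $slots - 1) {
        seek $old_fh, $i * $old_size, 0 or die "seek failed: $!";
        my $got = read $old_fh, my $buf, $old_size;
        die "read failed: $!" unless defined $got;
        next unless $got;                    # slot lies past EOF: nothing to copy
        next unless $buf =~ /[^\0]/;         # all-zero slot: leave a hole instead
        seek $new_fh, $i * $new_size, 0 or die "seek failed: $!";
        print {$new_fh} $buf or die "write failed: $!";
    }
    return;
}
```

Skipping all-zero slots keeps the new file sparse on filesystems that support holes. The whole operation still reads every slot once, so it is best done rarely, for example by doubling the slot size each time.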
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by Your Mother (Archbishop) on Jul 24, 2004 at 06:24 UTC
"The fact that the files are built at the same time! For every integer that is added in the middle of the mega-file (e.g. all of them), the location of every sub-file would have to be changed!" Do the sub-files' positions really need to change, or is it just the way you're envisioning it would have to be? If they don't need to, maybe a DB_RECNO DB_File is what you need. It would be auto-indexing and you could take sizes of individual records easily. You can splice in new pieces as well as push'ing and unshift'ing from the ends. There is some related file tying goodness in Conway's OOP book, ISBN 1884777791, (his genetic array in particular might be interesting for this).
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by dragonchild (Archbishop) on Jul 24, 2004 at 21:08 UTC
After reading through the problem and offered solutions ... I have a few thoughts: "4KB standard cluster size." That very phrase implies you unconsciously know that cluster size can be changed. I had always been under the impression that block size was 512 bytes and that you could alter your cluster:block ratio from 8:1 to 1:1. .5K should do you just fine, right? The inode issue is going to be your bigger issue. Potentially, you might want to partition a bunch of times, but that's annoying to manage. Your entire problem statement is poorly defined. Granted, there is obviously information you don't feel that you can post on a public board, and that information would probably fill some of the gaps I have in my understanding. But, I think your problem really boils down to the following: (1) You need N buckets, where N is some really really large number. (2) Each bucket will contain some number of values each taking up 4 bytes of space. (3) Each bucket may contain a different number of these values. (4) The maximum number of values in any bucket will be M. (5) Each bucket has a unique identity. (6) The buckets will be built on an infrequent basis. (7) The buckets will be searched on a very frequent basis. (8) Finding the right bucket as fast as possible is the overriding priority. (9) Reducing the space taken for this solution is a very high priority. I am assuming the 4th statement, but it's a reasonable assumption. M may be very large relative to the average number in any bucket, but that's ok. This would mean you need, at most, 4 * M * N bytes. But, there is often a size limitation in how big a given file can be. Let's say that value is 2GB (a common value, though not the only one). So, you would need (4 * M * N) / (2 * 1024 * 1024 * 1024) files (rounded up, of course). Oh, and an index file. This file would need to be (4 + 1 + 4) * N bytes long. 4 for the bucket identifier, 1 for the file identifier, and 4 for the location to seek to in the file.
I'm assuming that you will never have more than 256 files, because that would mean you had, potentially, a half-terabyte of data you were working with. If that was the case, you wouldn't be complaining about size considerations. I'm also assuming you don't have more than 238,609,294 buckets. If you did, you would need a second index file. Given all that, you could take the following approach, if you had a lot of temp space to work in. Generate your data, using a fixed record size of M values. Let's say M is 1024. This would mean you would fit 512 * 1024 records in each 2GB file. So, if you had 5 million records, you would end up with 10 2GB files - a total of 20GB. Do not generate the index file yet. After you've generated your data, go through it all again. Now, you're going to compact your files as well as generate your index. I'm going to assume you have a value that isn't legal that you can use as a flag. Let's say it's -1, which is usually represented by 0xFFFF...FFFF. So: Read in the next record. (sysread() the next 4096 bytes.) Strip off the trailing 0xFFFF...FFFF bytes. See if you have enough space in the compacted file you're writing to. If you do, write to it and update your index. If you don't, close that file, open a new one, increment the fileno in your index, and write it out. Once you're done reading a file, delete it from temp. If you average 150 values per bucket, this would translate to (4 * 150 * 5,000,000), or 2 data files and one index file. One data file would be right around 2GB and the other would be roughly 813MB. The index file would end up being (9 * 5,000,000) bytes, or a little under 43MB. This, however, does require around 22GB of temp space, at the worst case. Depending on how sparse the first file you compact is, it may require closer to 20.2GB. It also requires a bunch of processing time. The compacting process would probably take, given a dual 2.6 Xeon with 4GB of RAM and not much else on the box ... a couple hours. 
The writing portion of the generating process wouldn't be very bad at all. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested
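The compaction pass described in that post could be sketched as follows. It assumes M = 1024 (4096-byte records), 0xFFFFFFFF as the filler value, and a single output file, so the file number in each 9-byte index entry is always 0; splitting across several 2GB files is left out. The name compact() is hypothetical.

```perl
use strict;
use warnings;

use constant RECORD_BYTES => 4096;      # M = 1024 values of 4 bytes each
my $FILLER = "\xFF" x 4;                # the 'not legal' flag value (-1)

# Read fixed-size records from $in, drop trailing filler values, write the
# compacted data to $out and one 9-byte index entry per bucket to $idx.
# Assumption: one output file only, so the file number is always 0.
sub compact {
    my ($in, $out, $idx) = @_;
    my ($bucket, $offset) = (0, 0);
    while (1) {
        my $got = sysread $in, my $rec, RECORD_BYTES;
        die "read failed: $!" unless defined $got;
        last unless $got;
        # Strip whole 4-byte filler values from the end of the record.
        substr($rec, -4, 4, '')
            while length($rec) >= 4 && substr($rec, -4) eq $FILLER;
        # Index entry: bucket id (4 bytes), file number (1), seek position (4).
        print {$idx} pack 'N C N', $bucket, 0, $offset;
        print {$out} $rec;
        $offset += length $rec;
        $bucket++;
    }
    return $bucket;                     # number of buckets processed
}
```

The index stores no length, so the end of a bucket is taken to be the start of the next one (or the end of the data file for the last bucket).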
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by mhi (Friar) on Jul 24, 2004 at 22:12 UTC
I may have missed something here and therefore the following approach might be oversimplified. I'd write the data to a flat file in the first pass, with the file structure being lines with key-value-pairs. The key would represent the "filename" and the value one of the "4-byte" values of the OP. Make sure a new file is started before the max filelength for the OS or the FS is reached. If the "filename" is too long, I would create a separate file mapping each "filename" to a shorter key. Obviously each key will occur as many times as there are values for it, each time on a separate line. The order of the values (should they matter) will be preserved in the order of the lines. In the second pass, once all the values have been written to the file(set), analyze it once for each key and write all of the values per key into a single arbitrary-length record of a new target file(set). In the third pass, create the index on the target file(set). In this way the first pass file(set) will accept values for keys in any order, appending them to the end of the file and will not waste space for large records that won't be needed most of the time. The second pass will take a whole lot of time, but as I understand it time is not the issue here. Generally, if space is a major consideration, a DBMS is the last thing I would look at. There's just too much overhead there, in order to make it work with all kinds of data structures. Update: Corrected spelling mistake. Added Comment on DB.
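The second and third passes of that scheme might look like the sketch below. It assumes the first-pass log is text lines of the form "key value" and, for brevity, groups everything in memory in one scan, whereas the post describes scanning the log once per key. The name build_target() and the [offset, length] index format are assumptions.

```perl
use strict;
use warnings;

# Pass 1 output is assumed to be text lines of the form "key value".
# Pass 2: gather the values for each key, preserving their order, and build
# one variable-length record per key as packed 32-bit integers.
# Pass 3: the index maps each key to [offset, length] within the target data.
# Assumption: the data is small enough to group in memory in a single scan.
sub build_target {
    my ($log_fh) = @_;
    my (%values, @order);
    while (my $line = <$log_fh>) {
        chomp $line;
        my ($key, $value) = split ' ', $line, 2;
        next unless defined $value;              # skip blank or malformed lines
        push @order, $key unless exists $values{$key};
        push @{ $values{$key} }, $value;
    }
    my $data = '';
    my %index;
    for my $key (@order) {
        my $record = pack 'N*', @{ $values{$key} };
        $index{$key} = [ length($data), length($record) ];
        $data .= $record;
    }
    return ($data, \%index);
}
```

Because the first pass only ever appends, values for different keys can arrive in any order, and no space is reserved for records that stay small.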
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by thor (Priest) on Jul 24, 2004 at 13:40 UTC
From what I remember about the "bad old days" when disks were smaller, the block size isn't a static 4 KB. It depends on the size of the disk that you're using. You could try partitioning the disk that you have into 2 (or more) smaller partitions. This could reduce the block size on each individual partition, thus reducing waste overall. thor
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by superfrink (Curate) on Jul 24, 2004 at 17:58 UTC
I believe when you create an ext2 filesystem you can select the block size to be 1k, 2k, or 4k. Also maybe look into ReiserFS, which I seem to recall will pack multiple small files into the same block to save space. Also of note: if you are using ext2 (common in Linux) or FFS (OpenBSD), and probably others, you have to be concerned with the inode count ("Index NODE"). Every file on those systems needs an index node. If you have too many files you won't be able to create any more even though you have free disk space. Use the command "df -i" to see how many inodes your partitions have free. I've seen it happen more than once, and when you run out of inodes and try to create a file you will get a "No space left on device" error message. Also note that ReiserFS does not use inodes. Instead it uses a balanced tree. Balanced trees are faster to search when you have a large number of entries than a linear list (which I believe ext2 uses for filenames in a directory). All that said, I don't really understand how your data will be generated and accessed, so I'd also probably default to suggesting a database like others above have said. That's just because it's easier to let some already developed and tested code do the work for me. I guess what I mean is: sure, I can write linked lists in C with pointers, but Perl can cut down my development time because lists are a given part of the language. PS: I don't want to start a language war. I don't want to talk about STL. I was just mentioning that using existing tools can help me finish my work faster.
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by Dr. Mu (Hermit) on Jul 24, 2004 at 17:00 UTC
You may find this passionately-discussed thread regarding writing to the top of a file (#354830) useful here as well. Among the ideas presented are: Using linked lists to avoid moving stuff around once it's written to a file, and Using a filesystem that doesn't rely on clustering.
Re: Combining Ultra-Dynamic Files to Avoid Clustering (A better way?) by BrowserUk (Patriarch) on Jul 25, 2004 at 11:35 UTC
Rather than writing 1,000,000 files x 4096-bytes, turn the problem around. Write 1024 files x 4,000,000 bytes. The 'file' number x 4 becomes the offset into the file. The 'file position' becomes the file number. This addresses both the > 2 GB problem and the 'maximum filesize assumption' problem. Many of the 1024 files would be sparsely populated, but from what I read, XFS and reiserFS support this for Linux, and placing the files in a compressed directory would deal with that on Win32.
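A small sketch of that transposed layout: the value at position $pos of logical file $fileno lives in physical file number $pos, at byte offset $fileno * 4. The "pos_%04d.bin" file naming, the handle cache, and the helper names are assumptions for illustration.

```perl
use strict;
use warnings;

# Return a cached read/write handle for physical file number $pos, creating
# the file on first use. The "pos_%04d.bin" naming is an assumption.
sub physical_handle {
    my ($dir, $handles, $pos) = @_;
    return $handles->{$pos} //= do {
        my $path = sprintf '%s/pos_%04d.bin', $dir, $pos;
        my $mode = -e $path ? '+<' : '+>';
        open my $fh, $mode, $path or die "Cannot open $path: $!";
        binmode $fh;
        $fh;
    };
}

# Store the $pos-th value of logical file $fileno.
sub put_value {
    my ($dir, $handles, $fileno, $pos, $value) = @_;
    my $fh = physical_handle($dir, $handles, $pos);
    seek $fh, $fileno * 4, 0 or die "seek failed: $!";
    print {$fh} pack('N', $value) or die "write failed: $!";
    return;
}

# Fetch the $pos-th value of logical file $fileno, or nothing if the
# physical file is too short to contain it.
sub get_value {
    my ($dir, $handles, $fileno, $pos) = @_;
    my $fh = physical_handle($dir, $handles, $pos);
    seek $fh, $fileno * 4, 0 or die "seek failed: $!";
    my $got = read $fh, my $buf, 4;
    return unless defined $got && $got == 4;
    return unpack 'N', $buf;
}
```

A hole reads back as zero, so a stored 0 cannot be told apart from "never written" without a separate count of values per logical file. Keeping up to 1024 handles open may also run into the per-process open-file limit on some systems.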
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by Anonymous Monk on Jul 25, 2004 at 03:16 UTC
Personally, I would let someone else do the heavy lifting and just use reiserfs
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?) by TomDLux (Vicar) on Jul 25, 2004 at 02:06 UTC
If you really want to do things by yourself .... Get a honking big file, say 4GB, and a smaller file, say 4 MB. To manipulate file 17623, seek location 17623 * 4 in the smaller file, and read in the 4 byte word you find there, call it <offset>. Seek location <offset> in the larger file, and read bytes until you reach the EOF marker. Alternately, you could store a 2 byte length in the index file. If you need to enlarge file N, copy it to the end of the large file, where you can do whatever you want. You can keep an index of unused 'holes', and move files that fit into the hole instead of to the end of the file. When the file runs out of space, compact the file to a new file, generating a new index. On the other hand, DB companies spend millions of dollars and devote hundreds of employees to maximize the efficiency of such operations. Can you really out-perform them? -- `TTTATCGGTCGTTATATAGATGTTTGCA`
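The lookup half of that scheme can be sketched like this, using the variant with a 2-byte length in the index. Each index entry is then 6 bytes (4-byte offset plus 2-byte length), so the seek position is the file number times 6 rather than times 4. The helper name read_subfile() and the big-endian packing are assumptions; enlarging, hole tracking, and compaction are not shown.

```perl
use strict;
use warnings;

# One index entry per logical file: a 4-byte offset into the big data file
# plus the 2-byte length suggested as the alternative to an EOF marker.
use constant ENTRY_BYTES => 6;

# Return the raw bytes of logical file $fileno.
# read_subfile() is a hypothetical helper name.
sub read_subfile {
    my ($index_fh, $data_fh, $fileno) = @_;
    seek $index_fh, $fileno * ENTRY_BYTES, 0 or die "seek failed: $!";
    my $got = read $index_fh, my $entry, ENTRY_BYTES;
    die "no index entry for file $fileno\n"
        unless defined $got && $got == ENTRY_BYTES;
    my ($offset, $length) = unpack 'N n', $entry;
    return '' unless $length;
    seek $data_fh, $offset, 0 or die "seek failed: $!";
    $got = read $data_fh, my $bytes, $length;
    die "short read for file $fileno\n" unless defined $got && $got == $length;
    return $bytes;
}
```

Moving a subfile to the end of the big file then only requires rewriting its 6-byte index entry.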

