PerlMonks  

Re^2: Perl solution for storage of large number of small files

by isync (Hermit)
on Apr 30, 2007 at 10:01 UTC (#612726)


in reply to Re: Perl solution for storage of large number of small files
in thread Perl solution for storage of large number of small files

Been there, done that. Actually, that was for the meta-data index of the heavy-load storage...
The first incarnation was a DBM:mldb. The second version used SQLite, with which I ran into heavy disk I/O overhead when inserting/updating meta-data. Now the index is held in memory as a plain data structure...

So, do you actually recommend SQLite as storage for binary data?

Replies are listed 'Best First'.
Re^3: Perl solution for storage of large number of small files
by salva (Abbot) on Apr 30, 2007 at 10:56 UTC
    So, do you actually recommend SQLite as storage for binary data?

    Well, I neither recommend it nor advise against it. I was only suggesting that you try another backend!

    Which database is best for a given problem depends not only on the data structures but also on the usage pattern.

    Anyway, if you need to access 2GB of data randomly, there is probably nothing you can do to stop disk thrashing other than adding more RAM to your machine, so that all the disk sectors used for the database remain cached.

      Hi isync and salva, interesting topic.

      Anyway, if you need to access 2GB of data randomly, there is probably nothing you can do to stop disk thrashing other than adding more RAM to your machine, so that all the disk sectors used for the database remain cached.

      In this situation (more data than memory, but not loads more) I've found that memory mapping works well. In my case the data accesses were randomly scattered but with a non-uniform distribution: although the access wasn't sequential, some data was accessed more often than the rest. So memory mapping meant that the often-accessed data stayed cached in RAM.
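
      To make that concrete, here is a minimal sketch in Python of reading fixed-size records through a memory map. The file name, the 64-byte record size, and the helper names are all made up for illustration; my real data looked different.

```python
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record size, in bytes


def build_demo_file(path, n_records):
    """Write n_records fixed-size records; each starts with its own index."""
    with open(path, "wb") as f:
        for i in range(n_records):
            f.write(struct.pack("<I", i).ljust(RECORD_SIZE, b"\0"))


def read_record(mm, index):
    """Slice one record out of the mapping; only the pages touched get faulted in."""
    start = index * RECORD_SIZE
    return mm[start:start + RECORD_SIZE]


def record_id(record):
    """Decode the little-endian index stored at the front of a record."""
    return struct.unpack_from("<I", record)[0]


path = os.path.join(tempfile.mkdtemp(), "records.dat")
build_demo_file(path, 1000)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # A skewed access pattern: record 7 is "hot", record 993 is touched once.
        hot = [record_id(read_record(mm, i)) for i in (7, 7, 7, 993, 7)]
    finally:
        mm.close()
```

      After the first touch, repeated reads of the hot record are served from pages already in RAM, as long as the kernel has no reason to evict them.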

      Any decent database should be able to do pretty much the same thing, as long as you configure it with a big cache, although access will be slower than with memory mapping.
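
      For SQLite, for instance, the relevant knob is the page cache. A rough sketch in Python follows, assuming a made-up `files` table and an arbitrary cache size of about 256 MiB. It uses an in-memory database only so that it runs anywhere; a real store would pass a file path to `connect`.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a real store would use a file path.
conn = sqlite3.connect(":memory:")

# A negative cache_size is interpreted by SQLite as a size in KiB,
# so this asks for roughly 256 MiB of page cache (an arbitrary figure).
conn.execute("PRAGMA cache_size = -262144")

# Hypothetical schema: one row per small file, contents stored as a BLOB.
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB)")


def put(name, data):
    """Insert or overwrite one file; 'with conn' commits the transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
            (name, data),
        )


def get(name):
    """Return the stored bytes, or None if the name is unknown."""
    row = conn.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


put("a.bin", b"\x00\x01\x02")
```

      Here `get("a.bin")` hands back the same three bytes and `get("missing")` returns None. Whether a cache of this size is enough depends on how much of the data is hot.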

      The real problem comes if you're making a lot of changes to the data, which busts your cache...

      Best wishes, andye

        Often-accessed data will stay in memory whether it is accessed via read() or mmap(). mmap() can be a more convenient interface precisely because of the opposite effect: data on disk mapped by mmap() *isn't* automatically brought into memory until it is used, and then only the bits which are needed are brought in (subject to 4k page granularity), whereas a successful read() will always bring the data into memory.

        This means you are perhaps less likely to have unwanted data in memory, but that has more to do with the read() approach taking more code to do well than with mmap()'d data being more likely to stick in memory.

        The kernel might trigger different heuristics for the two different methods of access (such as readahead if you do a number of sequential reads or a big sequential memory access to an mmap'd area), but I'm not even sure of that - they might go through exactly the same code paths.

        I'd say that the biggest difference is that the results of a read() are normally copied into a per-process buffer in the application, whereas multiple processes can in principle share the same copy of mmap'd data.
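
        The two access styles can be put side by side in a small Python sketch. The scratch file, the offset, and the length are arbitrary choices for illustration. Note that Python's slice of a mapping also copies the bytes out, so the sharing point applies to the mapped pages themselves, not to the `bytes` objects.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # the platform's page size, commonly 4096 bytes

# Scratch file of 16 pages filled with random bytes (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "blob.dat")
with open(path, "wb") as f:
    f.write(os.urandom(16 * PAGE))

offset, length = 5 * PAGE + 100, 32

# read(): seek, then copy the requested bytes into a buffer owned by this process.
# buffering=0 avoids Python's own read-ahead buffer, so this is a plain read call.
with open(path, "rb", buffering=0) as f:
    f.seek(offset)
    via_read = f.read(length)

# mmap(): map the whole file; creating the mapping copies nothing, and a page
# is only faulted in when the slice below touches it.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        via_mmap = mm[offset:offset + length]
```

        Both routes return the same 32 bytes; what differs is when the data is brought in and who owns the copy.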
