
Caching Format

by hok_si_la (Curate)
on Jan 11, 2012 at 17:26 UTC ( #947386=perlquestion )

hok_si_la has asked for the wisdom of the Perl Monks concerning the following question:

Good localtime friends,

I recently inherited a project at work that requires storing information about collections of images. Essentially each image that comes through has the following attributes:

1) Name of the collection it belongs to
2) Number of images in the collection
3) Image number _ of _ in the collection
4) File name of the image

As these come in I need to store the information above until the final image of the collection is processed. I was thinking about creating a file (collectionname.dat) using a format like ...

    Number of images => 5
    1 => image1.jpg
    2 => image2.jpg
    4 => image4.jpg
    5 => image5.jpg

So in the example above I have not yet received the third image of the collection. Each time I receive an image I have to perform the following:

1) Check for the existence of the dat file, creating one if required
2) Lock the file (concurrency)
3) Check whether the latest image completes the collection
  3a) If not, write the image information to the dat file
  3b) If it does, send the collection out and delete my dat file
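
The steps above can be sketched with core modules only. This is a minimal sketch under stated assumptions, not a definitive implementation: the sub name record_image, the one-entry-per-line "index=filename" layout, and passing the total as an argument (rather than storing it in the file) are all hypothetical choices made for illustration. It also assumes a POSIX-like system where a file can be unlinked while it is still open and locked.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock :seek);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: record one image. Returns the ordered file names when
# the collection is complete, otherwise an empty list.
sub record_image {
    my ($dir, $collection, $total, $index, $file) = @_;
    my $path = File::Spec->catfile($dir, "$collection.dat");

    # Steps 1 and 2: O_CREAT creates the file if it is missing, so there is
    # no separate existence check that could race with another process.
    sysopen(my $fh, $path, O_RDWR | O_CREAT) or die "Cannot open $path: $!";
    flock($fh, LOCK_EX) or die "Cannot lock $path: $!";

    # Read the entries that have arrived so far ("index=filename" per line).
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        my ($k, $v) = split /=/, $line, 2;
        $seen{$k} = $v if defined $v && $k =~ /^\d+$/;
    }
    $seen{$index} = $file;

    # Step 3: does the latest image complete the collection?
    if (keys(%seen) == $total) {
        unlink $path or die "Cannot remove $path: $!";    # step 3b
        close $fh;
        return map { $seen{$_} } sort { $a <=> $b } keys %seen;
    }

    # Step 3a: append the new entry (seek is required between read and write).
    seek($fh, 0, SEEK_END) or die "Cannot seek in $path: $!";
    print {$fh} "$index=$file\n";
    close $fh or die "Cannot close $path: $!";            # also drops the lock
    return;
}

# Usage with a throwaway directory: a two-image collection arriving out of order.
my $dir    = tempdir(CLEANUP => 1);
my @first  = record_image($dir, 'holiday', 2, 2, 'image2.jpg');
my @second = record_image($dir, 'holiday', 2, 1, 'image1.jpg');
print "complete: @second\n" if @second;
```

Because the lock is taken on the data file itself, every writer has to go through the same open-then-lock sequence for the locking to mean anything; flock is advisory only.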

Now on to the question. Any ideas or suggestions to make this quicker and easier? Currently the use of a DB is not an option.


Replies are listed 'Best First'.
Re: Caching Format
by moritz (Cardinal) on Jan 11, 2012 at 17:41 UTC

    You could just keep the data in memory... or is there a reason not to?
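
A minimal sketch of the in-memory idea, assuming a single long-running process receives every image (the state is lost if that process restarts). The names %pending and add_image are hypothetical.

```perl
use strict;
use warnings;

# %pending maps collection name => { image number => file name }.
my %pending;

# Hypothetical helper: returns the ordered file list once the collection is
# complete (and forgets it), otherwise an empty list.
sub add_image {
    my ($collection, $total, $index, $file) = @_;
    $pending{$collection}{$index} = $file;
    my $got = $pending{$collection};
    return unless keys(%$got) == $total;
    delete $pending{$collection};
    return map { $got->{$_} } sort { $a <=> $b } keys %$got;
}
```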

    Now on to the question. Any ideas or suggestions to make this quicker and easier?

    Quicker to program or quicker to run?

    If you want it quicker to run, you could just put the data on a RAM disc instead of a hard disc. If you want it easier to program, you could just use Perl data structures and JSON::XS or Storable for serializing and deserializing.
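
A small sketch of the "Perl data structure plus serializer" approach. It swaps in JSON::PP, which ships with Perl 5.14 and later and offers the same encode/decode style of interface, in place of the JSON::XS named above; that substitution, and the shape of the %collection hash, are assumptions for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# Collection state as a plain Perl data structure (hypothetical layout).
my %collection = (
    total  => 5,
    images => {
        1 => 'image1.jpg',
        2 => 'image2.jpg',
        4 => 'image4.jpg',
        5 => 'image5.jpg',
    },
);

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->pretty;

my $text = $json->encode(\%collection);   # serialize to a JSON string
my $copy = $json->decode($text);          # deserialize back into a hash ref

# Which image numbers have not arrived yet?
my @missing = grep { !exists $copy->{images}{$_} } 1 .. $copy->{total};
print "missing: @missing\n";
```

The JSON text is readable in an editor, which may help when debugging incomplete collections.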

    Currently the use of a DB is not an option.

    And what is an option? How much control do you have over your environment? What about non-databases like memcached?

      Thanks everyone for the input and suggestions.

      The environment I am working with is accredited (read "takes an act of God to add or remove software requirements such as SQL, Postgres, etc.") and hardened. Additionally, it is hosted on one of our customer's systems. So the short answer concerning the environment is that I have very little control. The images are being sent to this one vendor from multiple hands, all with a defined file name format, so unless I decide to rename everything I need to store the additional info somewhere else. I should also mention that I plan to write a web app to run reports on submitted/non-submitted/incomplete collections. Using memory might be an option, though I do not know how long I need to keep this information available. I do know that tens of thousands of images come in daily, so I guess I will have to weigh file I/O choking against hogging memory.


Re: Caching Format
by Eliya (Vicar) on Jan 11, 2012 at 18:06 UTC

    If I'm understanding you correctly, you could do away with the extra file (and the need for locking) by putting the info in the filenames themselves.  For example


    where CollectionID is a unique collection identifier, N is the expected total number of images, and IDX the image number within the collection (the collection ID could also be a directory).  All you then have to do after having received a new image is a simple glob plus a check for completeness.

    Update: forgot to mention that to avoid potential concurrency issues (reading files that are not yet completely written), you'd rename a file to its final name only after you've finished writing it.
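
A rough sketch of the glob-plus-completeness check together with the write-then-rename step. The naming scheme used here, <CollectionID>_<N>_<IDX>.jpg, is purely hypothetical (the example file names from the post are not reproduced above), as are the sub names. It assumes collection IDs contain no glob metacharacters and that the temporary and final names live on the same filesystem, which is what makes rename atomic on POSIX systems.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Spec;
use File::Temp qw(tempdir);

# Write under a temporary name, then rename, so that a concurrent glob
# never sees a half-written image.
sub store_image {
    my ($dir, $id, $n, $idx, $data) = @_;
    my $final = File::Spec->catfile($dir, "${id}_${n}_${idx}.jpg");
    my $tmp   = "$final.part";
    open(my $fh, '>:raw', $tmp) or die "Cannot write $tmp: $!";
    print {$fh} $data;
    close $fh or die "Cannot close $tmp: $!";
    rename($tmp, $final) or die "Cannot rename $tmp: $!";
    return $final;
}

# A collection is complete when every index 1..N is present.
sub collection_complete {
    my ($dir, $id, $n) = @_;
    my %have;
    for my $path (bsd_glob(File::Spec->catfile($dir, "${id}_${n}_*.jpg"))) {
        $have{$1} = 1 if $path =~ /_(\d+)\.jpg\z/;
    }
    my @missing = grep { !$have{$_} } 1 .. $n;
    return @missing ? 0 : 1;
}

# Usage with a throwaway directory and fake image data.
my $dir = tempdir(CLEANUP => 1);
store_image($dir, 'holiday', 3, $_, "fake image $_") for 1, 3;
print collection_complete($dir, 'holiday', 3) ? "complete\n" : "waiting\n";
store_image($dir, 'holiday', 3, 2, "fake image 2");
print collection_complete($dir, 'holiday', 3) ? "complete\n" : "waiting\n";
```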

      It sounds like there are multiple collections in use at the same time, so it should be worth storing all the files for one collection in its own directory. That will make examining the files quicker. But if there are thousands of files then subdirectories could help too.

       collectionNN/C000/files[0-99]
       collectionNN/C100/files[100-199]
       etc ...

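
The bucketing in the layout above can be computed from the file index. A small sketch, where the sub name bucket_dir and the default of 100 files per directory are assumptions taken from the example layout:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: map a file index to a bucket directory holding at most
# $per_dir files, so that indexes 0-99 land in C000, 100-199 in C100, and so on.
sub bucket_dir {
    my ($collection, $index, $per_dir) = @_;
    $per_dir ||= 100;
    my $start = int($index / $per_dir) * $per_dir;
    return File::Spec->catdir($collection, sprintf('C%03d', $start));
}

print bucket_dir('collectionNN', 42),  "\n";
print bucket_dir('collectionNN', 137), "\n";
```
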
        Hello RichardK, friends,

        Sorry for answering via a shotgun message. First, again thanks for your time and help. After I posted my first reply I began playing around with and fell in love with it. It was simple and handled the metadata caching format nicely. Collection entries could be pushed/popped as hash key=>value pairs. It also handled file locking and provided many methods to do all of the things I needed to do. Unfortunately I found out later from my boss that not only are Dbases not allowed, but any Perl Module that is not a Perl5 core module cannot be used either. Mulligan!

        Regarding the heap vs files debate: I learned that the required level of persistence is actually quite high, certainly high enough to warrant the use of a Dbase if that were an option. Essentially, collections will be kept indefinitely. That is the reason I chose to use files. I also found out for certain that I cannot modify file names. As of now I plan on creating a pseudo-namespace for each collection by throwing collection metadata and files into unique directories.


        P.S. I used a lot of buzzwords and somehow left out "Cloud" so there I said it.
Re: Caching Format
by RichardK (Parson) on Jan 11, 2012 at 18:03 UTC

      I second that motion. SQLite isn’t an SQL database server. What it is, is a public-domain(!) single-file format which supports a rather full SQL-database model within that file (or files), including reliable transactions and therefore locking and known-good file sharing. You don’t have to install anything beyond a package. Since the database is “just a disk file,” in the same way that the file that you are now contemplating is just a disk file, this approach would give you tremendous “bang for your buck” and I would argue that no proprietary approvals of any kind would be necessary ... even in the most “hardened” business environment. You are storing the data “in a file,” as originally contemplated, but now that file happens to be an extremely smart file. Furthermore, the odds that SQLite is already there, even for run-of-the-mill purposes like hosting the CPAN modules database, are pretty near 100%.

      P.S.:   Yes, I said public domain.   There is no license.

Re: Caching Format
by oko1 (Deacon) on Jan 12, 2012 at 04:06 UTC

    Perl comes with a number of DB-like access methods; see the listing in AnyDBM_File. And you just might have YAML installed - a number of other modules call for it - which would make life just peachy (stores the data in a cleverly-arranged text file, essentially.) Write a simple script that prompts you for the above data and rolls it into YAML as a hash, then retrieve it whenever needed.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use YAML 'DumpFile';
    $|++;

    # Cheating in a bit of data here...
    my %images;
    @images{1..5} = map "image$_.jpg", 1..5;

    # Voila!
    DumpFile("yaml.db", \%images);

    Education is not the filling of a pail, but the lighting of a fire.
     -- W. B. Yeats
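
For the DBM route mentioned above, a minimal sketch of a hash tied to an on-disk file through AnyDBM_File. Pinning the backend to SDBM_File (which is bundled with perl) is an assumption made here so the example behaves the same everywhere; the collection name is hypothetical. Note that DBM files do no locking of their own, so concurrent writers would still need something like flock on a separate lock file.

```perl
use strict;
use warnings;
use Fcntl;                                        # O_RDWR, O_CREAT
BEGIN { @AnyDBM_File::ISA = qw(SDBM_File) }       # pin the backend (assumption)
use AnyDBM_File;
use File::Spec;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $base = File::Spec->catfile($dir, 'holiday');  # hypothetical collection name

# Tie a hash to the DBM file; the backend adds its own file extension(s).
tie(my %images, 'AnyDBM_File', $base, O_RDWR | O_CREAT, 0644)
    or die "Cannot tie $base: $!";
$images{$_} = "image$_.jpg" for 1, 2, 4, 5;
untie %images;                                    # flush and close

# Later (possibly from another process): reopen and look for gaps.
tie(my %again, 'AnyDBM_File', $base, O_RDWR, 0644)
    or die "Cannot re-tie $base: $!";
my @missing = grep { !exists $again{$_} } 1 .. 5;
my $fourth  = $again{4};
untie %again;
print "missing: @missing\n";
```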
Re: Caching Format
by FloydATC (Deacon) on Jan 13, 2012 at 13:59 UTC
    If even SQLite is overkill, how about just using Storable? I've used this in a couple of places where I just wanted my script to "remember" a single hash between runs and didn't have to worry about concurrency etc. All it really does is save you the trouble of formatting/parsing the file.

    use Storable;
    store \%table, 'file';
    $hashref = retrieve('file');

    -- Time flies when you don't know what you're doing

      Just keep in mind that Storable is specific to the perl version (version number + configuration + platform) it was compiled against. Changing the perl version may cause trouble. So Storable is ok for local temp data, but not as a transport format.


      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
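
Storable also exports lock_nstore and lock_retrieve, which take an advisory flock around the write and read, and which write in network byte order. A minimal sketch, with a hypothetical file name and hash layout; whether network order is enough for a given mix of perl versions and platforms is something to verify in that environment rather than assume.

```perl
use strict;
use warnings;
use Storable qw(lock_nstore lock_retrieve);
use File::Spec;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'holiday.sto');   # hypothetical file name

my %table = (
    total  => 5,
    images => {
        1 => 'image1.jpg',
        2 => 'image2.jpg',
        4 => 'image4.jpg',
        5 => 'image5.jpg',
    },
);

# Exclusive advisory lock while writing, data in network byte order.
lock_nstore(\%table, $file) or die "Cannot store to $file";

# Shared advisory lock while reading.
my $hashref = lock_retrieve($file);
my @missing = grep { !exists $hashref->{images}{$_} } 1 .. $hashref->{total};
print "missing: @missing\n";
```

The lock is released between the retrieve and the next store, so a read-modify-write cycle is not atomic on its own; a wrapper holding a separate lock file for the whole cycle would be needed for that.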

Node Type: perlquestion [id://947386]
Approved by moritz
Front-paged by ww