PerlMonks  

How to maximise the content of my data CD

by amaguk (Sexton)
on Feb 25, 2005 at 11:58 UTC (#434432=perlquestion)
amaguk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have some questions about a practical problem. I have a directory with a lot of files. I want to burn these files onto CDs, but how can I maximise the content of each CD and minimise the total number of CDs?

Is there an existing script? (I found nothing during my search, but...)

Is there a good algorithm?

My first thoughts are:
- number of CDs = round up (total size in Mo / 700 Mo)
- on CD1, I put the largest file if the total size of CD1 stays below 700 Mo; otherwise it goes on the following CD;
- on CD2, I put the second largest file if the total size of CD2 stays below 700 Mo; otherwise it goes on the following CD;
- on CD3, I put the third largest file if the total size of CD3 stays below 700 Mo; otherwise it goes on the following CD;
- on CDn, I put the nth largest file if the total size of CDn stays below 700 Mo; otherwise it goes on the following CD;
- and then I loop back to the first CD!
- If there are still files left and all my n CDs are full, I create a new CD for those files.

Is there a better algorithm (I guess yes ;))?

Thanks in advance

Re: How to maximise the content of my data CD
by brian_d_foy (Abbot) on Feb 25, 2005 at 12:20 UTC

    You want a "multiple knapsack" or "Multiple Subset Sum" algorithm. Algorithm::Knapsack may be useful. Google for those and you're on your way. :)

    --
    brian d foy <bdfoy@cpan.org>
      Thanks, I've downloaded Algorithm::Knapsack, and I've found some links on the web.

        And Algorithm::Knapsack comes with a tool named filesack, which does this: "The filesack program finds one or more subsets of files or directories with the maximum total size not exceeding a given size."

        And that's exactly what I want.

        Thank you guys !!!
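
For readers curious about the idea behind a subset-sum tool, here is a minimal Python sketch of picking the set of files that best fills a single disc. It is not the code of filesack or Algorithm::Knapsack, whose internals are not shown in this thread; the function name `best_subset` and the use of whole-number sizes are assumptions made for illustration.

```python
def best_subset(sizes, capacity):
    """Return (total, indices) for the subset whose total is largest
    without exceeding capacity. Sizes are assumed to be whole numbers."""
    # reachable maps an achievable total to one list of file indices
    # that produces it.
    reachable = {0: []}
    for i, size in enumerate(sizes):
        # Iterate over a snapshot so each file is used at most once.
        for total, chosen in list(reachable.items()):
            new_total = total + size
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = chosen + [i]
    best = max(reachable)
    return best, reachable[best]


if __name__ == "__main__":
    # Hypothetical file sizes in megabytes, one 700-unit disc.
    print(best_subset([350, 349, 233, 232, 231], 700))
```

Filling several discs could then be approximated by calling this repeatedly on whatever files remain, though that repetition is itself only a heuristic.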

Re: How to maximise the content of my data CD
by blazar (Canon) on Feb 25, 2005 at 12:56 UTC
    This question tends to come up quite often lately. As has already been pointed out to you, it's basically the knapsack problem, which is known to be a "hard" problem in general. However, a practical answer may depend on the actual average file size: if you only have files whose size is about, say, 1Mb or less, or at least you have a good wealth of such files along with potentially larger ones, then you may be content with a suboptimal solution given by filling up the space with as many of those files as possible.

    As a side note, outside of France (as far as I know) Mo is spelled Mb...

      Sorry for the Mo. I'm not French but Belgian, and a French speaker.

      And thank you for your answer !

Re: How to maximise the content of my data CD
by inman (Curate) on Feb 25, 2005 at 12:58 UTC
    Sounds like homework to me ...

    You could always use an archiving tool to compress all of the files and span them to the desired media size. I brushed off a version of pkzipc (on windows) and had a play. The following command compresses the data and creates a number of 700Mb files suitable for dropping onto CD.
    pkzipc -add -span=1.44 c:save.zip *.doc

    Similar opportunities exist with tar on UNIX systems.

      I've already thought about archive tools, but the problem comes when I want to read a specific file on one disc: I must rebuild the archive and extract it. That's too much effort for one file.

      And it's not homework ;) Just a practical problem (I have a lot of PDF files from articles, excerpts of books, etc. and I want to store all these files on CD), plus my curiosity about how to do this efficiently.

      FYI, see also hjsplit, which is a graphical splitter tool...
      Updated: typo.


      ----
      Zak - the office
Re: How to maximise the content of my data CD
by Limbic~Region (Chancellor) on Feb 25, 2005 at 13:56 UTC
    amaguk,
    This is like the knapsack problem, but not quite. The difference is that you don't need to hit an exact target, you just need to not waste any more CDs than an exact solution would.

    For instance, you have 2GB worth of files. A perfect solution would have that fitting on 3 CDs with room to spare. As long as your solution doesn't require 4 CDs, you have sufficiently solved the problem. I got into a heated debate on this exact same problem in IRC some time ago and could have sworn that I posted about it here, but I can't find it. There is a recent similar thread (Burning ISOs to maximize DVD space), which mentions Algorithm::Bucketizer, which I haven't tried myself. I would attempt the following:

    • $buckets = int($total_size / 700) + 1
    • Order files by size in descending order
    • Round robin files (1 per bucket)
    • When you encounter the first file that will not fit, stay with that bucket but continue down the list until you find one that fits
    • On the next bucket, start back at the top of the file list
    • Wash, rinse, repeat
    I am pretty sure the method will work. I was going to test it but the person complaining in IRC wouldn't provide a list of file sizes for me to try it out on and I wasn't motivated enough to make some up.

    Cheers - L~R

    Update: As pointed out below, perfect solutions that exactly match (or even very nearly match) a whole number of CDs will wind up costing you 1 extra CD. That's why the +1 as the first bullet.
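
The bullet points above can be read in more than one way; the following Python sketch is one possible interpretation, not code from the original poster. The function name `round_robin_pack`, the 700-unit default capacity, and the choice to return unplaced files as a separate list are all assumptions made for illustration.

```python
def round_robin_pack(sizes, capacity=700):
    """One reading of the round-robin heuristic.

    Returns (discs, leftovers): discs is a list of lists of file sizes,
    leftovers holds files that fit on none of the planned discs.
    """
    remaining = sorted(sizes, reverse=True)        # largest files first
    n_discs = int(sum(sizes) // capacity) + 1      # int(total / 700) + 1
    discs = [[] for _ in range(n_discs)]
    placed = True
    while remaining and placed:
        placed = False
        for disc in discs:                         # one file per disc per pass
            free = capacity - sum(disc)
            # Start at the top of the list and take the first file that fits.
            for i, size in enumerate(remaining):
                if size <= free:
                    disc.append(remaining.pop(i))
                    placed = True
                    break
    return discs, remaining


if __name__ == "__main__":
    # Hypothetical file sizes in megabytes.
    discs, leftovers = round_robin_pack([400, 300, 300, 200, 100])
    print(discs, leftovers)
```

Any files returned in `leftovers` would need at least one disc beyond the planned count.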

      Won't always work, though I can't prove how far off it'll be. Mainly, it should work pretty well when you have lots of extra space, but an approximate solution won't be "close enough" if you end up too close to exactly filling all the discs. Consider files of size 350, 349, 233, 232, and 231. You can fit them on two 700 Mb discs (350+349, 233+232+231), but your algorithm will use 3 discs. (If you tried to use only 2, you'd end up with 350+233, 349+232, and the 231 wouldn't fit on either.)

      What I can't prove without a lot more thought is whether you're ever going to be off by more than a single disc, and what can't be proven at all is whether that's close enough for real world purposes. (Since what's "acceptable" in the real world has to do with how long you're willing to wait for an answer vs. how much you care about that extra disc, and other factors.) But just know that the greedy approach won't only be suboptimal in theory, but it will also, sometimes, bleed over into an actual difference.

        Eimi Metamorphoumai,
        Won't always work,...

        What I didn't say explicitly, but was implied by my bullet points, was that 1 disc is being added to account for perfect (or even near perfect) fits. The knapsack problem is hard, but we aren't trying to break encryption, we are trying to save a few pennies on CDs. I don't think (though I could be wrong) that it will ever waste more than 1 disc. To make matters more difficult, we aren't talking about a handful of files but more likely hundreds if not thousands. Let's say that the total size is an exact multiple of 1 CD. That means every single CD needs to be an exact match (which may not even be possible). Proving it can or cannot might take a while (extreme sarcasm). Why not just go with a "good enough" solution?

        Update 2008-11-26: It turns out that this heuristic approach can use as many as 11/9 OPT + 1 bins (according to bin packing). While my experience has been that 1 extra is all you will ever need, it is possible to need more.

        Cheers - L~R
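
The 11/9 OPT figure is the bound usually quoted in the bin packing literature for the first-fit decreasing heuristic. For comparison, here is a minimal Python sketch of first-fit decreasing itself, which is not quite the same procedure as the round-robin bullets earlier in the thread. The function name `first_fit_decreasing` and the 700-unit default capacity are assumptions for illustration.

```python
def first_fit_decreasing(sizes, capacity=700):
    """Place each file, largest first, on the first disc with room,
    opening a new disc only when none of the existing ones can take it."""
    discs = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"file of size {size} cannot fit on any disc")
        for disc in discs:
            if sum(disc) + size <= capacity:
                disc.append(size)
                break
        else:
            # No existing disc had room, so start a new one.
            discs.append([size])
    return discs


if __name__ == "__main__":
    # The file sizes from the counterexample in this thread.
    print(first_fit_decreasing([350, 349, 233, 232, 231]))
```

On the counterexample sizes this particular heuristic happens to find the two-disc packing, but it is still only an approximation in general.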

      How about file sizes of ($bucketsize / 2) + 1? That would be the worst case scenario, and would take N buckets (where N is the number of files).

      Although the OP is not talking about files that large (350Mb PDF files are a little large :), your algorithm can fail in the case of worst case data.

      --MidLifeXis
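
To make the arithmetic of this worst case concrete, here is a small Python sketch. The file count of 10 is a hypothetical value chosen for illustration; the estimate follows the int(total / 700) + 1 formula from the bullet list.

```python
capacity = 700
n_files = 10                        # hypothetical number of files
size = capacity // 2 + 1            # 351: just over half a disc
total = n_files * size              # combined size of all files

# Size-based estimate of the number of discs.
estimate = total // capacity + 1

# No two such files fit on one disc, so every file needs its own disc,
# whatever packing method is used.
two_fit_together = 2 * size <= capacity
discs_needed = n_files if not two_fit_together else -(-n_files // 2)

print(estimate, discs_needed)
```

The gap here is between the size-based estimate and what any packing can achieve, since even an optimal packing needs one disc per file.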

        Just for thought:

        If it is for backup,
        then spread the data,
        and add an additional CD as the checksum CD unit

        ie: Write a perl based CD RAID 5
        then you could recover from a bad or missing CD...

        how many CD burners/PC's do you have available ??
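
The parity idea can be illustrated with XOR. The sketch below is in Python rather than Perl and assumes each disc's data is available as an equal-length block of bytes (shorter blocks would need padding); the function name `xor_blocks` is hypothetical. With a single dedicated parity disc this is closer to RAID 4 than to RAID 5, which spreads parity across all discs.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("blocks must have equal length (pad shorter ones)")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"disc one", b"disc two", b"disc 003"]   # stand-ins for CD contents
    parity = xor_blocks(data)                        # contents of the parity CD

    # Pretend the second disc is lost: XOR the survivors with the parity.
    recovered = xor_blocks([data[0], data[2], parity])
    print(recovered == data[1])
```

A scheme like this can rebuild one missing disc per parity set, not two.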
        That would be the worst case scenario, but I would not say that the algorithm would fail. It would be as accurate as already indicated. It would predict n+1 but in reality you would only need n.

        Zero
        MidLifeXis,
        If each file is ($bucketsize / 2) + 1, it means only 1 file can fit per CD with either method so mine still only wastes the 1 extra CD. I am failing to see how your worst case scenario would make my solution use more than 1 extra CD?

        Cheers - L~R

Node Type: perlquestion [id://434432]
Approved by brian_d_foy
Front-paged by Old_Gray_Bear