Re^3: Am I on the right track?

by graff (Chancellor)
on Jul 12, 2015 at 19:23 UTC


in reply to Re^2: Am I on the right track?
in thread Am I on the right track?

I'm not sure I understand your question. It looks to me like the OP code opens every input file twice: once to get its page count, and once to append its content to a chosen output file. (Your "get_page_cnt()" sub only returns a group name, not a file handle or pdf content.) My suggestion is no different in that regard.

Where my suggestion differs is that all the inputs are scanned first, before any output is done, and then there's a nested loop: for each output "group" file, create it, then for each input file in that group, concatenate its content. (If none of the inputs fall into a given group, there's no need to create an output file for that group.)
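For what it's worth, here's a minimal sketch of that shape using CAM::PDF. The directory names and the grouping rule (1-page, 2-page, everything else) are placeholders, since I don't know your actual rule:

    use strict;
    use warnings;
    use CAM::PDF;

    my $in_dir  = 'data2';      # placeholder input directory
    my $out_dir = 'grouped';    # placeholder output directory

    # Pass 1: scan every input once, bucketing file names by group.
    my %group;    # group name => array ref of input file names
    opendir my $dh, $in_dir or die "Can't read $in_dir: $!";
    for my $file ( sort grep { /\.pdf$/i } readdir $dh ) {
        my $pdf   = CAM::PDF->new("$in_dir/$file") or die "Can't parse $file";
        my $pages = $pdf->numPages();
        my $name  = $pages <= 2 ? $pages . '_page' : 'other';   # made-up rule
        push @{ $group{$name} }, $file;
    }
    closedir $dh;

    # Pass 2: create each output exactly once, appending its inputs in turn
    # (each input does get opened a second time here, as noted above).
    for my $name ( sort keys %group ) {
        my $out;
        for my $file ( @{ $group{$name} } ) {
            my $pdf = CAM::PDF->new("$in_dir/$file") or die "Can't parse $file";
            if ($out) { $out->appendPDF($pdf) }
            else      { $out = $pdf }
        }
        $out->cleanoutput("$out_dir/$name.pdf");
    }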

Opening and closing each output file exactly once is bound to involve less overhead on the whole, compared to randomly closing and reopening output files (but I have no idea whether the difference will be noticeable in wall-clock time).

Another thing to consider is whether you have to worry about an upper bound on the amount of data you can concatenate into one pdf file for a single print job. If so, I think my suggested approach would make it easier to manage that, because you can work out all the arithmetic for partitioning before creating any outputs.
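For instance, continuing the sketch above, and assuming the first pass also saved each file's page count in a hypothetical %page_count hash:

    my $max_pages = 500;    # assumed per-output page cap
    my @parts;              # each element: an array ref of input file names
    my @current;
    my $sum = 0;
    for my $file ( @{ $group{$name} } ) {
        my $pages = $page_count{$file};    # saved during the first scan
        if ( @current and $sum + $pages > $max_pages ) {
            push @parts, [@current];
            @current = ();
            $sum     = 0;
        }
        push @current, $file;
        $sum += $pages;
    }
    push @parts, [@current] if @current;
    # each element of @parts then becomes its own numbered output file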

Re^4: Am I on the right track?
by Pharazon (Acolyte) on Jul 13, 2015 at 17:08 UTC

    It's possible that my code isn't doing what I think it is, but I don't think I am opening the input file twice.

    If I'm correct in my thinking (based on my interpretation of the module's documentation), what it does is open an input pdf and check how many pages it has. Then it looks to see whether the output file that is currently open (if there is one) is the one this pdf needs to be added to. If so, it appends the input to the output and moves on to the next input. If not, it writes out the output that is currently open (clearing it from memory and clearing the stream of data that was being built), then opens the correct output file, appends the input to it, and moves on to the next input.

    If successive input files belong to the same output file, that output stays open, with data being added to its stream (the term might not be right), until an input file that goes elsewhere is opened, at which point the output is written. So I should only be opening each input file once; but if every other file belonged in a different output file, I would be doing a lot of opening and closing of those output files. Realistically, though, large swaths of the data belong in the 1- and 2-page files, with the others being more sporadic.
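    In rough outline, it's something like this (illustrative names, not my exact code; %out_file would map each group to its output path):

        my ( $current_group, $pdf );
        for my $append_pdf (@input_files) {
            my $group = get_page_cnt($append_pdf);   # pick the group by page count
            if ( defined $current_group and $group ne $current_group ) {
                # write out the output that's currently open, then switch groups
                $pdf->cleanoutput( $out_file{$current_group} );
                # reopen the new group's file if it already has content
                $pdf = -e $out_file{$group} ? CAM::PDF->new( $out_file{$group} )
                                            : undef;
            }
            my $otherpdf = CAM::PDF->new( $dir . "data2\\" . $append_pdf );
            if ($pdf) { $pdf->appendPDF($otherpdf) }
            else      { $pdf = $otherpdf }
            $current_group = $group;
        }
        $pdf->cleanoutput( $out_file{$current_group} ) if $pdf;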

    I am going to test the hash method you mentioned regardless, but I hope that makes my intentions and code a bit clearer.

      It's possible that my code isn't doing what I think it is, but I don't think I am opening the input file twice.

      When you do this:

      my $group = get_page_cnt($append_pdf);
      a file handle is created, the data file is opened for reading, some amount of it (I don't know how much) is read into memory and processed in some way, the file handle is closed, and the result of processing is returned as a scalar value.
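      That is, get_page_cnt() presumably has roughly this shape (a guess at the body; all we know for sure is that it returns a group name):

      sub get_page_cnt {
          my ($file) = @_;
          my $pdf   = CAM::PDF->new( $dir . "data2\\" . $file );  # open number one
          my $pages = $pdf->numPages();
          return $pages <= 2 ? $pages . '_page' : 'other';        # assumed rule
      }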

      Then, when you do this:

      my $otherpdf = CAM::PDF->new($dir."data2\\".$append_pdf);
      $pdf->appendPDF($otherpdf);
      another file handle is created, that same input file is opened and read, and this time its contents are copied to an output file. In this latter case, it's very likely that the OS and/or system hardware have cached the file and its contents, so the second open/read/close will almost certainly be faster than the first one, but there are still two instances of opening the file and linking it to a file handle for access.

      Meanwhile, there's the separate issue of how many times a given output file needs to be closed and then reopened, depending on how the particular ordering of input file names (as returned by "readdir") relates to the page counts of the files.
