I have to wonder why you think threading would be a good approach?
I have a program that does something similar: I wanted to compare the version numbers of rpms, as reported by RPM itself, against those of all other rpms that have the same base name.
In my case, the multiple lists are many but small; i.e., I can have 5-10K different rpm names, but usually only 1-3 different versions of each (assuming the list has been pruned regularly).
The thing that took time in my case was calling "rpm" with a query that reports its idea of how to split the Name from the Version and Release. Calling rpm, or its library, still involves opening the rpm package on disk to parse its header.
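To illustrate the distinction, here is a rough sketch in Python (the author's script is Perl; the function names and example filenames below are my own, not the author's): a naive split of the filename versus asking rpm itself, which is the call that incurs the disk wait.

```python
import subprocess

def split_nvr_naive(filename):
    """Naive Name-Version-Release split from an rpm filename.
    rpm's own answer (from the package header) is authoritative;
    this heuristic can guess wrong for unusual names."""
    base = filename
    if base.endswith(".rpm"):
        base = base[:-4]
    base = base.rsplit(".", 1)[0]        # drop the arch (x86_64, noarch, ...)
    name, version, release = base.rsplit("-", 2)
    return name, version, release

def split_nvr_rpm(path):
    """Ask rpm itself: this opens the package on disk to parse its
    header, which is where the disk-wait time goes."""
    out = subprocess.run(
        ["rpm", "-qp", "--qf", "%{NAME} %{VERSION} %{RELEASE}", path],
        capture_output=True, text=True, check=True)
    return tuple(out.stdout.split())
```

The naive split is free but fallible; the rpm query is correct but I/O-bound, which is what makes parallelizing it pay off.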
As a result, much of the time is spent in disk waits -- and a bit of experimentation on a 12-core box over a RAID50 showed me that scheduling about 9 inspection threads/procs at a time yielded the lowest overall wall-clock time (though the speedup is only 3-6X, so it may not be the most efficient in terms of CPU usage; I was after real-time benefits).
So I make sure my list is sorted by name and split it by the number of procs to allow. The workers each go off and run through their sublist, and when done, report back to the master -- sending back their reduced lists and a second list of rpm names that are 'redundant' (to be removed). I also make sure that the minimum amount of work for each worker is at least "X" queries; if it isn't, I reduce the number of overall workers.
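That split-with-a-minimum-work-floor logic might look like this sketch in Python (the parameter names and the default threshold are my assumptions, not the author's values):

```python
def plan_chunks(items, max_workers=9, min_per_worker=100):
    """Split a sorted list into contiguous chunks, one per worker.
    If there aren't at least min_per_worker items per worker,
    use fewer workers."""
    items = sorted(items)
    n = len(items)
    if n == 0:
        return []
    workers = min(max_workers, max(1, n // min_per_worker))
    per = -(-n // workers)               # ceiling division
    return [items[i:i + per] for i in range(0, n, per)]
```

Because the input is sorted first, all versions of the same base name land in the same contiguous chunk, so each worker can do its compare-and-reduce pass independently.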
In development, I found it useful to write the results to temporary space before they were re-merged by the parent, but once it was working, I switched to named pipes. Note: I DID use /dev/shm for the temporary space, so it was, in some respects, still IPC, but during development I could examine the results in the files created under /dev/shm whenever I wanted to.
I didn't see that threading offered any advantage over multiple procs, since perl threads are really separate procs anyway, and, at least for me, being able to send the child output to an intermediate space during development was really helpful. Since the child and parent both just used FDs, it was trivial to switch them to talking directly over pipes once development was done.
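The procs-plus-pipes pattern described above can be sketched in Python (the original is Perl; this is a minimal fork/pipe skeleton under my own naming, not the author's code -- swapping the pipe for a file under /dev/shm is the one-line change that made development inspection easy):

```python
import os

def run_workers(chunks, worker):
    """Fork one child per chunk; each child writes its results,
    one per line, down a pipe; the parent reads them all back."""
    pipes = []
    for chunk in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                     # child: keep write end only
            os.close(r)
            with os.fdopen(w, "w") as out:
                for item in chunk:
                    out.write(worker(item) + "\n")
            os._exit(0)
        os.close(w)                      # parent: keep read end only
        pipes.append((pid, r))
    results = []
    for pid, r in pipes:
        with os.fdopen(r) as inp:        # read to EOF, then reap child
            results.extend(line.rstrip("\n") for line in inp)
        os.waitpid(pid, 0)
    return results
```

The parent drains each pipe to EOF in fork order, so no child can deadlock it; a child that fills its pipe buffer simply blocks until its turn comes.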
A larger-than-normal run (I took two different distro releases, combined them, and ran them through) looks like:
> time remove-oldver-rpms-in-dir.pl
Read 35841 rpm names.
Use 9 procs w/3984 items/process
#pkgs=21111, #deletes=14730, total=35841
2 additional duplicates found in last pass
Recycling 14732 duplicates...Done
 Cumulative  This Phase  ID
     0.000s      0.000s  Init
     0.000s      0.000s  start_program
     0.060s      0.060s  starting_children
     0.065s      0.005s  end_starting_children
   118.643s    118.578s  endRdFrmChldrn_n_start_re_sort
   123.202s      4.559s  afterFinalSort
202.70sec 18.16usr 64.72sys (40.89% cpu)
The final gap, from 123s to 202s, was spent moving each of the files to a per-disk recycle bin that I periodically empty via another script.
For me, the use of threads would have complicated things.
Does that give you any ideas?
linda
----
P.S.--
If speed were really important, I could likely benefit from using multiple cores on that final step that takes >100 secs, since it's all single-threaded. I could probably 'rename' at least 3-5 files in parallel ... but it's just a maintenance script, so not a real high priority...