BrowserUK, Thanks for your offer, but I can't share the code at this time. Even if I could, it is too large to share en masse via this site.
Your response got me thinking, or re-thinking, about what the issues are with my current code set. My concerns, at least today until I learn more :), are two: 1.) Scalability (primarily memory), 2.) Program Architecture (extendability and maintainability).
From the scalability standpoint, my current code is broken into 7 discrete, high-level job steps that are run sequentially. Within each job step, I use one or more pools of worker threads to parallelize the execution of portions of the job. My current threads-based approach stores everything needed for the run of each step in shared memory. I chose this because I thought it to be the fastest approach for each step; however, this consumes a lot of memory and includes the overhead of an additional perl interpreter for each thread. I do create the worker threads very early in each step, so I do minimize the amount of memory consumed by each instance. Ultimately, I want to assemble these 7 job steps in an asynchronous framework that will let data flow through each step as it becomes available and have the merge points, which are currently the completion of a step, happen on a more granular basis, so that the application runs more like a peacefully flowing river than a tsunami.
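To make the "river" idea a bit more concrete, here is a minimal sketch of two stages connected by a pipe, using only core Perl. It is not my production code: the record names and the upper-casing "work" are made-up placeholders, and a real step would do an API query or a download instead. The point is only that the second stage handles each record as soon as the first stage emits it, rather than waiting for the whole step to finish.

```perl
use strict;
use warnings;
use IO::Handle;

# One pipe connects the producing stage (child) to the consuming stage (parent).
pipe(my $rd, my $wr) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Stage 1 (child): emit records one at a time as they become available.
    close $rd;
    $wr->autoflush(1);                  # push each record out immediately
    for my $id (1 .. 5) {               # hypothetical record ids
        print {$wr} "record-$id\n";
    }
    close $wr;
    exit 0;
}

# Stage 2 (parent): consume each record as it arrives.
close $wr;
my @processed;
while (my $rec = <$rd>) {
    chomp $rec;
    push @processed, uc $rec;           # placeholder for the real per-record work
}
close $rd;
waitpid($pid, 0);                       # reap the stage 1 process

print join(",", @processed), "\n";
```

With seven steps, each adjacent pair would need a channel like this, which is the bookkeeping I would rather hand to a module than maintain myself.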
From a program architecture standpoint, I am having to do a lot of low-level thread and thread pool management in my code. As I progressed from writing the first job step to the 7th, I developed a module that encapsulates this pattern; however, it isn't work that I am proud of or want to support. I have been through a number of thread pool management modules on CPAN. Most simply don't work, don't fully work, or don't work well with current versions of perl. When I looked at migrating from threads to forks, I started running into the challenges of shared memory IPC.
So, I spent some time back on CPAN, mucking around with some small functional tests. I think I have come up with an approach, using two well-known modules and one relatively new module, that will allow me to address both of my concerns.
- Approach: Use a custom-written, pseudo-event main loop to launch asynchronous processes (i.e. forks) to query the web API.
- IPC::Lite will be used to construct and manage the global variables needed to store application state, information to be shared between processes, and job step data queues.
- Parallel::Forker will be used to construct and manage the level 1 job steps (i.e. manage steps to merge points).
- Parallel::Fork::BossWorkerAsync will be used to manage multiple process pools for discrete job steps (querying the API in parallel, downloading files in parallel).
The use of forks and IPC::Lite should minimize my memory footprint and, on the whole, should be comparable in performance to threads. Parallel::Forker allows me to define a job step sequence that will let me create the merge points needed for process synchronization. Parallel::Fork::BossWorkerAsync will let me spin off pools of forks wherever a step can be parallelized across a pool of workers.
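For anyone following along, the pool-plus-merge-point pattern I am describing looks roughly like the sketch below. It uses only core Perl (fork, pipe, waitpid) rather than the modules named above, so it shows the shape of the idea and not their APIs. The names run_pool and do_work, the doubling "work", and the round-robin split of items are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one unit of work (e.g. one API query).
sub do_work {
    my ($item) = @_;
    return $item * 2;
}

# Run do_work over @items with $n forked workers and return a hashref
# of item => result once every worker has reported back.
sub run_pool {
    my ($n, @items) = @_;
    my (%results, @workers);

    for my $w (0 .. $n - 1) {
        pipe(my $rd, my $wr) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Worker: drop the handles it does not need, then take every
            # $n-th item (a static round-robin split).
            close $rd;
            close $_->{fh} for @workers;
            for (my $i = $w; $i < @items; $i += $n) {
                print {$wr} "$items[$i]\t", do_work($items[$i]), "\n";
            }
            close $wr;
            exit 0;
        }

        close $wr;
        push @workers, { pid => $pid, fh => $rd };
    }

    # Merge point: collect each worker's output, then reap it.
    for my $worker (@workers) {
        my $fh = $worker->{fh};
        while (my $line = <$fh>) {
            chomp $line;
            my ($item, $result) = split /\t/, $line;
            $results{$item} = $result;
        }
        close $fh;
        waitpid($worker->{pid}, 0);
    }
    return \%results;
}

my $res = run_pool(3, 1 .. 10);
print "$_ => $res->{$_}\n" for sort { $a <=> $b } keys %$res;
```

In my planned design, Parallel::Fork::BossWorkerAsync would take over the role of run_pool and Parallel::Forker would sequence the steps between merge points. A static split like the one above is also cruder than a real work queue, since a slow item holds up only its own worker but cannot be rebalanced.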
I believe that this approach will let me get my code to the point where I can consider it production quality from a run and maintenance perspective. I remain interested in pursuing a full event-loop approach, as I think this will give me better extendability and further minimize the code footprint that I will have to maintain; however, either the documentation for something like AnyEvent is too fragmented for me to readily synthesize how to use it, or I'm just not smart enough to pick it up and run with it. In any event, I need to spend some time working with it to get my confidence up sufficiently to give it a try.
Again, thanks in advance!