Voronich has asked for the wisdom of the Perl Monks concerning the following question:

I was bitching a blue streak about this issue in CB and realized I needed input on this, just in case my head's not on straight.

I get 10 files a month. They generally report the ongoing state of a list of accounts, each a monthly snapshot.

For the sake of argument, let's assume I can cobble together a primary key for these files, be it an explicit account number or a string composite of fields.

I'm merging these things into one table that looks more or less as follows:

table account_stuff
    account_id   int,
    source_file  varchar,
    host_key     varchar(huge),
    open_date    date,
    close_date   date
The idea is to generate generic account ids (a new field that I'm creating) that correspond to each account across the 10 files as needed, based on a monthly load process.

So the open_date corresponds to either "the date the account started showing up in the file" or "the open date specified in the file" (if one exists).

The close_date is, as you'd figure, optimistically the specified close date or, more interestingly, "the last month in which the account appeared on an input file."
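That inference rule can be sketched as a small helper. This is Python for illustration only; `infer_dates` and the snapshot tuple shape are made up, standing in for however the monthly history actually gets collected:

```python
def infer_dates(snapshots, last_load_month):
    """Infer open/close dates for one account from its monthly snapshots.

    snapshots: list of (month, specified_open, specified_close) tuples in
    load order; months are 'YYYY-MM' strings. Hypothetical shape.
    """
    months = [m for m, _, _ in snapshots]
    # Prefer a date specified in the file; fall back to first appearance.
    open_date = next((o for _, o, _ in snapshots if o), min(months))
    # Prefer a specified close date; otherwise, if the account has stopped
    # appearing, close it as of the last month it was seen.
    close_date = next((c for _, _, c in snapshots if c), None)
    if close_date is None and max(months) < last_load_month:
        close_date = max(months)
    return open_date, close_date
```

So an account seen only in January and February, with no dates in the file, comes back as opened "2014-01" and closed "2014-02" once the April load runs.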

I'm cobbling this together now. But the rough pseudocode in my head seems a bit gross for reasons I can't quite put my finger on. It smells funny:

- For each input file that's loaded:
    - Check its list of primary keys against the account_stuff host_key for a match.
    - If there's no match, add a row.
    - Check the list of account_stuff source_file host_keys against the imported file.
    - If there are missing rows, update with a close_date.
So basically it looks like "left outer join, test for nulls, then right outer join and test for nulls."
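That join-and-test-for-nulls shape, sketched against an in-memory SQLite database in Python. Table and column names come from the post; the `staging` table, file name, and key values are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE account_stuff (
        account_id  INTEGER PRIMARY KEY,   -- the generated generic id
        source_file TEXT,
        host_key    TEXT,
        open_date   TEXT,
        close_date  TEXT
    );
    CREATE TABLE staging (host_key TEXT);  -- this month's keys, one file
""")

def load_month(source_file, month, host_keys):
    con.execute("DELETE FROM staging")
    con.executemany("INSERT INTO staging VALUES (?)",
                    [(k,) for k in host_keys])
    # Left outer join, test for nulls: keys in the file but not yet in
    # account_stuff get a new row (account_id is auto-assigned).
    con.execute("""
        INSERT INTO account_stuff (source_file, host_key, open_date)
        SELECT ?, s.host_key, ?
        FROM staging s
        LEFT JOIN account_stuff a
          ON a.source_file = ? AND a.host_key = s.host_key
        WHERE a.host_key IS NULL
    """, (source_file, month, source_file))
    # The other direction: open accounts missing from this month's input
    # get a close_date.  (The post's rule wants "the last month it
    # appeared"; for brevity this stamps the current load month instead.)
    con.execute("""
        UPDATE account_stuff
           SET close_date = ?
         WHERE source_file = ?
           AND close_date IS NULL
           AND host_key NOT IN (SELECT host_key FROM staging)
    """, (month, source_file))
    con.commit()

# Hypothetical two-month run for one of the ten files:
load_month("fileA", "2014-01", ["k1", "k2"])
load_month("fileA", "2014-02", ["k2", "k3"])
```

After the second load, k1 (which vanished) is closed, k2 stays open, and k3 is opened as of February. The same statements would port straight into stored procs if the database ends up driving this.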

I was going to drive this entirely in perl, but it's starting to look like I could slap it into a series of stored procs.

What am I missing here?

Re: Assigning account numbers, intuiting open and close dates
by InfiniteSilence (Curate) on Apr 09, 2014 at 18:46 UTC
    I think what you are missing is a viable workflow that accounts for errors, missing information, etc. My approach would be,
    • Grab all the data and attempt to import it into a temporary table with no keys and three additional fields: 1) a flag that says 'processed|unprocessed|failed', defaulting to 'unprocessed'; 2) a unique account id; 3) a long error message.
    • Run a utility (who cares what it is written in -- it can be a stored proc if you prefer) that scans each prospective record and fills in the blanks -- the account id. If no suitable account id can be produced, flag the record as 'failed' and update the long error message field with the reason. If it passes, move the record to its final destination.
    • Finally, log every transaction with a timestamp when it was processed, how long it took to run, etc.
    You'll also need a hardfile log to capture instances when the database is unavailable, or when somebody changes permissions on the tables so as to make them unwritable/unreadable/etc.
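    The staging pass above, sketched with SQLite in Python. The `status` and `err_msg` field names and the toy resolver are made up; the point is the scan-and-flag shape:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (
        host_key   TEXT,                       -- raw key, no constraints
        account_id INTEGER,                    -- filled in by the validator
        status     TEXT DEFAULT 'unprocessed', -- unprocessed|processed|failed
        err_msg    TEXT
    );
    CREATE TABLE account_stuff (account_id INTEGER, host_key TEXT);
""")

def validate(resolve_id):
    """Scan unprocessed rows; assign an account id or flag the row failed.

    resolve_id: any callable host_key -> account_id or None.
    """
    rows = con.execute(
        "SELECT rowid, host_key FROM staging WHERE status = 'unprocessed'"
    ).fetchall()
    for rowid, key in rows:
        acct = resolve_id(key)
        if acct is None:
            con.execute(
                "UPDATE staging SET status = 'failed', err_msg = ?"
                " WHERE rowid = ?",
                ("no account id could be derived for key %r" % key, rowid))
        else:
            # Move the record to its final destination and mark it done.
            con.execute("INSERT INTO account_stuff VALUES (?, ?)", (acct, key))
            con.execute(
                "UPDATE staging SET status = 'processed', account_id = ?"
                " WHERE rowid = ?", (acct, rowid))
    con.commit()

con.executemany("INSERT INTO staging (host_key) VALUES (?)",
                [("good",), ("",)])
validate(lambda k: len(k) or None)  # toy resolver: empty keys fail
```

    Failed rows stay in staging with their error message, so a rerun after a fix only touches what's still 'unprocessed'.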

    Celebrate Intellectual Diversity

      Well yeah. A lot of that I just left out because it's not particularly germane to the problem I'm trying to solve. I've got standard process boilerplate surrounding this thing.

      processed/unprocessed doesn't have sustained value over time: a matching run produces simple, deterministic results, and there will either be data or not. "Not" means failed, and "processed" is just a function of whether the monthly run has been done. So there isn't sufficient justification for additional columns that denote purely derivative information.

      Only collect data that can conceivably be useful. I have overall process timing and logging in my boilerplate Perl, but most of that stuff isn't really useful here.