http://www.perlmonks.org?node_id=368857

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
This is going to be a rather long post so I have done my best to use readmore tags and organize the information.
Preface:
A problem circulating at work is how to reproduce the functionality of a product that is going away. Due to the way things are structured, I am more of a "you go do X" guy than someone involved in actually solving the problem. Since I like problem solving, I am going to work on this on my own time and see how it compares to the solution(s) proposed by the team. I have not consulted Google beyond a cursory glance, so there may very well be freeware/shareware/commercial wheels out there.

I have come up with the following assumptions and requirements:

Assumptions:

  • The first row of all files will be a header row
  • There will be a key field in each record that will never change
  • The solution will need to be very "user friendly" - perhaps even a GUI

Requirements:

  • Ability to sort records prior to comparison by user chosen field (key)
  • Comparison level must be at least two levels
    • Record: Add/Modify/Delete
    • Field(s): Add/Modify/Delete
  • By default, all fields will be compared, but this can be overridden by a user-defined subset of fields
  • Ability to toggle case-sensitive comparison at any level (e.g. one field is case sensitive while another is not)
  • Output must have at least two user selectable options
    • Format (plain text, HTML, CSV, etc.)
    • Field order
  • Ability to select and ignore added/removed columns when comparing at record level
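
To make the field-level requirements a bit more concrete, here is a minimal sketch. It is written in Python purely for illustration, and nothing in it is a settled design. The function name `diff_fields` and its parameters `fields` and `ignore_case` are hypothetical, and it assumes each record has already been parsed into a mapping of field name to string value, with a blank value meaning "not present".

```python
def diff_fields(old, new, fields=None, ignore_case=()):
    """Return {field: (change, old_value, new_value)} for fields that differ.

    fields      -- names to compare; None means every field in either record
    ignore_case -- names whose values are compared case-insensitively
    (Both parameter names are illustrative assumptions.)
    """
    if fields is None:
        # Default: compare all fields, keeping first-seen order.
        fields = list(dict.fromkeys([*old, *new]))
    changes = {}
    for name in fields:
        # A missing or empty field is treated as blank.
        old_val = old.get(name) or ""
        new_val = new.get(name) or ""
        a, b = old_val, new_val
        if name in ignore_case:
            a, b = a.casefold(), b.casefold()
        if a == b:
            continue
        if not old_val:
            changes[name] = ("add", old_val, new_val)
        elif not new_val:
            changes[name] = ("delete", old_val, new_val)
        else:
            changes[name] = ("modify", old_val, new_val)
    return changes
```

Passing `fields=["name", "city"]` would restrict the comparison to that subset, and `ignore_case={"name"}` would make only the `name` field case insensitive while every other field stays case sensitive.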

Here is a very rough outline of the program logic I was thinking of:

  • Sort the files by the key field prior to comparison, with the key field first, followed by the remaining fields in ASCIIbetical order (potentially a very expensive up-front operation)
  • Use a weave method so that only two records (one from each file) need be loaded into memory at any one time
  • Step 1: If the keys match, proceed to individual field comparison if the flag is set (Step 2)
  • If the "new" key is less than the "old" key, this is a newly added record -> get another record from the new file and go back to Step 1
  • If the "new" key is greater than the "old" key, the old record has been deleted -> get another record from the old file and go back to Step 1
  • Step 2: If the field name matches, proceed to field value matching if the flag is set (Step 3) - Note: If the user has only selected testing at the record level with no other options, it may be worthwhile to compare the raw lines from the files and not the individual fields
  • Follow the same logic as Step 1, except paying attention to the field ignore flags
  • Step 3: Determine if the data is new (old is blank and new is not), deleted (new is blank and old is not), or modified (both are non-empty but not the same) and then get the next record from both the new and old file
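
The record-level weave in Step 1 can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: both inputs are already sorted by the key field, keys compare as plain strings (consistent with the ASCIIbetical sort above), and the function name `weave` and the status labels are hypothetical.

```python
import csv
import io

def weave(old_rows, new_rows, key):
    """Yield (status, old, new) for two iterables of dicts sorted by `key`.

    Only one record from each side is held at a time.
    """
    old_iter, new_iter = iter(old_rows), iter(new_rows)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        if old is None or (new is not None and new[key] < old[key]):
            # Key exists only in the new file: added record.
            yield ("add", None, new)
            new = next(new_iter, None)
        elif new is None or new[key] > old[key]:
            # Key exists only in the old file: deleted record.
            yield ("delete", old, None)
            old = next(old_iter, None)
        else:
            # Same key: a field-level comparison (Step 2) would start here.
            status = "same" if old == new else "modify"
            yield (status, old, new)
            old, new = next(old_iter, None), next(new_iter, None)

# Made-up sample data; the first row of each file is the header row.
old_csv = "id,name\n1,Ann\n2,Bob\n4,Dee\n"
new_csv = "id,name\n1,Ann\n3,Cy\n4,Dana\n"
old_rows = csv.DictReader(io.StringIO(old_csv))
new_rows = csv.DictReader(io.StringIO(new_csv))
for status, old, new in weave(old_rows, new_rows, "id"):
    print(status, (old or new)["id"])
```

With this sample data the loop reports record 1 as unchanged, 2 as deleted, 3 as added, and 4 as modified. Because the readers are consumed lazily, memory use stays flat regardless of file size; the cost is that an unsorted input silently produces wrong results, so the up-front sort is not optional.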

So what advice can you offer? Any code snippets, currently available products, implementation strategies, etc. would all be much appreciated.

Cheers - L~R