http://www.perlmonks.org?node_id=928819

raybies has asked for the wisdom of the Perl Monks concerning the following question:

This might be a meditation of sorts, dunno... but I'm kinda stalled.

Essentially I have a large table of data that I was able to extract from a large C source codebase. Each row represents a sourcefile/line# where there's a particular function call.

Further, each row (function call instance) must be painstakingly checked by users, who must manually examine the codebase, classify the entries, and insert comments.

One of the end products of the data will be a comprehensive report on the system as a whole.

To make matters more complicated, the codebase may change over time, so some of the automatically gathered data may need to change, while the user-entered records associated with those entries should stay the same and remain tied to them.

Currently, with a couple of scripts, I can gather the information automatically into a single hash table, keyed by the sourcefile:line#. (Of course the line# may change in the future, and entries may be deleted and added over time too...)
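
Roughly the shape of what I've got so far (the field names here are invented for illustration):

    my %calls = (
        'src/motor_ctl.c:142' => {    # keyed by sourcefile:line#
            function => 'send_msg',         # gathered automatically
            message  => 'PUMP OVERRUN',     # gathered automatically
            severity => undef,              # to be filled in by a user
            comment  => undef,              # to be filled in by a user
        },
    );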

I'm thinking one approach might be to keep the latest "snapshot" of the code, periodically generate the data fresh from the codebase, and then write a compare script to look for differences against that snapshot.
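
For the compare step I picture something like this (a minimal sketch, assuming both snapshots are hashes keyed the same way):

    # Report which sourcefile:line# keys appeared in the fresh
    # snapshot and which vanished from the old one.
    sub diff_snapshots {
        my ($old, $new) = @_;
        my @added   = grep { !exists $old->{$_} } keys %$new;
        my @deleted = grep { !exists $new->{$_} } keys %$old;
        return (\@added, \@deleted);
    }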

I could just dump the hash (with Storable) to save the actual data, but that's not something I can have multiple users manipulate, say in a text tool, to attach notes, change field values, etc.
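
The dump itself is the easy part, which is what makes it tempting:

    use Storable qw(nstore retrieve);

    nstore \%calls, 'snapshot.stor';         # write the snapshot to disk
    my $calls = retrieve('snapshot.stor');   # read it back later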

My skills with SQL are pretty limited. (I've never been formally trained in DB/information management, so I consider myself the perpetual noob.) I haven't a whole lot of experience with it--and I'm in a time-crunch (of course)--so I'm struggling with jumping into a tool so huge that I drown in details like properly "formalizing my schemas"... or some such thing I just don't know about yet. Then again, the short answer might just be "Suck it up!" and "Get over it."

I'm curious at a higher level how one goes about managing data like this once they have it.

Replies are listed 'Best First'.
Same thing as last night, Pinky, try and take over the world!
by blue_cowdawg (Monsignor) on Sep 30, 2011 at 14:30 UTC
        I'm curious at a higher level how one goes about managing data like this once they have it.

    My first reaction is that you are probably overthinking this a bit.

    First off, here are a few things you need to consider no matter what path you take:

    1. What are your short- and long-term goals for gathering this data in the first place? Answering this question may help you decide on a proper means of persisting the gathered data now that you have it. Flat text files? Sybase? MySQL? NoSQL? Perhaps a combination of these.
    2. What is the end product or products that are going to come out of this?

    Once you have decided on a strategy, go forward with it. If you are queasy at the thought of working with SQL, I'd partner with someone skilled in it and let them figure out your table schema and even write queries for you. Take the queries, encapsulate them in a Perl module, and forget about them once they are working the way you want them to.
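
    A minimal sketch of what such a wrapper module might look like, assuming SQLite via DBI and a made-up messages table:

        package MessageDB;
        use strict;
        use warnings;
        use DBI;

        sub new {
            my ($class, $dbfile) = @_;
            my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
                                   { RaiseError => 1 });
            return bless { dbh => $dbh }, $class;
        }

        # Callers get a method; the SQL stays hidden in here.
        sub messages_by_severity {
            my ($self, $severity) = @_;
            return $self->{dbh}->selectall_arrayref(
                'SELECT file, line, message FROM messages WHERE severity = ?',
                { Slice => {} }, $severity,
            );
        }

        1;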


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      Well, the programs run a large, complex "machine" (keeping it generic). Each piece of the machine has a number of realtime programs that run it, separate control systems, and they communicate with a central computer system/display.

      I've collected all the messages that come from each of the subsystems that communicate with this central display process (by examining the actual C source code of each program used in each machine, because that's where the messages originate). They're scattered through a dozen independent realtime systems with interchangeable components, etc.

      The users of the whole system want to know why some of the messages (often error messages) appear on the display even when there isn't a problem, and other times they simply don't know if the messages are important.

      Many of the messages have been misclassified, so part of this job would be to analyze the messages, and rank them according to how severe they are.

      I may eliminate some of these messages directly in the code.

      Some may need to be made more severe.

      There are hundreds of possible messages coming from each machine, and the users have no idea what all of them mean--as many are very cryptic.

      So part of the job will be making the messages more user friendly.

      And I will want to make notes on these items.

      Finally, I need a way to report on all the messages of a given severity, or by the piece of hardware they originate from--probably in a table.

      Essentially providing the users of the system with documentation (because they were given none by the developers)...
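
      For the report itself I picture a rough pass over the collected hash, something like this (the severity field and layout are placeholders):

          # Count messages per severity for the summary table.
          my %by_severity;
          for my $rec (values %calls) {
              push @{ $by_severity{ $rec->{severity} // 'unclassified' } }, $rec;
          }
          printf "%-14s %s\n", 'SEVERITY', 'COUNT';
          printf "%-14s %d\n", $_, scalar @{ $by_severity{$_} }
              for sort keys %by_severity;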

      I guess it gets complicated because the data I've collected may affect the source. The source is where I originally generated the data, and because every message is identified according to its position in the source code, that position could change if someone adds a single line of code to a file with multiple messages in it.

      We don't have a lot of DB experts around here... they're mostly engineers with very little experience in software architecture. I'd be in the same boat, but I've picked up most of my "good" habits from my love of Perl. It just kinda makes you think about doing things in a better way.

      I appreciate the feedback. I know I sound a bit distracted and scattered--and I admit the vague nature of this is more a question of software architecture than one of Perl--but like I say, the smartest minds I know in computers tend to be here. And Perl just makes things that much easier.

        Since you can edit the messages in the C source, I suggest adding a unique but easily distinguished error ID code to each one. For example:

        "Something bad happened. (err0r_aaaa)" "Oh crap - you're hosed. (err0r_aaab)" "The thingamajig dumped. (err0r_aaac)"
        Then a simple $ grep -rin err0r * in the top-level source directory will result in something like:
        some/dir/something.c:42: dump_err("The thingamajig dumped. (err0r_aaac)");
        some/thisthing.c:32: log("Something bad happened. (err0r_aaaa)");
        thatthing.c:41: return "Oh crap - you're hosed. (err0r_aaab)";
        which you can split into path/filename, line#, and error message to update whatever you use to track the error messages.
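
        A quick sketch of that split (the regex assumes the err0r_xxxx ID format above; pipe the grep output straight into it):

            # Parse each grep hit into path/filename, line#, and error ID.
            while (my $hit = <>) {
                next unless $hit =~ /^(.+?):(\d+):.*\((err0r_\w+)\)/;
                my ($file, $line, $id) = ($1, $2, $3);
                print "$id\t$file\t$line\n";   # update your tracking data here
            }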

        Bonus: whenever someone receives an error message, you can use the (idcode) to determine the exact source of the message, even if different .c files have otherwise identical messages.