
My suggestion? Start simple, as basic as you can. Something like, "always store the original XML in a dedicated, dated reference table". Some people might balk at that, but from my frame of reference, you don't want to throw anything out in case you need to restart. It also gives you a common DB-centric basis for schema building, even if you have to throw away attempts. Since you presumably already have tools capable of producing most of what you want from the XML files, you don't lose any ground here.
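To make that concrete, here is a minimal sketch of such a reference table, in Python with SQLite purely for illustration. The table name `xml_archive`, its columns, and both function names are my own assumptions, not anything taken from your feed or your existing tools.

```python
import sqlite3
from datetime import datetime, timezone


def open_archive(path=":memory:"):
    """Open (or create) the database holding the untouched XML files."""
    conn = sqlite3.connect(path)
    # Hypothetical layout: one row per received file, never updated in place.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS xml_archive (
            id          INTEGER PRIMARY KEY,
            source_name TEXT NOT NULL,   -- file name or feed identifier
            received_at TEXT NOT NULL,   -- ISO 8601 timestamp (UTC)
            raw_xml     TEXT NOT NULL    -- the document exactly as received
        )
        """
    )
    return conn


def archive_xml(conn, source_name, raw_xml):
    """Store one XML document verbatim and return its archive id."""
    received_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO xml_archive (source_name, received_at, raw_xml) "
        "VALUES (?, ?, ?)",
        (source_name, received_at, raw_xml),
    )
    conn.commit()
    return cur.lastrowid
```

Every later schema attempt can then be rebuilt from `xml_archive` alone, which is the whole point of keeping it.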

Follow up with some basic schema strategies. Personally, I would consider creating tables keyed on the smallest common unique-key pattern while holding the largest field set. Plan on creating new timestamped rows, processed from each XML file and stored separately. That way you can back-trace what changed in each record and when it occurred. No matter what you do, you can't really do it wrong, because you have retained the original basis and the reports you came in the door with. If some XML field groups or components are expected to change, you can sub-group them or pack them into a container field for later processing. My favorite container format is JSON, but YMMV.
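A rough sketch of the timestamped-row idea with a JSON container field might look like the following. Everything specific here is an assumption for illustration: the `<company id="...">` element, the `name` and `country` fields, and the `company_snapshot` table are hypothetical and will not match the real feed.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical fields considered stable enough to get their own columns.
STABLE_FIELDS = ("name", "country")


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE company_snapshot (
            company_id TEXT NOT NULL,
            loaded_at  TEXT NOT NULL,   -- one row per record per load
            name       TEXT,
            country    TEXT,
            extra_json TEXT NOT NULL,   -- everything not yet modelled
            PRIMARY KEY (company_id, loaded_at)
        )
        """
    )
    return conn


def load_snapshot(conn, raw_xml, loaded_at=None):
    """Insert one new timestamped row per <company> element; never update."""
    loaded_at = loaded_at or datetime.now(timezone.utc).isoformat()
    root = ET.fromstring(raw_xml)
    count = 0
    for company in root.iter("company"):
        # Flat child elements only; a repeated tag keeps its last value.
        fields = {child.tag: (child.text or "").strip() for child in company}
        stable = [fields.pop(f, None) for f in STABLE_FIELDS]
        conn.execute(
            "INSERT INTO company_snapshot VALUES (?, ?, ?, ?, ?)",
            (company.get("id"), loaded_at, *stable,
             json.dumps(fields, sort_keys=True)),
        )
        count += 1
    conn.commit()
    return count


def history(conn, company_id):
    """Return every stored version of one record, oldest first."""
    return conn.execute(
        "SELECT loaded_at, name, country, extra_json FROM company_snapshot "
        "WHERE company_id = ? ORDER BY loaded_at",
        (company_id,),
    ).fetchall()
```

Calling `history(conn, some_id)` gives you the back-trace: each load leaves its own row, and whatever you have not modelled yet sits in `extra_json` until you decide what to do with it.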

Once you have found a schema-building strategy that seems to work, you can build on that foundation and continue your work into sub-table building. As you carve out the bigger chunks, certain patterns emerge and smaller chunks become apparent. You can then split those features out into their own storage tables.

At some point, you will have a lot of non-normalized data in your database. That really is not an issue, because what you are really after is retaining the original XML files and the table system derived from them. If you plan your stairstep approach properly, all the data you need gets redeployed into your current stable construction, which should allow you to drop some of those earlier exploratory tables.

Obviously, in this plan, you need to be fairly flexible on the reporting side. Every stable stairstep will need to take the schema changes into account. If you plan this carefully, your work acquiring the stored data should be extremely straightforward. Simple object-oriented setters/getters based upon schema revision should be all that you need to get the DB fields into the report generation system.
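By revision-based getters I mean something along these lines. It is only a sketch: the two revisions, their column names, and the `wrap` helper are invented for the example, so substitute whatever your stable stairsteps actually look like.

```python
import json


class RecordV1:
    """Reader for a hypothetical first stable schema (flat columns)."""

    def __init__(self, row):
        self._row = row

    @property
    def name(self):
        return self._row["company_name"]

    @property
    def sic_code(self):
        return self._row["sic_code"]


class RecordV2:
    """Reader for a hypothetical second schema (renamed column, JSON extras)."""

    def __init__(self, row):
        self._row = row

    @property
    def name(self):
        return self._row["name"]

    @property
    def sic_code(self):
        # In this revision the field lives in the JSON container column.
        return json.loads(self._row["extra_json"]).get("sic")


# One reader class per stable schema revision.
READERS = {1: RecordV1, 2: RecordV2}


def wrap(row, revision):
    """Give the reporting code one interface regardless of revision."""
    return READERS[revision](row)
```

The report generator only ever asks for `record.name` or `record.sic_code`; each new stable stairstep adds one reader class and leaves the reports alone.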

I am not familiar with Mergent Global Company Data, but being XML, there should be a ruleset (a DTD or schema) with which the documents comply. It may be readable by you, or it may look like line noise. Even if it is the latter, it may give you some insight into the deviations your data might take.

Given that the previous developer is gone, I would plan to avoid any previous work that deviates from the plan I have suggested. My plan gives you a basis, a development path (from which you can withdraw and retry), and a methodology that permits the project as a whole to mature, produce intermediate stable results, and eventually produce a completely integrated XML-file-to-DB storage solution.

In reply to Re: Building a database from XML data feed by snopal
in thread Building a database from XML data feed by mattr
