
Parse PHP or Perl and Reconstruct MySQL Schema

by Anonymous Monk
on Jun 04, 2006 at 05:22 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've just learned of a situation where, due to an unfortunate lack of communication between a website owner and the server admins, the site owner has a copy of all of the PHP for his site but none of the MySQL tables that the site used. In order to make the site functional again, someone just needs to read through the PHP and reconstruct an empty database to fit the code. I haven't seen the application yet, but I don't think it is too complex, so it probably isn't a very difficult or time-consuming project. However, it got me thinking: is it possible to write a program that would parse PHP or Perl code and automatically generate the SQL necessary to construct the database the PHP or Perl was expecting? It seems like an interesting project to me. My first thought is that one way to tackle it in Perl would be to write a DBD driver that intercepted calls to the database, checked for appropriate tables and columns, and created them if necessary.

Any other ideas?

Replies are listed 'Best First'.
Re: Parse PHP or Perl and Reconstruct MySQL Schema
by roboticus (Chancellor) on Jun 04, 2006 at 12:45 UTC
    I apologize in advance for the length of my response, but I've had to do this sort of task a couple of times...

    Depending on the coding style of the original programmer, the task could range from straightforward to absolute chaos. If the programmer uses a consistent coding style, e.g., using $sql for SQL text, explicitly naming columns to be retrieved, etc., then it can be a pretty easy job. However, you'll still need to figure out the proper indices for the tables, as hesco mentions, and as samtregar mentions, type information could be difficult to infer.

    On the absolute chaos side of things, the coder(s) could use inconsistent styles, making the SQL difficult to locate. They could just use SELECT * FROM TableName, as described by CountZero and UnderMine. Even worse, the programmer(s) could put all sorts of database manipulation in poorly-named stored procedures, and rely on triggers to do things behind your back.

    So, before estimating the time it would take to reproduce the database structure, be sure to do a quick review of the code to see how it looks. After doing the review, if you still decide to bid on the job (and win it), here are a couple of tips that may help.


  • Scan the code and pull out all SQL that you can find. For each statement, sketch out the tables, columns, and relationship(s) between them. In the immediately surrounding code, try to infer the meaning, use, type and range of values for each column. Jot down notes on what each column is used for, because you'll be wondering that later! Once you're done, build a new diagram showing all known tables, columns and relationships. (Use a convenient tool, as you'll have to revise this relatively frequently.)
  • Instrument the code so you can verify coverage of all functions, statements, and control-flow paths in the application.
  • Instrument the code to log every SQL interaction with the database: Show the SQL sent to the database (to decipher dynamically-generated SQL), and show the results returned by the database.
  • Write the DDL to create your database, and use it to create an initial database. Also write a DML script to insert any data that you know must exist from reading the code.
  • Now the grunt work...
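    The SQL-logging step above can be sketched in core Perl. Here run_sql is a hypothetical choke point invented for illustration; in a real application you would route every database call through something like it (or use DBI's built-in tracing) so that every statement and its bind values end up in a log:

```perl
#!/usr/bin/perl
# Sketch of the SQL-logging instrumentation described above.
# run_sql() is an invented choke point: route every query through it
# and you get a complete record of what the app sends to the database.
use strict;
use warnings;

my @sql_log;    # in a real application, append to a log file instead

sub run_sql {
    my ($dbh, $sql, @bind) = @_;
    push @sql_log, { sql => $sql, bind => [@bind] };
    # In the real application you would now do:
    #   my $sth = $dbh->prepare($sql);
    #   $sth->execute(@bind);
    #   return $sth;
    return;    # stubbed out for this sketch
}

# Simulated application calls:
run_sql(undef, 'SELECT name, email FROM users WHERE id = ?', 42);
run_sql(undef, 'UPDATE users SET email = ? WHERE id = ?', 'x@example.com', 42);

printf "%s [%s]\n", $_->{sql}, join(',', @{ $_->{bind} }) for @sql_log;
```

    Reading the log alongside the diagram from the first step tells you which columns are actually read and written, and with what kinds of values.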

    Now you've got an iterative process before you. Create your database with the DDL script, populate the data with your DML script, and run your application.

    For each run, try to exercise as many code paths as possible, and see what happens. In some runs, be "kind" to the application (i.e., give it "reasonable" data values) to help you get further into the application. On other runs, be "mean", giving the app oddball values to try to flush out any bugs, inconsistencies, etc., to help shed light on the application's interaction with the database.

    Document each "interesting" test case that exercises some feature of the database and application that you don't already have a test case for. Also, try tests that may help expose "hidden" functionality in the database (those aforementioned stored procedures and triggers.) For example, after inserting a new record, check all the reports offered by the application to verify that the appropriate data shows in all locations. If not, it may give you a clue as to something new you'll have to add. (Here, interviewing the app's owner can be helpful. For example, they may tell you that the data normally doesn't show up until the next day, indicating a database maintenance task that runs once a day, or they could indicate that it should show up immediately, indicating that a trigger might be needed.)

    Each time you learn something new about the database, update your initial diagrams with notes, e.g., "the contents of table.column always seem to be 'Y' or 'N'", and "Whenever table2.columnFoo holds 'M', there should be a record in table3 where the key is table2.columnBar+table2.columnBaz." Then update your DDL to add the new knowledge. As required/desired, add to your DML script to help exercise code paths, generate 'what if' cases to test, etc.

    Also, when the application runs correctly, dump your database tables as DML so you can add it to your original DML script to load more data into the database on future runs. (Many features may not be testable without a reasonable quantity of test data.)
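    To illustrate the dump-as-DML idea (in practice mysqldump does this for you), here is a minimal core-Perl sketch that turns rows you already hold in memory into INSERT statements; rows_to_dml and the users table are invented for the example:

```perl
#!/usr/bin/perl
# Sketch: turn captured rows back into INSERT statements so a good test
# run can be replayed from the DML script on future runs.
use strict;
use warnings;

sub rows_to_dml {
    my ($table, $columns, $rows) = @_;
    my @stmts;
    for my $row (@$rows) {
        # Quote defined values (doubling embedded quotes), emit NULL otherwise.
        my @vals = map {
            defined $_ ? do { (my $v = $_) =~ s/'/''/g; "'$v'" } : 'NULL'
        } @{$row}{@$columns};
        push @stmts, sprintf "INSERT INTO %s (%s) VALUES (%s);",
            $table, join(', ', @$columns), join(', ', @vals);
    }
    return @stmts;
}

my @dml = rows_to_dml(
    'users',
    [qw(id name)],
    [ { id => 1, name => "O'Brien" }, { id => 2, name => undef } ],
);
print "$_\n" for @dml;
```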

    Constrain your data

    In a relatively short period of time, you should be able to have a pretty good picture of the database. Then you can start looking over the relative frequency of various queries. Analyze this usage pattern to determine what indices to create on which tables. Analyze your table diagrams and see what foreign keys to add to your tables, what constraints make sense, and start tying down your database. After all, once you let crap data get into your database, your application will have to work much harder to produce reasonable results. Whenever possible, use constraints, foreign keys, default values, etc.

    Database Maintenance

    Finally, once you've done all that, start thinking about how the database will grow, and what sorts of maintenance it will require. Are there tables that will need to be cleaned out? Will some data expire? If database maintenance code already exists, be sure to test it out. If not, write what you think you'll need--and be *certain* to test it out. (You wouldn't believe the amount of database maintenance code that fails when it finally triggers because the authors either don't test it, or don't test it after schema changes.)

    Save your work

    Finally, put *all* your work in a binder and tuck it away for the next time you need it. (Even if it's not particularly neat, put it in the binder. If you never need it, it won't matter that it's not neat. But if you do need it, you won't mind that it's a bit of a mess.) And believe me, you'll need it. About six months from now, a case will come up that you never tested (no matter how hard you try!), and you'll have to fix it. That's when the binder will be useful! Be sure to add the funky breakage to your test cases, and add any required data to your DML script, so when you make changes in the future, you can exercise that new test case. A lot of work? Yeah, it can be. But it's less work than just hacking at it, especially the second time, when it's time to fix something later or add a new feature. That documentation will nearly always pay for itself.

    This is what works for me, YMMV.


    Why, yes, I *do* deal with many large databases and crappily-written applications with no documentation. How ever did you guess?

      Your approach seems so thorough that it makes me wonder how you handle the inevitable "Situation Theta" when it happens.

      (where "Situation Theta" is defined as the set of all 'features' in the code introduced for reasons unknown, even to the original developer(s) at the time of introduction, but left in place anyway, for whatever reason, or no reason at all)

      Get enough "Thetas" and you may wind up 'rewriting' features that were never really implemented to begin with. I hope *you* got the credit for those. ;)


        I currently have a similar situation with respect to transferring an application from one server to another. The client says X doesn't work, we go back to the old server, and lo and behold, X never worked.

        Now your account handlers have to earn their money by explaining X never worked. Even more fun when the client has already paid for X and been told it was working.

        Transfer of applications from one company to another or one group of developers to another is never fun.



        I've never heard it called "Situation Theta" before, but I like the name. Unfortunately, I don't have a strategy for detecting those. So currently, I just waste my time trying to get the code path(s) executed and corrected, only to be told by the client that "Oh...that feature never worked, and we don't really need it anyway.". <sigh...>

        If you're a nine-to-fiver, it's a real bummer. If you're a contractor, at least they get to pay for the time!


Re: Parse PHP or Perl and Reconstruct MySQL Schema
by samtregar (Abbot) on Jun 04, 2006 at 06:23 UTC
    It might be possible to get the general structure from such a reverse-engineering DBD, but you probably can't get any type information. One problem I see with your approach is that your average app is expecting to get real data back from DBI and will probably crash early if you can't provide reasonable responses.

    A simpler approach would be to scan the code for strings that look like SQL. You could go with the low-tech approach of scanning for strings that begin with select, insert, update, and delete, or use SQL::ReservedWords::MySQL to build a more accurate regex. Then, if you're feeling really ambitious, send the strings through SQL::Statement to pull out the tables and columns.
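    A minimal core-Perl sketch of that low-tech scan might look like the following; the regexes are deliberately crude (they will miss concatenated or dynamically-built SQL), and for real work SQL::Statement would parse the extracted statements properly:

```perl
#!/usr/bin/perl
# Low-tech scan described above: pull out string literals that look like
# SQL and note which tables they touch. Core Perl only.
use strict;
use warnings;

sub scan_for_sql {
    my ($source) = @_;
    my %tables;
    # Crude: match single- or double-quoted strings starting with a DML verb.
    while ($source =~ /(["'])\s*((?:SELECT|INSERT|UPDATE|DELETE)\b.*?)\1/gis) {
        my $sql = $2;
        # Grab the identifier after FROM/INTO/UPDATE/JOIN as a table name.
        $tables{lc $1}++ while $sql =~ /\b(?:FROM|INTO|UPDATE|JOIN)\s+(\w+)/gi;
    }
    return sort keys %tables;
}

my $php = <<'END';
$res = mysql_query("SELECT id, name FROM users WHERE id = $id");
mysql_query("INSERT INTO log_entries (msg) VALUES ('hi')");
END

print "$_\n" for scan_for_sql($php);   # log_entries, users
```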


Re: Parse PHP or Perl and Reconstruct MySQL Schema
by hesco (Deacon) on Jun 04, 2006 at 08:10 UTC
    I had to do this once for some code I had written perhaps a year or so earlier without having ever documented the schema. It can be done. Your suggested approach seems onerous to me. I think I simply grep'd my script for every declaration of $sql, as I always name my queries that way (or $sql1, $sql2, etc., for nested queries).

    That gave me plenty of information to reconstruct a framework for the database. But then I spent a couple of days exercising various paths through the code to flush out all of the errors. When one would surface, I'd track it down to what was missing in my schema, run my alter commands, update my build_schema.sql script and test some more.

    Eventually, after enough of this, my code started running w/o errors, and I used a db dump to take a snapshot of the schema, compared it to what I had developed as I went, and resolved the differences. That then got stashed with the code so I didn't have to go through that again. The tricky part was restoring the indices; that took the sort of consideration that would have gone into the initial development, and there were no error messages telling me I had fed in the wrong number of bind variables or inserted data into a column that didn't exist. Responding to errors is pretty straightforward, but the absence of errors is not proof that the code works as intended. So a thorough code review at the end would seem to be in order, preferably by someone not involved in reconstructing the package.

    -- Hugh

    if( $lal && $lol ) { $life++; }
Re: Parse PHP or Perl and Reconstruct MySQL Schema
by CountZero (Bishop) on Jun 04, 2006 at 08:24 UTC
    Interesting solution you suggest, but AFAIK there is no "empty" DBD-driver available which does what you want.

    Also, most probably as soon as the first SQL-command is run and it does not return with the expected data, the script will crash, die or otherwise stop working.

    Finally, it is sufficient to have one "SELECT * FROM table" to get totally stuck, as you will never find out what fields were expected unless you go by hand through the code and trace what all the returned data is expected to do / contain.


    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Parse PHP or Perl and Reconstruct MySQL Schema
by UnderMine (Friar) on Jun 04, 2006 at 11:33 UTC

    I do not believe such a thing would be possible in a DBD driver.

    You need to look at the SQL statements that the system uses. If you have 'SELECT id FROM user', this makes things simple. However, if you have "SELECT * FROM $X", it requires examining the surrounding code. A lot of systems abstract the DB calls into a module. If this just wraps a simple SQL execute, that is not much of a problem, but if the SQL is generated on the fly it could take a long time to work out what is happening.

    The easiest way to reconstruct the schema would be to write a script that first extracted the current SQL and its location in each script. You might be able to create a fake DBD driver that dumps the SQL to a log and then returns an empty result set. Again, SELECT * is your enemy. It should never be used, but people still do it and never think of what happens if someone adds a blob containing gigabytes of data to the table.

    Thinking about this, you might be best off with a DBD driver that returned an empty set for known SELECTs but died on SELECT *, allowing you to investigate further. INSERTs, DELETEs, and UPDATEs should just be logged.
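    The decision logic of such a driver can be sketched without writing a DBD at all; handle_sql below is an invented stand-in for the driver's execute hook:

```perl
#!/usr/bin/perl
# Sketch of the dispatch rules described above: an empty result set for
# plain SELECTs, die on SELECT *, and a log of everything that writes.
# A real implementation would live inside a fake DBD driver.
use strict;
use warnings;

my @write_log;

sub handle_sql {
    my ($sql) = @_;
    if ($sql =~ /^\s*SELECT\s+\*/i) {
        die "SELECT * found -- inspect the surrounding code: $sql\n";
    }
    elsif ($sql =~ /^\s*SELECT\b/i) {
        return [];    # empty result set keeps the app limping along
    }
    elsif ($sql =~ /^\s*(?:INSERT|UPDATE|DELETE)\b/i) {
        push @write_log, $sql;
        return 1;     # pretend the write succeeded
    }
    die "Unrecognized statement: $sql\n";
}

handle_sql('INSERT INTO users (name) VALUES (?)');
my $rows = handle_sql('SELECT id FROM users');
print scalar(@$rows), " rows; ", scalar(@write_log), " writes logged\n";
```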

    Such a DBD driver would be interesting but would probably be more work than parsing the original code for SELECT, UPDATE, INSERT and DELETE.

    Hope it helps

Re: Parse PHP or Perl and Reconstruct MySQL Schema
by holli (Abbot) on Jun 04, 2006 at 18:17 UTC
    Just to note the obvious: are you sure that neither the client nor the provider has a backup or dump of the database? Maybe you should ask your client insistently again.

    At my place, I had to reconstruct a database too. It was not so bad, just three tables. But of course, after I had finished it, a dump on DVD showed up in somebody's desk.

    holli, /regexed monk/
Re: Parse PHP or Perl and Reconstruct MySQL Schema
by TedPride (Priest) on Jun 04, 2006 at 21:02 UTC
    All you can do is get a list of all the queries used; sort out and arrange the queries into sets of fields for each table; then assign a type for each field based on (a) its name and (b) the form fields that feed data to it. For instance, if a field is named addr1 and the form field is <input type="text" name="addr1" size="50" maxlength="100"> and you want to make it fixed-width, the logical size is 100 characters. If on the other hand the field is named ltime, you'll have to go into the code and check context to decide which type of form field to use.
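    That heuristic is easy to sketch in core Perl; guess_column_type and its fallback rules are invented for illustration:

```perl
#!/usr/bin/perl
# Sketch of the form-field heuristic above: infer a column definition
# from the <input> tag that feeds the field its data.
use strict;
use warnings;

sub guess_column_type {
    my ($input_tag) = @_;
    if ($input_tag =~ /maxlength\s*=\s*["']?(\d+)/i) {
        return "VARCHAR($1)";      # form enforces a length; reuse it
    }
    if ($input_tag =~ /type\s*=\s*["']?checkbox/i) {
        return "CHAR(1)";          # e.g. a 'Y'/'N' flag
    }
    return "TEXT";                 # no length hint: play it safe
}

my $tag = '<input type="text" name="addr1" size="50" maxlength="100">';
print guess_column_type($tag), "\n";   # VARCHAR(100)
```

    Fields with no form backing (like the ltime example) still need the manual code-reading pass described above.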

    The database driver idea probably won't work by itself, since many user-run queries are based on the data that's already being displayed, and if it's not being displayed, then you won't get any queries. This is something to look at if the first method fails, though I'd personally just write a small PHP routine to log calls, then rewrite the query lines to require / call this routine. Turning off the logging later will just be a matter of removing the part of the routine that does the logging. Any monkey can do this in under an hour by reading through the necessary part of the PHP documentation, writing a simple routine, then doing a find on all the PHP files for mysql_query and replacing those lines.

Re: Parse PHP or Perl and Reconstruct MySQL Schema
by dragonchild (Archbishop) on Jun 05, 2006 at 02:57 UTC
    You can use DBD::Mock to find out what statements you're sending to the DB. This also has the side-benefit of creating tests for your DB interactions.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Hidden Interactions
by kbrint (Sexton) on Jun 05, 2006 at 21:02 UTC
    There are also server-side interactions that may never appear in your PHP code:
    • Views
    • Stored Procedures
    • Triggers
    • Constraints
    • Indexes (performance-impacting only, results should still be correct)
    • Permission GRANTs
    Unless you get all of these right, the web site might not work like it used to. All of these complicate a general solution to the problem. You'll have to use your intuition about how things should work behind the scenes.
      Given that the system appears to be a port of an existing MySQL database, it is unlikely that Views, Stored Procedures or Triggers will be an issue, as these features were added in MySQL 5. Views, Triggers and Stored Procs

      Historically, MySQL databases have not tended to use Foreign Keys to the same degree as Oracle and other relational databases. Foreign keys only work with InnoDB tables. FOREIGN KEY Constraints

      Subselects were added with MySQL 4.1 (SubQuery Support). I recently came across some historic code that attempted to get around this by creating temporary HEAP tables on the fly. :(

      Grants may be an issue especially if you have any exotic code (like creation of temp tables).

      Hope it helps

Re: Parse PHP or Perl and Reconstruct MySQL Schema
by swampyankee (Parson) on Jun 05, 2006 at 20:02 UTC

    (following paragraph added in update)

    I'm being very presumptuous here, in that you didn't mention trying to get your hands on the specs from which the PHP and Perl were written. If you have tried to get them, I apologize.

    (end of addition)

    Considering that somebody managed to lose an entire database (and its backups?), I guess it would be too much to hope for somebody actually having the documents describing the database tables floating around.

    My thinking is that whoever wrote the PHP and Perl code would have needed descriptions of the database in order to do anything. If this documentation is available, even if slightly outdated, it would make your task easier. Even the original specifications from which the Perl and PHP were written would help.


    e(π√−1) = −1
      This is a classic example of why you should include a database schema as part of your release (but not make it globally accessible, as that is a security risk).

      There are various pieces of software that can generate schemas; one of the better and cheaper ones is Dezign for Databases.
      For: simple to use, generates DDL schemas for a large variety of databases, can reverse-engineer schemas from source code or database connections.
      Against: only available for Windows; cost varies between $230 and $700 depending on version.

      Used this on a number of projects, especially useful when the coder never documented his database and just let it evolve.

      Hope it helps

Re: Parse PHP or Perl and Reconstruct MySQL Schema
by qazwart (Scribe) on Jun 06, 2006 at 13:18 UTC
    Did you write this code yourself? If you did, it might be quicker to simply rewrite the application than to attempt to figure out what the tables looked like. Even if you didn't write the code, you might be better off doing the application from scratch anyway.

    I've been in similar situations, and sometimes I realized I simply had to bite the bullet and start over rather than go through the code to reconstruct the SQL tables. It can take a lot less time to simply rewrite than to figure out what the tables looked like, test your assumptions, and rebuild everything again.

    On the plus side, I usually happen to like the rewritten code much better. I have a better understanding of the application, and how the code interacts with itself. Plus, features and modifications that were squeezed into the code are now part of the base code.

    One of the lessons learned is that your code should generate all the needed SQL tables, views, etc., and fill them with the needed preliminary data. It may seem like a waste of time to spend a lot of effort on code that should only be run once, but this type of thing happens a lot more often than you'd think.

      I'm with you on keeping code to re-generate the databases.

      I keep a few files (create_tables.sql, create_indexes.sql, create_procedures.sql, create_triggers.sql, etc.) for each schema (oracle), and keep a directory near the control files, so it gets backed up regularly.

      Whenever I make an adjustment to the database, it gets reflected in the files -- and I append rather than modify the existing statements. (If I drop or add a column, it's an ALTER TABLE, so I don't risk introducing typos or other mistakes into the earlier stuff that I'd have to manually debug in a disaster-recovery situation ... well, with a minor exception: for a complete DROP TABLE, I'll comment out the previous sections.)

Node Type: perlquestion [id://553466]
Approved by neversaint
Front-paged by grinder