http://www.perlmonks.org?node_id=581058

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm currently involved in a web project with a friend.

It's the first time I'm jointly coding a website with another person. We sometimes have quite different approaches to the same problem, and we fail to see eye to eye.

Here's is one and I hope to seek your advice and opinions.

We are divided over the how to treat the "removal" of a post from a forum. The questionable post will still be somewhere in database, but we need a way to indicate that action was taken against it.

To make things concrete, we've two MySQL tables (simplified) as follows:

posts_tbl
post_id author_id message deleted_flag

removed_tbl
post_id message deleted_by_id date

Let's say we have too approaches A and B.

When a message that is deemed inappropriate is removed via a button click, Approach A goes likes this:

In posts_tbl, the "message" field is updated with the replacement text "Removed by USERNAME" and the "deleted_flag" column with a value of 1 (the default is 0) (Note: the username is available at the point of update, so it's updated into the message filed as part of the replacement text).

In removed_tbl, the original message is inserted into the message field, together with the post_id, the date, the member id of the person carrying out the deletion.

Approach A's points are:

a) When displaying posts in a web page, all posts can be directly retrieved from posts_tbl, including messages that contain "Removed by USERNAME".
b) Saving of processing time because there's no need for additional code to check whether a message was removed.
c) Greater flexibility because different pieces of data are stored in two different tables.
d) Greater expandability. When a different reason for removing the post is needed, a new column can be added to removed_tbl to store that reason.

The sql to retrieve messages from posts_tbl will be straightforward without having to link to removed_tbl to retrieve the username of the person who deleted the post (assuming the other information in removed_tbl is not needed for display). The code will be simpler too.

Approach B is:

removed_tbl should be changed to:
post_id deleted_by_id date

When a removal is carried out:

In posts_tbl, the "message" field remains unchanged while the "deleted_flag" column is updated with a value of 1.

In removed_tbl, the necessary information is inserted (post_id, member id of the person deleting the message, date)

Approach's B points are:

a) The original message should not be touched in posts_tbl.
b) There's duplication of data if the message field of posts_tbl is updated with the string "Removed by ...".
c) There's also duplication of data because "Removed by ..." and "deleted_flag" both indicate the particular post has been removed.
d) The string "Removed by ..." should be added in the code for outputting to a webpage, so that it doesn't get stored anywhere. When a different wording is needed e.g. changing "Removed by ..." to "Deleted by ...", only that the hard-coded string needs to be changed.
e) posts_tbl stores all original posts, so it's easier for retrieval.

Cons of Approach B:

a) The sql to retrieve messages will require an additional condition to link posts_tbl to removed_tbl to retrieve the username of whoever deleted the post.
b) The code will need to check for whether a post has been removed so it can output the string "Removed by ...".

There are other ways for sure. But between the two outlined above, which is a better solution?

It's a bit longish but I need to paint as accurate a picture as possible.

Would appreciate your enlightment :)

Thank you and I look foward to hearing from you :)

Replies are listed 'Best First'.
Re: [OT] When coders don't see eye to eye
by bart (Canon) on Oct 28, 2006 at 09:24 UTC
    I see no reason for the two tables. I'd just store the extra info of the second table, in an additional column in the former table. It's just one column! And the relationship is one to one. (Well ok, you could add additional info, like timestamp of deletion and a reason.)

    I think you underestimate the power of SQL to just show the content of the posts that were not deleted.

    Even if you use 2 tables, the second table (the extra info for deleted posts) needs only contain the post_id and the additional info. Again, SQL can easily join the two tables into a single query (A left join to get all messages, an inner join for the deleted ones). If your database supports views (and I doubt MySQL does, at least the older versions don't), then you can create a view to hold the query, and access it as if it was a single table. Without views, you have to hardcode the join every time, which still isn't so bad.

    The redundancy in your approaches are just a major sin against normalisation: you're storing the same info twice, and if something goes wrong, the message text in both versions don't even have to agree.

      Many thanks :)/p> Storing the extra info in an additional column in the posts table has the advantage that it's neater because everything is on that single table. But it incurs the cost of an extra not-so-useful column for every row in the table.

Re: [OT] When coders don't see eye to eye
by jbert (Priest) on Oct 28, 2006 at 08:21 UTC
    Well, they both work, so the main thing to do is flip a coin, implement one and move on. :-)

    But in the spirit of waffling on, try thinking about the thought processes of a future maintainer. How likely are they to break things (and how serious is it if they do)?

    One thing which you can end up doing a reasonable amount of on a live system is going in with the SQL prompt and taking a look. There's something to be said for having a data model which allows you to run ad-hoc queries without too many joins.

    Another way of breaking the deadlock might be to think about if you choose either approach, how easy would it be to switch to the other (when you've got a bunch of live data)?

Re: [OT] When coders don't see eye to eye
by graff (Chancellor) on Oct 28, 2006 at 12:36 UTC
    jbert's points about maintainability are well taken. A similar point that occurred to me as I read the two descriptions: what sorts of things can go wrong with each solution, and what would be the relative consequences in each case?

    For example, suppose someone is doing "table maintenance" (i.e. using some more general-purpose interface to the database, not the website cgi scripts), and makes some sort of mistake involving "update posts_tbl set deleted_flag=0 where ..." Suddenly, posts that had correctly been deleted are no longer marked as such in this field. Under approach A, their contents still won't appear on the website, whereas in approach B, they will. The converse case ("set delete_flag=1 where ...") will play out in a similar way.

    If that sort of thing could happen, then of course it will (cf. Murphy), and in that scenario, using approach A might be better for the sake of your readers. Then again, that sort of "table maintenance" becomes harder to do under plan A, because you have to copy the message content into one table and also delete it from the other, in addition to (re)setting the "deleted" flag. Of course, you can define stored procedures for "deleting" and "undeleting" posts to make this less of a problem.

    That said, I would still go with bart's idea because it's simpler, and that extra simplicity is well worth the trivial cost of adding a column or two to the posts_tbl that will be null in the vast majority of rows. It does expose you to the sort of mishap scenario I described above, but that's what rollbacks and backups are for...

Re: [OT] When coders don't see eye to eye
by BerntB (Deacon) on Oct 28, 2006 at 08:15 UTC
    My web development experience is quite old, but...

    Shouldn't you also store a motivation why the post was removed?

    You might want to add support for multiple languages -- so don't store hard wired strings into the database without a good reason.

Re: [OT] When coders don't see eye to eye
by dws (Chancellor) on Oct 29, 2006 at 05:23 UTC

    If you were to add a use case that allowed for the removal of a "deleted" indication, Approach A would require that you read a tuple from the remove_tbl, delete that tuple, then use the data from the tuple to update the posts_tbl. Approach B requires a single delete.

    If you were to decide on changing the wording of the "removed by" message, Approach A would require a moderate amount of work and data rewriting. Approach B requires a single change in code.

    The Cons you list for Approach B are minor. Assuming the right index, the outer join to get a post and any remove data won't be an issue unless you're seriously high traffic.

    I have trouble imagining circumstances where I'd even consider A as an option.

      I have trouble imagining circumstances where I'd even consider A as an option.
      I agree in general approach B is better, but I think there are circumstances where it's not. In particular, for a more complex application where there are lots of places that touch this data, and you want the vast majority of them to ignore those marked as deleted, it might be easier to have the deleted entries not be in the main table at all (which I realize now is different from what the OP posted) but rather in a sort of duplicate table off to the side used only for administrative/reference purposes.
        Only if the majority of your posts will be deleted. Which I assume won't be the case. If you have lots of posts, and lots of code querying them, you want access to them as fast as possible. The access will be faster if the database can cache more. Databases cache by page. If a row is retrieved, the entire page the row will be in is fetch from disk, and kept in cache. The smaller your row, the more rows can fit in a database. The amount of pages that can be cached is limited by the environment. So, if you have more rows per page, then you have more cached rows for a given fixed number of pages that can be cached.

        In other words, approach A wastes memory. And that can hurt performance. (Of course, it may not. Unless you measure you won't know. And maybe the difference in performance is insignificant. And even if A has better performance, it has to make a huge difference before I choose such a dirty design - it's feels worse than running without strict and warnings).

        I wouldn't even have a 'deleted' column in the post_tbl table. This information can be deduced from the post having an entry in the removed_tbl table.

Re: [OT] When coders don't see eye to eye
by castaway (Parson) on Oct 30, 2006 at 08:33 UTC
    It's not "which is better" that you need to look at, but "Which better fits our needs"? As someone already pointed out for example, storing "Removed by USERNAME" has implications as at some point you may want to have localised versions of that message. You may want to link to that username in the message, and if you later change how users are linked to, you'll need to update all those.

    Both your approaches have the two tables needing to be kept in sync, which means you need to do careful transaction work to make sure they don't end up disagreeing. I wouldn't use either, I'd probably have the posts table not know anything about deletions, and the removed table list post_id, deleted_by_id, date etc. SQL will quite happily create the list of all posts with/without deletions for you, you can even build the "Removed by username" message in it.

    select posts_tbl.post_id, case deleted_by_id is not null then 'Removed by ' || deluser.name else posts_tbl.message end as message, user.name from posts_tbl left join removed_tbl on posts_tbl.post_id = removed_tbl.post_id left join authors as deluser on removed_tbl.deleted_by_id = deluser.au +thor_id left join authors user on user.author_id = posts_tbl.author_id
    .. as a rough example of how that works.

    C.

Re: [OT] When coders don't see eye to eye
by Anonymous Monk on Oct 30, 2006 at 14:26 UTC
    Many thanks to all for your advice and pointers :)

    Cheers!