http://www.perlmonks.org?node_id=133799

I was tweaking a rather complicated data processing script of mine last week when I ran accross an output pipe I was using to massage the data:
my $output_pipe = "|sort -n -t, -k3,4 | uniq > $tmpfile"; open(OUTFILE,$output_pipe) || die "could not open file $output_pipe fo +r output: $!";
I said to myself: "self, why are you using two processes (sort and uniq) when one (sort with -u flag) would work? That's silly." So, even though it had nothing to do with what I was tweaking, I changed the output pipe ot be:
my $output_pipe = "|sort -n -t, -k3,4 -u > $tmpfile";
and thought to myself "hah! got rid of one extra step."

Now, yesterday I noticed that the script was producing erroneous output. Since the input data often changes (depending on who produced it!) I have no formal spec for the input and I'm often tweaking the processing script to adapt it to new input formats (be liberal in what you accept). I thought that must have happened so I spent a good part of the day trying to track down what had changed. Since the data is several hundred MBs in size this process took a while.

Once I finally exhausted all possiblity that the input data was bad I turned to the script and finally I decided to test to see if "sort -u" was really equivalent to "sort | uniq"

my test data:

$ cat test.dat 1,1 1,2 1,3 2,1 2,4 2,2 2,1 1,1

$ sort -n -t, -k1,2 test.dat | uniq 1,1 1,2 1,3 2,1 2,2 2,4 That's good.

$ sort -n -t, -k1,2 -u test.dat 1,1 2,1 Not what I expected.
Aha! With multiple keys, sort -u does not behave as I had assumed.

The moral of this little story: If it's not broken, then don't mess with it! (Oh, and don't assume either!)

Update: In light of the excellent comments and feedback here, I think I'll modify the moral of this story:

  1. Don't assume. Ever.
  2. Make sure you have unit tests.
  3. Run your unit tests. Frequently.
  4. Don't change something unless you have a good reason to.
  5. If you do change something, see rules 1,2,3,4 above

Replies are listed 'Best First'.
Re: If it's not broken, don't fix it
by bmcatt (Friar) on Dec 21, 2001 at 20:54 UTC
    Corollary Number 1 to RhetTbull's law:

    If you do mess with it, make sure it still works.

    Been bitten by that one more than a few times. :-)

Re: If it's not broken, don't fix it
by merlyn (Sage) on Dec 21, 2001 at 22:21 UTC
    Somewhere in my Unix experience, I picked up on the fact that sort -u only "uniq"s over the fields that it is examining, so to get the exact equivalent of sort | uniq, you need to do a final "sort" on the entire string.

    This is a feature, not a bug.

    But I disagree with your conclusion. Your error is in the prior statement:

    Aha! With multiple keys, sort -u does not behave as I had assumed.
    The good ol' "assume" word there. Bad. That's the error, not messing with unbroken things. I may agree with your statement in principal, but this was a bad example, as it's sort of a double fault accident.

    -- Randal L. Schwartz, Perl hacker

Re: If it's not broken, don't fix it
by MungeMeister (Scribe) on Dec 21, 2001 at 22:54 UTC
    I agree with the moral of the story. However, you were not completely off on your idea to condense to 1 statement. Using your test, the following works:

    sort -n -t, -k1,1 -k2,2 -u test.dat

    Sort needs to look at the 2 fields independently. I used to do a lot of shell scripting, and sort was a frequent tool. The other thing to watch for is some systems (difference between Sequent DYNIX and HP/UX that I know of) requires '-t ,' and others '-t,'. If you don't get it right on your system, sort acts very strange.

    None of this is documented well, that I've seen.

    As someone else put it, "Unit Test, Unit Test, Unit Test".

    I very frequently, will do the test you did BEFORE implementing a change, just to verify my understanding of how things work....

    While we are giving advice:

    Never believe a customer/user when they say "always" or "never"....
Re: If it's not broken, don't fix it
by JPaul (Hermit) on Dec 21, 2001 at 21:54 UTC
    Here here!

    This is a fundamental law that needs to be learned amongst Geeks these days. The need to fiddle and twiddle seems to be pretty well ingrained, and yet it rarely leads to immediate success, but instead hours of trying to work out how to make it do whatever it did before, again.
    I have always been of this belief, something instilled in my by my Mechanical Engineer father - but this ideal became a mantra while working at an ISP:

    The System Admin spent most of his time doing Adminy things, and then whatever time was left he would be bored, and to alleviate said boredom, fiddle with things.
    Upgrade this software package, tweak this configuration, install this monitoring software, so on, so forth.
    And, invariably, something would go wrong, and for two days the rest of the ISP would be complaining that X doesn't work like it did - or at all.

    When this System Administrator left the ISP (after finding a much higher paying job), I took over the role of Administration, and being also the only programmer who was spending most of his time previously on an eCommerce platform, I made things work in the System - and then went back to programming.
    Lo and behold! Things just kept on working, and working, and working - thusly the moral of the story is simply:
    If it ain't broke, don't fix it.
    Because if you don't let it alone, you're in for more trouble than its probably worth.

    JP,
    -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

Re: If it's not broken, don't fix it
by dws (Chancellor) on Dec 22, 2001 at 02:22 UTC
    The problem with If it's not broken, then don't mess with it! is that it can become a management mantra that leads to premature code ossification. ("Ossify" means "to change into bone." In the context of code, it means hardening against change.) You want to be able to "mess" with stuff in a controlled fashion. That's what "Refactoring" is all about. If you don't refactor, you end up with Big Balls of Mud.

    The last thing you want is some pointer-haired manager wandering about doing the "It's done! Ship it! And for diety's sake Don't Touch It!" That road leads to a place you don't want to go.

    I'd rather see this written as

    Don't mess with working code unless you have unit tests!
    Otherwise, how can you be sure?

      You and the others here have provided excellent feedback. I've modified the moral(s) of my story accordingly. Thanks for the feedback!
Re: If it's not broken, don't fix it
by dragonchild (Archbishop) on Dec 21, 2001 at 21:08 UTC
    Corrolary Number 2:
    Even if you test your change, mark it as a possible point of failure.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: If it's not broken, don't fix it
by tachyon (Chancellor) on Dec 21, 2001 at 21:09 UTC

    Unit Tests

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print