PerlMonks  

Puzzle: The Ham Cheese Sandwich cut.

by Perl Mouse (Chaplain)
on Nov 17, 2005 at 14:05 UTC ( [id://509409]=perlmeditation )

I presume everyone is familiar with the problem of finding a median of a set of numbers. (A median of a set is an element of the set such that at most half of the elements are smaller than it and at most half are larger. For instance, in the set 1, 2, 3, 4, 5, 6, the number 3 is a median, and so is 4.) A simple algorithm sorts the numbers and then takes the middle element of the sorted list. But this takes Ω(N log N) time in the worst case, as it needs to sort.
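For reference, the simple sorting approach can be sketched as follows. The name `median_by_sort` is made up for this illustration; for an even count it returns the upper of the two middle elements, which is a valid median under the definition above.

```perl
use strict;
use warnings;

# Median by sorting: O(N log N), not a solution to the warm-up problem.
# A fractional array index is truncated, so @sorted / 2 picks the
# middle element (the upper middle one when the count is even).
sub median_by_sort {
    my @sorted = sort { $a <=> $b } @_;
    return $sorted[ @sorted / 2 ];
}

print median_by_sort(1, 2, 3, 4, 5, 6), "\n";    # prints 4
```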

Warm-up problem: write a subroutine that takes a set of numbers (without implied order, and possible duplicates) and returns a median. The sub should run in O(N) time.

Generalizing this to two dimensions is easy: to find a line that separates a set of points in 2-d into two subsets, each at most half the size of the original set, just ignore the y-coordinates, find the median of the x-coordinates, and pick the vertical line on which x equals that median.

But it's more interesting if you have two sets of points: a set of red coloured points, and a set of green coloured points (both in 2-d). Now, there is a line (or more than one) that simultaneously divides the set of red points and the set of green points such that on the left of the line you have at most half of the red points and at most half of the green points, and on the right of the line you also have at most half of the red points and at most half of the green points. (This means that if there are an odd number of red points and an odd number of green points to start with, the dividing line will contain at least one red and at least one green point.)

Challenge 1a: Write a subroutine that accepts two sets of points (in 2-d), and returns a line simultaneously dividing the sets as described above.
Challenge 1b: Can you do it in O(N log N) time, where N is the total number of points in both sets?
Challenge 1c: Can you do it in linear time, or prove it to be impossible?
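As a baseline for Challenge 1a, here is a brute-force sketch that runs in O(N^3) time, far from the bounds asked for in 1b and 1c. It relies on the observation that a valid dividing line can be slid and rotated until it touches two of the input points without pushing any point across it, so it only tries lines through pairs of distinct points. The names `side`, `halves` and `sandwich_cut_bruteforce` are made up for this illustration, points are assumed to be `[x, y]` array refs, and integer coordinates are assumed so that the sign test is exact.

```perl
use strict;
use warnings;

# Sign of the cross product (q - p) x (r - p):
# > 0 if r is left of the directed line p->q, < 0 if right, 0 if on it.
sub side {
    my ($p, $q, $r) = @_;
    return ($q->[0] - $p->[0]) * ($r->[1] - $p->[1])
         - ($q->[1] - $p->[1]) * ($r->[0] - $p->[0]);
}

# True if at most half of @$set lies strictly on each side of line p-q.
sub halves {
    my ($p, $q, $set) = @_;
    my ($left, $right) = (0, 0);
    for my $r (@$set) {
        my $s = side($p, $q, $r);
        $left++  if $s > 0;
        $right++ if $s < 0;
    }
    return 2 * $left <= @$set && 2 * $right <= @$set;
}

# Brute force: try the line through every pair of distinct points.
# Returns [$p, $q] (two points on the line), or nothing if no such
# line exists (for example when all points coincide).
sub sandwich_cut_bruteforce {
    my ($red, $green) = @_;
    my @all = (@$red, @$green);
    for my $i (0 .. $#all - 1) {
        for my $j ($i + 1 .. $#all) {
            my ($p, $q) = @all[$i, $j];
            next if $p->[0] == $q->[0] && $p->[1] == $q->[1];
            return [$p, $q]
                if halves($p, $q, $red) && halves($p, $q, $green);
        }
    }
    return;
}

my $cut = sandwich_cut_bruteforce([[0, 0], [1, 1]], [[1, 0], [0, 1]]);
printf "line through (%d,%d) and (%d,%d)\n", map { @$_ } @$cut;
```

For the four points in the usage line this prints `line through (0,0) and (1,1)`: both red points lie on the line and one green point falls on each side.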

Now, what holds for 2-d holds for 3-d (and higher dimensions as well). Given sets of red, green and yellow points, there exists a plane that divides the three sets such that at most half of the red, green and yellow points are above the plane, and at most half of each set is below the plane. (This even holds for sets with an infinite number of points, leading to the 'ham-cheese sandwich theorem', which states that you can cut a ham-cheese sandwich into two parts with a single cut such that both halves have equal amounts of ham, cheese and bread - even if you leave the cheese in the fridge.)

Challenge 2a: Write a subroutine that takes three sets of points in 3-d and returns a plane dividing all the sets as described above.
Challenge 2b: Prove that your solution is optimal.

Challenge 3: Generalize the sub to do the same in any dimension.

Have fun!

Perl --((8:>*

Replies are listed 'Best First'.
Re: Puzzle: The Ham Cheese Sandwich cut.
by ambrus (Abbot) on Nov 17, 2005 at 17:57 UTC

    The warm-up problem isn't so trivial. Two possible solutions are described in Cormen – Leiserson – Rivest – Stein: Introduction to Algorithms. One of these is randomized and runs in expected O(n) time. (Update: the other one runs in guaranteed O(n) time, as Perl Mouse has noted in his reply. I'm sorry this wasn't clear from my original post.)

    I've recently implemented this randomized algorithm in Perl, although my implementation is not a very efficient one, as it would be possible to do all its operations in place (with only O(n) extra memory and, more importantly, less time).

    The rest of this post shows my implementation.
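A minimal sketch of randomized selection is given here for illustration. It is not ambrus's code: the names `quickselect` and `random_median` are made up, and like the implementation described above it copies sublists rather than working in place.

```perl
use strict;
use warnings;

# Randomized selection: returns the element of 0-based rank $k in
# ascending order. Expected O(n) time, worst case quadratic.
sub quickselect {
    my ($k, @a) = @_;
    die "rank $k out of range" if $k < 0 || $k > $#a;
    while (1) {
        my $pivot   = $a[ rand @a ];             # random element of @a
        my @smaller = grep { $_ < $pivot } @a;
        my @larger  = grep { $_ > $pivot } @a;
        my $equal   = @a - @smaller - @larger;   # copies of the pivot
        if ($k < @smaller) {
            @a = @smaller;                       # answer is among the smaller
        }
        elsif ($k < @smaller + $equal) {
            return $pivot;                       # answer is the pivot itself
        }
        else {
            $k -= @smaller + $equal;             # answer is among the larger
            @a = @larger;
        }
    }
}

sub random_median { return quickselect(int(@_ / 2), @_) }

print random_median(1, 2, 3, 4, 5, 6), "\n";    # prints 4
```

Counting the elements equal to the pivot separately keeps the loop from stalling on lists with many duplicates.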

      One of these is randomized and runs in expected O(n) time.

      The rest of this post shows my implementation.

      Nice, but the worst case running time is Ω(n^2). It suffers from the same problem as Quicksort: picking a random pivot works well often enough to get a good expected running time, but if you're unlucky, it's really slow.

      There is an algorithm to do it in guaranteed linear time (although when done in Perl, the constants are so high that for most practical situations one is better off using the built-in sort, which is implemented in C, and picking the middle element).

      Perl --((8:>*
Re: Puzzle: The Ham Cheese Sandwich cut.
by Limbic~Region (Chancellor) on Nov 17, 2005 at 14:25 UTC
    Perl Mouse,
    I haven't started working on any of the challenges yet, because I wanted to raise a question first. When I learned about means, modes, and medians in statistics, I thought I remembered learning that the median of an even list is the average of the two middle numbers.
    1, 2, 3, 4, 5, 6 -> 3, 4 -> (3 + 4) / 2 = 3.5
    Is that correct? I guess it doesn't matter if it is since the line that bisects the two lists will still be the median.

    Cheers - L~R

    After posting the question, I realized that the answer doesn't matter. *shrug*
      It depends how you look at the problem. If you look at it as the 1-d variant of the "divide sets using a simplex" problem, any number between the two middle numbers will do. However, if you want to write a Quick Sort whose running time is guaranteed to be O(N log N), you need to find a median in linear time, and you want to find an element of the set - not something in between.

      Whether you find one of the middle elements, or pick a number in between, I'll accept both solutions. ;-).

      Perl --((8:>*

        Are you sure about this? Once you have found a number so that exactly half of the numbers are to the left and half are to the right, couldn't you separate these two classes of numbers, sort them separately, and still get an O(N log N) time sort this way?

        If I had to define the median, I'd say that if you have an even number of data points, any number between the two middle ones is a median. This way the definition is equivalent to saying that the median is a number whose total distance from the given numbers is minimal. This latter definition has parallels: the mean is the number for which the sum of squared distances from the given numbers is minimal. More precisely, given the sequence (x_1, ..., x_N), the mean is the number A that minimizes the expression |x_1 - A|^2 + ... + |x_N - A|^2; the median is M if it minimizes |x_1 - M| + ... + |x_N - M|. Furthermore, informally speaking, the mode C minimizes |x_1 - C|^epsilon + ... + |x_N - C|^epsilon, where epsilon is a very small positive number.
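The two minimization definitions can be checked numerically on the list 1..6 from the original post. This is only an illustration; the helper names `abs_dev` and `sq_dev` are made up here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of absolute deviations of the remaining arguments from $c.
sub abs_dev { my $c = shift; return sum(map { abs($_ - $c) } @_) }

# Sum of squared deviations of the remaining arguments from $c.
sub sq_dev  { my $c = shift; return sum(map { ($_ - $c) ** 2 } @_) }

my @x = (1, 2, 3, 4, 5, 6);
printf "c=%-4s sum|x-c|=%-5s sum(x-c)^2=%s\n",
    $_, abs_dev($_, @x), sq_dev($_, @x)
    for 2, 3, 3.5, 4, 5;
```

The sum of absolute deviations is 9 for each of 3, 3.5 and 4 (and 11 for 2 and 5), so every number between the two middle elements is a minimizer, while the sum of squares is smallest, 17.5, only at the mean 3.5.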

Re: Puzzle: The Ham Cheese Sandwich cut.
by jeffguy (Sexton) on Nov 17, 2005 at 19:56 UTC
    Observation 1 (or maybe it's just obvious): For even numbers of points, there may be more than one correct answer (even aside from trivially jittering the dividing line back and forth a little). Example: two red points (0,0),(1,1) and two green points (1,0),(0,1). Plotting them:
    g r
    r g
    They can be divided vertically or horizontally. Declaring an odd number of points of each color and requiring that no two points be at the same spot may force a unique solution, but I'm not sure. Wow! This is a tough problem!
    Update: Turns out there are at least some graphs with an odd number of each color of node and where there are multiple correct answers.
    Example:
    Update: I have an n^2 algorithm (not implemented yet, but it works).
      For even numbers of points, there may be more than one correct answer
      Indeed. That's why the puzzle says returns a line, and not returns the line.
      Perl --((8:>*
Re: Puzzle: The Ham Cheese Sandwich cut.
by BrowserUk (Patriarch) on Nov 17, 2005 at 20:29 UTC

    Is this a "tree thing"?

    You insert the points into a (Red-Black?) tree and they effectively sort themselves into the two required groups either side of the median, which ends up as the root node?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I really doubt it because any point can have a median drawn through it. Remember, the median can go in any direction. For any point, you can find some angle to draw a line through that point that will separate all other points of that color into two equally-sized groups. So if it's a tree structure, it's not a standard one where divisions are made parallel to the axis of the graph.

        I've not yet convinced myself that this is soluble in the general case.

        Consider the 2D case where all the points in both groups have one coordinate in common and there are an odd number of points in each group, or, more generally, where all the points lie on a straight line at an arbitrary angle:

        +-----------+ +-----------+ +-----------+
        |     .     | |           | |.          |
        |     x     | |           | |  .        |
        |     x     | |           | |    .      |
        |     .     | |.xx . x .  | |      x    |
        |     .     | |           | |        x  |
        |     x     | |           | |          x|
        +-----------+ +-----------+ +-----------+

        Unless you consider that the line passing through all the points satisfies the criterion of having an equal number of each type of point on either side, i.e. none?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      (1b) I have an n*lg(n) solution for 2-D. Sorry, BrowserUK: I was SO wrong in saying no tree. My solution (which I have not coded) uses a PR quadtree. My solution is certainly not the only approach that will work.

      (1c) I have no idea (yet) how this might be brought up to O(n), nor do I have a clue (yet) how to prove it impossible.

      (2a) Also, while a PR quadtree can be used in 3D, my approach does not extend easily to 3D. More thought required.

      Question: For a puzzle, ought I post pseudocode/code on completion, or is it polite to instead leave the solution unposted, giving others the fun of solving it?

      Keep chuggin', guys! Nothing quite so rewarding as solving a tough puzzle!
Re: Puzzle: The Ham Cheese Sandwich cut.
by Anonymous Monk on Nov 17, 2005 at 19:02 UTC
    I assume we're supposed to take constant time comparison as an axiom?
    use Math::BigInt;
    use Benchmark qw( cmpthese );

    my $x = Math::BigInt->new(2) ** (2**16);
    my $y = $x - 1;
    my $m = Math::BigInt->new(2);
    my $n = $m + 1;

    cmpthese(-1, {
        large => sub { $y < $x },
        small => sub { $m < $n },
    });
      I assume we're supposed to take constant time comparison as an axiom?
      Yes.
      Perl --((8:>*
Re: Puzzle: The Ham Cheese Sandwich cut.
by robin (Chaplain) on Nov 21, 2005 at 19:08 UTC
    As ambrus said, even the warm-up problem is pretty damned hard. I too cheated by looking in Cormen, Leiserson and Rivest. Here is a Perl implementation of the linear-time algorithm they give.
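    # Median by sorting; only ever called on groups of at most five elements.
    sub naive_median {
        (sort { $a <=> $b } @_)[@_ / 2];
    }

    # Returns the element at 0-based rank $n in ascending order
    # (so, despite the name, $n == 0 gives the smallest element).
    sub nth_largest {
        my ($n, @a) = @_;
        die "You can't find the ${n}th-largest element of an "
            . @a . "-element array!"
            if $n > $#a || $n < 0;
        #warn "Looking for ${n}th element of (@a)\n";

        # Base case: small inputs are sorted directly.
        return (sort { $a <=> $b } @a)[$n] if @a <= 5;

        # Collect the median of each group of (up to) five elements.
        my @medians;
        for (my $i = 0; $i < @a; $i += 5) {
            push @medians, naive_median(@a[$i .. ($i + 4 > $#a ? $#a : $i + 4)]);
        }
        my $median = median(@medians);

        # Three-way partition around the median of medians, so that
        # copies of the pivot cannot cause endless recursion.
        my @smaller = grep { $_ < $median } @a;
        return nth_largest($n, @smaller) if $n < @smaller;

        my $equal = grep { $_ == $median } @a;
        return $median if $n < @smaller + $equal;

        my @larger = grep { $_ > $median } @a;
        return nth_largest($n - @smaller - $equal, @larger);
    }

    sub median {
        unshift @_, int(@_ / 2);
        goto &nth_largest;
    }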
    In practice it's pretty inefficient, and even proving that it runs in linear time is not entirely trivial!
      That goto is not helpful.

      Caution: Contents may have been coded under pressure.
        Hmm, that's interesting. But I bet the goto is faster if @_ has, say a million elements. You're saving an awful lot of copying. Update: I lost this bet :-)

        It also saves a fair amount of stack space.

      I suspect that a proof of the running time order will concentrate on the expected depth of recursion.

      However I believe it will be much harder to prove that the push is O(1) - indeed I suspect it is not - and without that the algorithm as a whole cannot be O(n).

      Hugo

        No, the proof doesn't need an expected running time. The running time T(N) is expressed as:
        T(N) = T(N/5) + T(7N/10 + 10) + Ο(N);
        which has T(N) = Ο(N) as a solution.
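        To see why, one can check the guess T(N) ≤ cN by substitution. This is a sketch that ignores floors and ceilings and assumes the Ο(N) term is at most aN; the constants a and c are names introduced here for illustration.

```latex
\begin{align*}
T(N) &\le \frac{cN}{5} + c\left(\frac{7N}{10} + 10\right) + aN \\
     &= cN - \left(\frac{cN}{10} - 10c - aN\right) \\
     &\le cN \quad \text{whenever } \frac{cN}{10} \ge 10c + aN .
\end{align*}
```

        With any c ≥ 20a the condition holds for all N ≥ 200; smaller inputs take constant time, which a large enough c absorbs.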
        However I believe it will be much harder to prove that the push is O(1) - indeed I suspect it is not - and without that the algorithm as a whole cannot be O(n).
        It doesn't have to be. What's needed is that the push has an amortized running time of Ο(1): that is, if we perform N pushes, the total running time is still bounded by Ο(N). And from what I understand of how allocation of array sizes works (an additional 20% of memory is claimed), a push has amortized Ο(1) performance. A single push may take Θ(N) running time, but over N pushes it averages out.
        Perl --((8:>*
Re: Puzzle: The Ham Cheese Sandwich cut.
by BrowserUk (Patriarch) on Nov 23, 2005 at 04:56 UTC

    If I calculate the median point of both datasets (using the minimised Euclidean distance method, 2D for now), I get two points, one for each set of colors. These rarely match up with any of the given points.

    If I project the line through those two points, it appears to divide the dataset as required. Is this the correct approach?


      I think you're suggesting taking four medians: the median of the X coordinates of the reds, the median of the Y coordinates of the reds, and the same for the greens. I don't understand what you might be doing with minimised Euclidean distance, though. Mind explaining? Then maybe I can tell you if it's on track with my approach (which is NOT yet O(n)).
        I think you're suggesting taking four medians: the median of the X coordinates of the reds, the median of the Y coordinates of the reds, and the same for the greens.

        No. The problem is defining (or understanding) what the median is for a 2D dataset (R2).

        Think of 3 points in the form of an equilateral triangle with the lower edge parallel to the X axis.

        +
        |         x
        |        . .
        |       .   .
        |      .     .
        |     .       .
        |   .           .
        |  x.............x
        +-------------------

        Whilst the top point is the median in the X axis (looking up). The bottom right point is the median if you are looking in from the top left. Equally it's the bottom left point, if you look in from top right. Which would be the "correct median" depends upon the relative positioning of the other set of three points; or more correctly, their median. And the above three points can be rotated through 0->120°, giving an infinite number of directions to view the dataset, (or transformations you could apply), in order to access the median.

        Which I think means that the warm-up problem is an almost complete red herring!

        You cannot work out which direction to look in (or which transformation of the coordinate system to apply) to determine the median for this dataset until you know the median of the other, and vice versa. So the 'sort and take the middle' or K'th-ordered-element approach that you would use for an R1 dataset does not carry over to an R2 dataset, nor to higher dimensions.

        That leads you, (led me?), to think about how to determine the median of a set of points in R2, without reference to the other dataset. And that's when I found the Euclidean distance method.

        The premise is that the median of an R2 dataset is that point at which the sum of the Euclidean distances between that point and the points in the dataset is minimised.

        There are other methods, including the point that minimises the sum of the areas of the sets of triangles formed between that point and pairs of points of the dataset, but that seems much harder to calculate.
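One common way to approximate the point that minimises the sum of Euclidean distances is Weiszfeld's iteration. The post does not say which method was used, so the following is only a sketch under that assumption; the name `geometric_median` and its parameters are made up here. It computes the point only and says nothing about whether the line through two such points is a valid cut.

```perl
use strict;
use warnings;

# Weiszfeld's iteration for the point minimising the sum of Euclidean
# distances to the given 2-d points. $points is a ref to a list of [x, y].
sub geometric_median {
    my ($points, $max_iter, $tol) = @_;
    $max_iter //= 10_000;
    $tol      //= 1e-12;

    # Start from the centroid.
    my ($x, $y) = (0, 0);
    for my $p (@$points) { $x += $p->[0]; $y += $p->[1] }
    $x /= @$points;
    $y /= @$points;

    for (1 .. $max_iter) {
        my ($wx, $wy, $w) = (0, 0, 0);
        for my $p (@$points) {
            my $d = sqrt(($p->[0] - $x) ** 2 + ($p->[1] - $y) ** 2);
            # Simplification: if the estimate lands on an input point,
            # stop there instead of dividing by zero.
            return ($x, $y) if $d < 1e-15;
            $wx += $p->[0] / $d;     # each point is weighted by 1/distance
            $wy += $p->[1] / $d;
            $w  += 1 / $d;
        }
        my ($nx, $ny) = ($wx / $w, $wy / $w);
        my $moved = abs($nx - $x) + abs($ny - $y);
        ($x, $y) = ($nx, $ny);
        last if $moved < $tol;       # converged
    }
    return ($x, $y);
}

my ($mx, $my) = geometric_median([[0, 0], [4, 0], [4, 2], [0, 4]]);
printf "%.4f %.4f\n", $mx, $my;
```

For four points in convex position the minimising point is where the diagonals cross, here (8/3, 4/3), which gives a quick sanity check on the result.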


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Puzzle: The Ham Cheese Sandwich cut.
by BrowserUk (Patriarch) on Nov 21, 2005 at 04:01 UTC
    Megiddo

      Indeed.
      Perl --((8:>*

        I don't think I'll be tackling the problem further. I just waded through the 1983 paper, and it'd take me a month of Sundays to translate it into something I could attempt to produce code from.

        Geez. Haven't these guys ever heard of 'worked examples', or that a picture paints a thousand words :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Puzzle: The Ham Cheese Sandwich cut.
by tphyahoo (Vicar) on Nov 21, 2005 at 07:53 UTC
    1, was the solution to this ever posted? It seems like in the exchange with ambrus there was a pretty strong hint to the solution for challenge 1, but as to everything else - huh?

    2, for future reference, a pretty unintimidating guide to sorting (just some class notes) is at a sorting

    I'm kind of reading up on that during breaks. I take it that sorting is at the core of the problem space here, but maybe I'm wrong on that. Anyway, thanks for posting an interesting problem. A solution would be nice though :)
