|Problems? Is your data what you think it is?|
decomposing binary matrices (2)by hv (Parson)
|on Feb 27, 2007 at 18:20 UTC||Need Help??|
hv has asked for the
wisdom of the Perl Monks concerning the following question:
Update: I now have a working algorithm in Re^2: decomposing binary matrices (2) below; the rest is a SMOO.
After getting a variety of useful leads in the discussion at decomposing binary matrices, I've thought about this a bit more. I think I can simplify the problem a bit by splitting into two parts - first cleaning the matrix up to remove all the useless '1'-bits, and then (if that still seems useful) to split the matrix into discrete submatrices.
To recap and amplify the problem: I have a matrix of booleans in which rows represent variables and columns represent values, in which a value of TRUE (1) means "this variable may take this value", and FALSE (0) means "this variable cannot take this value". (Note: the rows/columns were transposed in my original post.) We have the additional constraint of uniqueness - the value of each variable is distinct from that of every other value.
The task is a) "cleanup" - to gain additional information from the uniqueness constraint, by identifying TRUE values for variable/value combinations that cannot form a part of any solution, and setting them to FALSE; b) "splitting" - to simplify the matrix (if possible) by splitting it into two or more disjoint submatrices.
Both of these examples represent the same gleaned information: due to the uniqueness constraint, A and B must between them consume the values 1 and 2, so C and D are constrained between them to the values 3, 4 and 5.
In my original formulation I reversed the order of these tasks - I hoped to find an easy way to split the matrix, and have the additional information to be gained from the uniqueness constraint "fall out" as a result. I suspect now that splitting is a hard problem, and may be made easier by doing the cleanup first.
Here is some pseudocode for a recursive procedure to solve the cleanup problem:
The idea here is, for each TRUE value in the original matrix, see if there is any consistent solution that assigns a unique value to every variable; if not, mark it as FALSE instead. Short-circuit some of the work by recording every assignment used whenever a consistent solution is found: none of those needs to be retried by the top level.
I also plan an additional optimisation (not shown) such that if in try_solution() we pick a row that has no TRUE values in the available columns, we signal that fact to the caller - assuming that the caller is also try_solution(), it can then switch to trying that row instead (or propagate further up the call chain).
The biggest savings seem to come from getting a value set in m2 before a point is ever tried in clean(), and there are a couple of thigs I can do to try and maximise the number of times that happens - maybe by selecting a random row in try_solution(), or by preferring not-yet-seen columns, or by seeding it with a batch of random solutions.
There may also be benefits to be gained from an initial transformation of the matrix, to order both the rows and the columns in increasing order of the number of TRUE values.
As to how to implement this, I'm not sure. I suspect PDL might be a good route to go, but I've never used it, and am not sure where to start with it; I also can't see that is has support for a datatype any smaller than "byte", so it may not be appropriate to use in this case. Failing that, I think I might be able to achieve reasonable efficiency encoding the matrix as a list of bit vectors, and making the 'plist' a structure that caches the list of rows and the list of columns used so far.
Even so, the combinatorial explosion may make the whole thing senseless - my initial target to try this out on (after testing) is a 26 x 300 matrix, with a potential number of solutions around comb(300, 26) =~= 2E37. I also hope therefore to identify characteristics of the starting matrix that mean no gain is possible, so that I can skip the effort entirely in the worst cases.
If I can get this working at all usefully, I'll look at the splitting problem again.
Any suggestions for implementation or a better algorithm would be gratefully received.