Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
Perl: the Markov chain saw
 
PerlMonks  

Multidimensional regular expressions

by diotalevi (Canon)
on Nov 17, 2002 at 15:55 UTC ( #213562=perlmeditation: print w/ replies, xml ) Need Help??

I asked in the chatterbox about multidimensional regular expressions and was given a referral to a clean white room with rubber walls. I'm hoping that by putting this into a proper node that the question will get wider exposure and maybe someone has already implemented this. I did a cpan/google search and didn't come up with anything that really matches. At its heart a regular expression is just a series of concise set of assertions joined by AND/OR junctions (with side effects). There's no particular reason that it doesn't exist already. All I can think of is that no one has needed it before. It'd be interesting to take a whack at the idea. (this is where you post a pointer to th e right CPAN module)

Consider first that something m/(?:this|that)/ is a basic one dimensional expression. An AoA is a two dimensional expression, AoAoA is three dimensional, etc. (just for kicks think about locally expanded dimensions or what it'd mean to mix hashes in). I'm not even sure how to best describe the grammar for this but then if someone else has already implemented it then I don't have to (I don't have to anyway but that's not the point). I'm initially thinking of something like \[{ ... } and \]{ ... } (reading the overload section of perlre leads me to think this syntax is ok). So before I get any further with this I'd like to know if this already exists somewhere or if it doesn't exist for a reason. I'm also not sure if there is a way to rebind the currently executing regular expression to another string. I can work around that by use of (??{}) but I'd rather just not resort to that hack.

This is very contrived example for how this might be used. So far most of the data I work with is distinguished by being in different fields which sort of removes the point to this technique. In general though - imagine you were going to do a pattern match against a multidimensional bitmap. I /think/ this has applications there. Or maybe not. It's an idea anyway and if it's just oddball I'm interested in hearing why.

Update 0: It occurs to me that you all might more mileage out of this if I explain my original inspiration. A few weeks ago John M. Dlugosz was talking about unifying substr, splice, shift, unshift and other array functions with the string functions. The problem is, once you start treating strings like arrays then people like me start wanting to treat arrays like strings which is why this even occured to me.

Update 1: I think the main problem with this is rebinding the running regex with another string. You can play tricks like (?(?{more expressions...})continue in this expression|(?#fail)(?!)) but that doesn't quite strike me as a good idea.

Update 2: Taking into consideration both merlyn and my response to princepawn I think the basis of this ought to be a metasequence like \[{dimension,direction} for switching into a different dimension (like a tangled ball of string) and \]{//xpath/expression} for the original idea of skipping around in the data. The first metasequence is probably the most conservative in that all it's doing is adding right angles to regex. The second metasequence is more interesting in that it would allow you to specify a location to jump to. Perhaps somewhat like setting pos() while in the middle of an expression

Update 3: I'd just like to note that the use of the sequences \[{...} and \]{...} is entirely arbitary and just based off of \N{...}. If you have a better syntax then please speak up.

@matches = [ "LISTOP", "OP", "COP", "BINOP", [ "LOOP", [ "OP", "UNOP", [ "OP", "UNOP", [ "SVOP" ], ], ], "UNOP", [ "LOGOP", [ "OP", "LISTOP", [ "COP", "LISTOP", [ "OP", "UNOP", [ "SVOP" ], ], "OP", "COP" ], ], ], ], ] =~ m[(SVOP\[{[-1,-1]}UNOP)]g; print Dumper(\@matches); $VAR => [# match 0 [ "UNOP", [ "SVOP" ] ], # Match 1 [ "UNOP", [ "SVOP" ] ] ]; # empty those end SVOP strings s[(?<=UNOP\[{[0,1],[1,1]})SVOP][];

Comment on Multidimensional regular expressions
Select or Download Code
Re: Multidimensional regular expressions
by diotalevi (Canon) on Nov 17, 2002 at 16:08 UTC

    I'd like to add that just before posting this I was starting with some ideas regarding Quantum::Superposition and the any(),all() junctions to express what might happen if the \[{} sequence left the engine at a non-terminal. I don't yet know how that additional functionality should be expressed or what it would mean.

    __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
•Re: Multidimensional regular expressions
by merlyn (Sage) on Nov 17, 2002 at 16:15 UTC
Re: Multidimensional regular expressions
by princepawn (Parson) on Nov 17, 2002 at 19:18 UTC
    imagine you were going to do a pattern match against a multidimensional bitmap
    Please comment on my Array::PatternMatcher... it seems somewhat up this alley.

    thanks!

    Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

      You're right it is. Where we differ is that your Array::PatternMatcher separates extractions from assertions and only borrows from the regex syntax. It's the sort of thing that we all know full well can be implemented by adding code. I'm looking at it sort of like exchanging m/^\d+$/ for sub is_numeric { for (split //) { return unless ord($_) >= ord('0') and ord($_) <= ord('9') } return 1 }. Or at least that's how it looks on the surface after reading your POD once.

      I'm just thinking it'd be a big win to be able to specify regular expressions in more than one dimension. Following on my original meditation I think perhaps I mislead myself. I think the proper multi-d regex is only going to have a new operator such that it tells the engine to go perpendicular and in which direction. A 2-d example would be an instruction to turn right at a certain point. 3-d might mean to start going up. I think the rest of the meaning makes the most sense when you think of multi-d as a tangled one-d space. A normal regex already goes one character/byte at a time, there's no reason to drop that (and that is what I originally proposed with the idea of skipping around). So maybe that means giving the \[{dimension,direction} sequence *just* a dimension and direction parameter.

      Initially I thought I'd need to rebind the regular expression at will to hop around to different array elements. I think instead... maybe when the engine isn't doing a normal 1-d regex that it'd need to run custom code that isn't regex at all to follow the data around. So this brings up the question - I've not seen a Regex tokenizer around (not that I've looked that hard either). Maybe with one of those an a controlling function you could jump around, test atoms and when possible use the normal regex engine.

      If this isn't clear then someone please let me know.

      __SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
        Initially I thought I'd need to rebind the regular expression at will to hop around to different array elements. I think instead... maybe when the engine isn't doing a normal 1-d regex that it'd need to run custom code that isn't regex at all to follow the data around. So this brings up the question - I've not seen a Regex tokenizer around (not that I've looked that hard either). Maybe with one of those an a controlling function you could jump around, test atoms and when possible use the normal regex engine
        Do you need DFA or NFA regexes to do this? You know that Perl regexes are regex directed not string directed dont' you? In fact, I am just saying something. I still dont know what in the world you want to do. I'm just grabbing at straws and brainstorming.

        Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Re: Multidimensional regular expressions
by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 20:55 UTC
    Earlier I mused that with Perl 6 "patterns", the uniformity between grammar rules and package subs could be even more unified, by allowing any subs to "fail" and "backtrack". Then you have a forward-chaining logic language like Prolog.

    You could, I beleive, code functions for your primitives that indeed access data in different parts of a data structure and return pass/fail, but then string them together in a Rule to provide for the backtracking.

    Second, consider expanding out any non-cyclic data structure into a long string, like what Dumper does. Armed with regex tools that include Ballanced Parens etc., this can be parsed with grammer rules, right? So where's the other dimentions, exactly?

Re: Multidimensional regular expressions
by MadraghRua (Vicar) on Nov 19, 2002 at 22:00 UTC
    It strikes me that something like this has already been encountered in bioinformatics. Try out Algorithms on Strings, Trees and Sequences by Dan Gusfield or Computational Molecular Biology by Pavel Pezner.

    So in bioinforamtics, an individual sequence can be considered an array. Comparing a pair of sequences can be like comparing an array with an array. The dimensionality increases with the number of sequences that you want to compare. Typically a two dimesional comparrison is carried out n-1 times on the data set to perform an initial comparison, resulting in a statistical score. You then pop the initial query sequence from the data set and carry through the comparrison with the remaining sequences until you have only one left in the set. The statistical score is used to sort the results in terms of relatedness.

    This might then be represented as a tree of sequences with branches and proximity indicating closeness of similarity, or a multiple sequence alignment where the distance of two sequences from each other in the alignment indicates their degree of similarity. You might look into a program called ClustalW for some examples of how this is done.

    I hope this adds some fuel to your fire.

    MadraghRua
    yet another biologist hacking perl....

Re: Multidimensional regular expressions
by QM (Vicar) on Sep 20, 2005 at 20:56 UTC
    Were you thinking of something along the lines of Befunge, but for regexen?

    Merging a Befunge-like syntax with Perl regex syntax would be amazing. Maybe some interesting alternations such as

    \d{?[^>]}
    to look north and east from the "current" position for a digit.

    Using NFA ideas, split the "current" position, and check more than one path. For example, look for palindromes in 1D strings:

    m/( # capture whole match (.)? # 0 or 1 middle char ( # capture matched pairs . # 1 char {?\2[<]} # ...left of \2 {?[eq]} # string equal {?\2[>]} # ...right of \2 )* # zero or more times ) # close capture /x;
    except that it needs some way to refer to the growing match at each point.

    Solving those path problems, like "you can only move to postions which are red on odd moves, and blue or green on even moves", would be easier.

    This would need to capture not only the characters matching the regex, but the path as well, since it can no longer be assumed that the chars are in a straight line. Also, can characters already matched be matched again?

    This fundamentally depends on what it means to have a multideminsional string. Is a 2D string an array of strings taken in order from $x[0] to $x[$#x]? Is a 3D string an AoA of strings? Or is a 2D string just a normal string with more than one newline? (Then what would a 3D string look like?)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Multidimensional regular expressions
by Anonymous Monk on Sep 20, 2005 at 21:27 UTC
    Here's an attempt a better syntax...
    #set up our 2D array... $target[0] = "Do What I Mean Please."; $target[1] = "Red Fish, Blue Fish."; $target[2] = "One, Two, Three, Four."; $target[3] = "Alpha Bravo Charlie Delta"; @target =~ s/(ish)/ / /(Two)/ uc $a /e; # now @target is... # $target[0] = "Do What I Mean Please."; # $target[1] = "Red FISH, Blue Fish."; # $target[2] = "One, TWO, Three, Four."; # $target[3] = "Alpha Bravo Charlie Delta"; @target =~ s /ea/ZZ/ /lu/ZZ/ /hr/ZZ/; # now @target is... # $target[0] = "Do What I MZZn Please."; # $target[1] = "Red FISH, BZZe Fish."; # $target[2] = "One, TWO, TZZee, Four."; # $target[3] = "Alpha Bravo Charlie Delta";
Re: Multidimensional regular expressions
by eric256 (Parson) on Sep 20, 2005 at 22:46 UTC

    Quite an old node....but you linked to it so i'm gonna ask anyway.

    What in the world do you mean by multi-dimensional regex? Can you give some example of this in use? Your bitmap mention is interesting if you are thinking of storing an image as AoA and then running a regex agianst it. I can't even imagine the logic, but being able to build pattern recognition like that could be something out of this world. ;) Or it might be a heck of a lot hard to program than a trusty old NN or something like that. Dunno, still trying to wrap my head this.


    ___________
    Eric Hodges

      Imagine a string that a string is a 1d list. Now imagine whata 2d list looks like (just an AoString, I suppose). Now imagine that a regular expression could be written against that 2d string. Its a simple extension of things you're already familiar with.

        I have sometimes thought about generic regular expressions that could work on any sequence, not just a sequence of characters (useful for writing correlation rules, etc.). Today it also struck me that it should be possible to apply the same concepts to more than one dimension. I started to address it as a toy project, began thinking through what it would really mean, and decided it was getting a bit deep, to say the least. So I thought I might use google to find what else is out there. This little short discussion was the top hit. And though I'm a perl fan, I wasn't even thinking about or looking for anything related to perl.

        Syntax is a completely separate problem to solve. You can address the multi-dim regex problem by thinking in terms of the fundamental components of a regular expression and how it functions with those components. So, for example, you have stuff like these, which could be generalized to no longer involve characters:

        • Atoms (take up positions in the sequence, make up the lowest-level of rudimentary comparison -- recursion stops here)
        • Groups (both capturing and non-capturing, the point is to allow all the regex-matching features to be divided up into a heirarchy of pieces)
        • Quantifiers (applied to atoms or groups -- and these are usually as simple as a range from minimum-required to maximum-needed)
        • Assertions (typically are considered as tied to positions between atoms, or on the ends of a sequence, and the assertions do not advance position during processing except for their own temporary, internal matching evaluation)
        • Etc.

        So you would still have all of that in generalized regular expressions, and in multi-dimensional ones as well.

        The dimensionality aspect really does have to do with the directionality of advancement of the matching position within the target sequence. I had thought of this independently before coming across this very old posting, and I think it really is a key insight. For any particular search space (regardless of the type of the atoms), you could build a regular expression that includes a new type of component, which is a direction change. It serves as both an assertion (that it is possible at that position to move in the indicated direction), and as a control for the matching engine to advance position differently.

        Direction changes need not limited to right-angles in an array (of any dimensionality). Instead, a direction change could apply to any possible directed relationship among atoms. Consider a tree structure, for example. What direction is the default direction? Perhaps there isn't one. Maybe a "direction" in this case is really an entire set of possible relationships, each of which might need to be tested, such as moving to a child node. Some of that sort of multi-dimensional relationship path has already been addressed in the world of XPath expressions, which fundamentally deal with a tree structure.

        I should be able to have a search space that defines the fundamental operations: comparing two atoms, evaluating an assertion at a given position, enumerating all possible directions of a given type at a given position, and advancing in some specific direction at a given position (remember that the possibility should be tested first as an assertion, but with multiple possibilities, one need only have a non-empty set). All of this being defined, the regular expression could be built piece by piece using the same structure we use today for character strings, using function calls on a builder object if no syntax is invented to make it simpler. The point of all this is that, once you've built all that, the regular expression matching engine should still work pretty much the exact same way that string-based regex matching does. It's all the same thing.

        This is what I think "multi-dimensional regular expressions" are. Now, if anybody is still out there seeing this ancient thread, does it exist anywhere? :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://213562]
Approved by broquaint
Front-paged by hsmyers
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (8)
As of 2014-04-19 22:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (485 votes), past polls