Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Reverse regexp a regexp?

by freakingwildchild (Scribe)
on Feb 12, 2010 at 16:51 UTC ( [id://822894]=perlquestion: print w/replies, xml ) Need Help??

freakingwildchild has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monastery,

I guess my prayers need some answering in a quite diabolical matter ...

(1) To translate a regexp back to a readable value

Is there any "reverse regex" module, code or information available which explains how to get a regexp back to a fully conditioned string in Perl?
Like the regexp /^http:\/\/(?:www.)?youtube.com\/user\/(.*)\/$// could get translated back to an original string like http://www.youtube.com/user/username ?

(2) Translate a set of data to a regexp

To stay with regexes; is there way to create a regexp of different inputs when they look similar ? Like a data set of 10 url's, create a most compatible regexp from this?
This way regexp could be dynamic usable with a set of data; not bothering the user at all to search for the best options...

Hoping to find enlightenment,
FWc.

Replies are listed 'Best First'.
Re: Reverse regexp a regexp?
by blokhead (Monsignor) on Feb 12, 2010 at 17:10 UTC
    Is there any "reverse regex" module, code or information available which explains how to get a regexp back to a fully conditioned string in Perl? Like the regexp /^http:\/\/(?:www.)?youtube.com\/user\/(.*)\/$// could get translated back to an original string like http://www.youtube.com/user/username ?
    This is possible by converting a regex into some sort of finite automaton and doing a traversal on it. At one point I had been working on a suite of modules that would let one manipulate regular expressions & automata in this way, with these kinds of things as a goal (tinkering with perl regexes). Unfortunately, the development has ground to a halt, and we never got quite that far. I know that other programming languages have more developed tools for this kind of stuff, specifically JFLAP for java. It might be worth checking out.

    The regex you give is anchored with ^ and $, so that makes the problem easier -- this is the standard interpretation of "classical" (not Perl) regular expressions, so the techniques from automata theory are more readily applicable. And if all the regexes you care about are as simple as the one you gave, it should be easy to roll your own solution. You'll have to parse the regex, but once you have the regex structure in a recursive data structure, it should be straight-forward. For example, you can replace any (...)? or (....)* with nothing. You can replace a "." with, say, character "x". You can replace any ( ... | .... ) by recursing on just one of the alternations, etc etc..

    To stay with regexes; is there way to create a regexp of different inputs when they look similar ? Like a data set of 10 url's, create a most compatible regexp from this? This way regexp could be dynamic usable with a set of data; not bothering the user at all to search for the best options...
    This problem is a bit underspecified. For example, if you give me any sample of, say, 10 URLs, the regex /./ would match them all, but I doubt that's what you have in mind. I think it may be a challenge to be more precise about what you mean by "most compatible regex". And I think any non-trivial way of defining this problem will result in a very computationally difficult task.

    Here is one way of formalizing the problem that seems natural, but is actually an exceedingly difficult computational problem: given a sample of some URLs that the regex should match, and some others that the regex should not match, find the smallest regex that is consistent. Even getting a good heuristic is NP-complete (minimum consistent regular expression problem).

    blokhead

      Here is one way of formalizing the problem that seems natural, but is actually an exceedingly difficult computational problem: given a sample of some URLs that the regex should match, and some others that the regex should not match, find the smallest regex that is consistent. Even getting a good heuristic is NP-complete (minimum consistent regular expression problem).

      I just finished doing something exactly like this for work. I had 679 strings in a total of 51 groups. I had to write 51 regular expressions to match members of the groups without including members of other groups. I spent about 5 minutes searching on google for someone else's solution before I buckled down and wrote a quick script to help me find them. The guts are here:

      my %data; while (<DATA>) { chomp; my ($v, $k) = split /,/; $data{$k} = $v; } for my $inst ( sort keys %data ) { for my $reg ( sort keys %re ) { if( $data{$inst} eq $reg ) { print "Should match but doesn't: $inst, $data{$inst}, $reg, $re +{$reg}\n" unless $inst =~ $re{ $reg}; } else { print "Is matching but shouldn't: $inst, $data{$inst}, $reg, $re +{$reg}\n" if $inst =~ $re{$reg }; } } } __DATA___ . . .

      %re was a hash containing the group names mapped to regexes. I would run the program in one window, and make corrections in the other. From start to finish took less than an hour.

      This is possible by converting a regex into some sort of finite automaton and doing a traversal on it.

      Perl would be my prefered choice in this though. The use of such would be great in creating examples in how regexes get processed.

      This problem is a bit underspecified. For example, if you give me any sample of, say, 10 URLs

      http://www.youtube.com/user/test1 http://www.youtube.com/user/test2 http://www.youtube.com/user/test3 http://www.youtube.com/user/test4 http://www.youtube.com/user/test5:
      regex created: http://www.youtube.com/user/(.*)

      http://be.apple.itunes.com/blabla/blabla/id/1543285432 http://us.apple.itunes.com/blabla/blabla/id/154328545 http://uk.apple.itunes.com/blabla/blabla/id/5532424342 http://be.apple.itunes.com/blabla/blabla/id/15432 http://be.apple.itunes.com/blabla/id/4324
      regex created: http://(\d)+\.apple.itunes.com/blabla/blabla/id/(.*)

        Reread what Roboticus said. Your first example could just as easily have been http://www.youtube.com/user/test[1-5] and your second could have been http://(u[sk]|be)\.apple.itunes.com/blabla/blabla/id/(\d*)5(\d+).

        How do you make the choices? Once you can answer those questions, you'll be able to make some progress towards what you want.

Re: Reverse regexp a regexp?
by roboticus (Chancellor) on Feb 12, 2010 at 17:01 UTC

    freakingwildchild:

    In general, a regular expression can match an infinite number of different strings, so you'll have to figure out how to constrain the set to something manageable. As a degenerate case, consider the regex /.*/: what are you planning on going backwards to? How do you make the choices? Once you can answer those questions, you'll be able to make some progress towards what you want.

    ...roboticus

      The other degenerate case is /\Qstring1\E|\Qstring2\E|...|\QstringN\E/.
      There are a few examples given in my further posts; most URL structures are SEO friendly, offering an easy way to match differences between them.
      Those subtle differences don't need big code to detect; rather strict rules before the routine is satisfied with it's own results and creativity/knowledge where to start.

      I've checked a few approaches including but not limited to heuristics, fuzzy matching, Regexp::Assemble etc.. but found no real good results yet.

      Currently my best approach is keeping a hash with known sites and their structure, but like to get that hash automatic instead of by hand because the Internet is rather big ;)

Re: Reverse regexp a regexp?
by planetscape (Chancellor) on Feb 12, 2010 at 23:47 UTC
Re: Reverse regexp a regexp?
by CountZero (Bishop) on Feb 13, 2010 at 12:43 UTC
    Higher Order Perl has the solution to (1) in Chapter 6.

    6.5 Regex String Generation

    Here’s a question that comes up fairly often on IRC and in Perl-related newsgroups: Given a regex, how can one generate a list of all the strings matched by the regex? The problem can be rather difficult to solve. But the solution with streams is straightforward and compact.

    Update: added clarification

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Reverse regexp a regexp?
by rubasov (Friar) on Feb 13, 2010 at 00:50 UTC
    I am also not sure what you exactly want to achieve by your second question, however try this:
    #! /usr/bin/perl use strict; use warnings; use Regexp::Assemble; my @urls = qw( http://www.youtube.com/user/test1 http://www.youtube.com/user/test2 http://www.youtube.com/user/test3 http://www.youtube.com/user/test4 http://www.youtube.com/user/test5 ); my $r = Regexp::Assemble->new; for (@urls) { $r->add( quotemeta ); } print $r->re, $/; __END__
    output: (?-xism:http:\/\/www\.youtube\.com\/user\/test[12345])

    And you may also find valuable the String-Approx distribution from CPAN which implements approximate string matching based on Levenshtein distance.

    Hope this helps.
      Very interesting approach of assembling regexes; although this would not generate the results required when inputting real life data like:

      my @urls = qw( http://www.youtube.com/user/jantje http://www.youtube.com/user/pieter http://www.youtube.com/user/frida http://youtube.com/user/channel http://youtube.com/channel/doesntexist http://youtube.com/user/blah http://www.youtube.com/user/bianca_sterlings http://www.youtube.com/user/blango );

      Which returns:

      (?-xism:http:\/\/(?:www\.youtube\.com\/user\/(?:b(?:ianca_sterlings|la +ngo)|jantje|pieter|frida)|youtube\.com\/(?:user\/(?:channel|blah)|cha +nnel\/doesntexist)))

      while it should return roughly: ^http:\/\/(?:www\.)youtube\.com\/user\/(.*)$

      What this module offers is a rather customized approach of how to create regexes on custom user data, for validation; while I'd like to create a regex on public data to extract the right set of data out of a list of url's. Thanks for letting me know this this one; it's a nice toy to play with :D

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://822894]
Approved by johngg
Front-paged by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2024-04-23 21:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found