PerlMonks  

Dependencies, or, How Common is Regexp::Common?

by legLess (Hermit)
on Sep 17, 2003 at 03:10 UTC ( [id://292033]=perlquestion )

legLess has asked for the wisdom of the Perl Monks concerning the following question:

Monks ~

I'm building a module, one of whose jobs is parsing input. It's been pointed out to me that input parsing has been done before :) and I'm looking for a way to be more constructively lazy about it. At the same time, I want to avoid a disease that has plagued me lately: multiple nested dependencies.

Perl folks I know here in Portland dot the spectrum from "Never use dependencies other than what comes with the default install," to "Who cares if it's a 3-year-old .01?" My own position on that scale varies depending on how obscure the dependency is, how much work it saves me, how much value it adds, and other such things. I've seen other discussions of this issue here, so feel free to skip that part of my question if you're bored with it.

Right now the module depends on CGI and nothing else; the module I'm thinking of using is Damian's (now Abigail's) Regexp::Common.

So my questions:

  1. How common is Regexp::Common? PerlMonks' own host doesn't have it installed by default; Red Hat and Debian don't have packages for it. I don't know about other sites. (see NOTE)
  2. I have an urge to roll this myself: am I in a state of sin?

Here's a simplified snip of the code I have now:

    sub valex {
        my $self = shift;
        $_ = shift;
        /^integer$/ and return qr/\d+/;
        /^word$/ and return qr/\w+/;
        # etc.
        return undef;
    }

Using Regexp::Common would replace the qr//s above. The benefits are a richer and more robust set of regexen: Damian and Abigail have wicked Perl-fu. The drawbacks are added complexity and a dependency of perhaps questionable value (for this application: notice I have no heartache using CGI :).

(NOTE) I don't care how easy or difficult it is for me to install dependencies, but one of the design goals is to avoid interface complexity in common situations. I very much want users of the module to be able to specify plain words as validators: 'integer', 'email', etc. So I'd be wrapping the common regexen with the valex sub anyway, and only using Regexp::Common internally. I have confidence (and so far my test suite agrees with me) that I can write or find these myself.
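One way to keep plain-word validators while leaving the source of the patterns swappable is a lookup table. This is only a sketch: the `%VALIDATOR` table, the `valex_for` name, and the patterns themselves are assumptions for illustration, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical table mapping plain-word validator names to compiled patterns.
# The patterns are placeholders; each could later be swapped for a
# Regexp::Common pattern without changing what callers pass in.
my %VALIDATOR = (
    integer => qr/[-+]?\d+/,
    word    => qr/\w+/,
);

# Return the compiled regex for a validator name, or undef if it is unknown.
sub valex_for {
    my ($name) = @_;
    return undef unless defined $name;
    return $VALIDATOR{$name};
}

print defined(valex_for('integer')) ? "integer: known\n" : "integer: unknown\n";
print defined(valex_for('email'))   ? "email: known\n"   : "email: unknown\n";
```

Because callers only ever see the names, whether a given entry is hand-rolled or borrowed stays an internal decision.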

Thanks for your time and replies.

Replies are listed 'Best First'.
Re: Dependencies, or, How Common is Regexp::Common?
by Abigail-II (Bishop) on Sep 17, 2003 at 13:06 UTC

    "Never use dependencies other than what comes with the default install,"

    The drawbacks are added complexity and a dependency of perhaps questionable value (for this application: notice I have no heartache using CGI :).

    I think such attitudes totally miss the point of Open Source in general, and CPAN in particular. What is the point of sharing software, if people balk at the slightest inconvenience and don't want to use what's available? Does everything have to be delivered to your doorstep?

    It seems that certain people would prefer that Perl come with everything that's available on CPAN, and then some. Besides, that would make it impossible to ever release a version of Perl again (just look at how hard it is to release 5.8.1, which is partially due to the bloat and to wanting to service everyone).

    It's far better for packages to live on CPAN. Then at least there is the potential that they will be updated soon after a bug is revealed. Suppose Regexp::Common came with 5.8.0, and it had a bug. The earliest release that would fix the bug would be 5.8.1, which, if it came out today, would be 14 months after 5.8.0. And if you have a hard time convincing people to install a module, think how hard it's going to be to convince them to install a new version of Perl! And if there were a bug in Regexp::Common released with 5.8.1, do you have any idea how long you would have to wait for a new release? The track to 5.10 was started in July 2002. It's now September 2003, and there isn't even any sign of a 5.9.0. You might have to wait *years* for a bugfix.

    Having said that, Regexp::Common is easy to install. It's a pure Perl module, and I have no intention of ever turning it into something that isn't pure Perl. All you need to do is (recursively) copy the files in the 'lib' directory of the distribution. How hard can that be? But even if you don't want to install Regexp::Common, there is always the option to copy the code. Of course, your own license may prevent that, and you do have to do more work in case the code you copied gets upgraded, but the license of Regexp::Common allows you to go this way.

    Abigail

      I wonder about Regexp::Common. I often have tasks that are regular-expression related, and then I look at what the module offers and usually it doesn't have what I need.

      By "what I need" I mean two things: (a) it lacks a certain common regular expression, or (b) it lacks certain tasks related to regular expressions.

      By (a), what I mean is that sometimes a regular expression is common, but not in that distro. For example, I was told to write something to make sure an address was valid. So, I simply made sure that the string had a number and a letter in it... and it did get a little bit of filtering done. Is there a better solution? Aren't many people having to validate addresses? How are you doing it? Also, I am not sure how open Abigail-II is to new additions to the module and I am not sure if I should use rt.cpan.org or email him. He is certainly very present here, so I could msg him. But also by (a) what I mean is that Abigail and Damian are both non-American, and so their profanity regular expressions were way off the mark. I had never even heard of some of the terms they thought were bad, and others are completely normal in an American context (e.g., "bl**dy"). So I coded Regexp::US::Profanity to do filtering with

      Regarding (b), the regexp to count the number of a certain character in a string is very simple, and the task to count is also simple, but neither was readily available in the distro. And again, I was afraid to contact the author about it, so I just whipped up some lines of code to do it.

      Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality.

        For example, I was told to write something to make sure an address was valid. So, I simply made sure that the string had a number and a letter in it... and it did get a little bit of filtering done. Is there a better solution? Aren't many people having to validate addresses? How are you doing it?

        Personally, I don't think such a thing belongs in Regexp::Common, because there are no clear rules on what is a valid address. You could make some heuristics, but they will give many false positives, and false negatives. And the heuristics will differ from country to country.

        Also, I am not sure how open Abigail-II is to new additions to the module
        The PODs have always suggested there are not enough regexes and have asked for people to send them. In the year and a half that I've taken care of this module, I haven't had enough regexes sent in to need a second hand to count them.

        As for contacting me, email is preferred (regexp-common@abigail.nl). I don't do the chatterbox, so don't waste your time messaging me.

        As for the profanity regex, that's entirely Damian's work, including the nifty encoding. Had it not been there when I started maintaining the module, I would not have added it. The problem I have with it is that it's so subjective. Who am I to decide what's profanity, and what isn't? You can never be complete on this one, and where do you stop?

        Regarding (b), the regexp to count the number of a certain character in a string is very simple, and the task to count is also simple, but neither was readily available in the distro.
        The regexp is simple? You'd have to write something like (assuming you want to count the occurrences of the character c):
        /^(?{$count = 0})[^c]*(?:c(?{$count ++})[^c]*)*/
        which I don't think is simple. I wouldn't use a regex for that, I'd use
        tr/c/c/
        and if you want to count the number of non-overlapping matches of a pattern, I'd use:
        $count = () = /$pat/g;
        To catch that inside a single regex is really awkward. Remember that Regexp::Common gives you patterns that can be interpolated in a regexp. For instance, if you want to count the number of HTTP URIs in a string, Regexp::Common doesn't give you a function to do that directly, but it does do the hard work for you: it gives you the pattern.
        $count = () = $str =~ /$RE{URI}{HTTP}/g;
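The two counting idioms above can be tried in isolation. This is a small sketch using only core Perl; the sample string and the `ca\w` pattern are made up for illustration.

```perl
use strict;
use warnings;

my $str = "a cat can catch a cricket";

# tr/// returns the number of characters it matched; replacing c with c
# leaves the string unchanged, so this simply counts occurrences of 'c'.
my $chars = ($str =~ tr/c/c/);

# A /g match in list context returns every match; assigning that list to
# the empty list () and then to a scalar yields the number of matches.
my $pat   = qr/ca\w/;
my $words = () = $str =~ /$pat/g;

print "'c' occurs $chars times; /ca\\w/ matched $words times\n";
```

Note that the list-assignment idiom depends on the /g modifier: without it, a match with no capture groups returns a single true value in list context, so the count would never exceed one.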

        Patches are more than welcome, or even suggestions what to include.

        The next version of Regexp::Common is planned to be released shortly after 5.8.1 comes out. The major addition will be ISBN numbers, checking against the latest country/publisher lists.

        Abigail

Re: Dependencies, or, How Common is Regexp::Common?
by Zaxo (Archbishop) on Sep 17, 2003 at 03:50 UTC

    I'll vote with requiring it. It is easy enough to install on any platform supporting CPAN. If you package CPAN-style with ExtUtils::MakeMaker, you can provide a make rule to go get the distribution, or else say,
        PREREQ_PM => { 'Regexp::Common' => '2.00' },
    in the attribute list of WriteMakefile() in Makefile.PL.
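For context, a complete Makefile.PL carrying that prerequisite might look like the sketch below. The distribution name and version are placeholders, not anything from the original post; only `WriteMakefile` and its `NAME`, `VERSION`, and `PREREQ_PM` attributes come from ExtUtils::MakeMaker itself.

```perl
# Makefile.PL -- a sketch; NAME and VERSION are placeholder values.
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME      => 'My::Module',      # hypothetical distribution name
    VERSION   => '0.01',            # placeholder version
    PREREQ_PM => {
        'Regexp::Common' => '2.00', # minimum version suggested in the post
    },
);
```

With the prerequisite declared this way, the CPAN shell can fetch Regexp::Common on the user's behalf rather than the author bundling a copy.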

    If you don't expect your users to have system privileges, you can make the installation go into some private library and pepper the code with use lib '/my/private/lib';.

    There are, of course, many admins who will refuse all modules beyond core, and even delete as many core ones as they can. There are others who will not install anything their vendor's package mechanism doesn't provide. Them aside, a well-stocked Perl installation will have a good chance of having it.

    After Compline,
    Zaxo

Re: Dependencies, or, How Common is Regexp::Common?
by bart (Canon) on Sep 17, 2003 at 03:57 UTC
    How reliable do you want it to be? Because right now, in this simple case, you have some major errors. For example, with this snippet:
    /^integer$/ and return qr/\d+/;
    your routine would validate "foo123abc" as a valid value for an integer.

    There's more to it than just being too lazy to install a module, you know. On the other side of the spectrum, there's the laziness of being pretty sure your module will do what you want it to do. A lot of work has been going into construction of these modules. At least, borrow some of that work, copying some of the code into your scripts, instead of reinventing a likely majorly flawed wheel yourself.

    You may think that you have just made a minor error, and that this won't happen to you again. Think twice. These kinds of errors are all too common.
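The difference bart is pointing at shows up when a pattern is used directly as a yes/no test. A minimal sketch, assuming the pattern is applied with a plain match and nothing else (the variable names are illustrative):

```perl
use strict;
use warnings;

my $loose  = qr/\d+/;        # matches digits anywhere in the string
my $strict = qr/\A\d+\z/;    # the whole string must be digits

for my $input ('123', 'foo123abc', "123\n") {
    my $shown = $input;
    $shown =~ s/\n/\\n/g;    # make a trailing newline visible when printing
    printf "%-12s loose=%s strict=%s\n",
        "'$shown'",
        ($input =~ $loose  ? 'yes' : 'no'),
        ($input =~ $strict ? 'yes' : 'no');
}
```

The sketch uses `\A` and `\z` rather than `^` and `$` because `$` also matches just before a trailing newline, which is rarely what a validator wants.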

      We had a good one like that a while back. Essentially the code was like this (used to check client ID numbers)

      sub is_integer {
          return 0 unless $_[0];
          return $_[0] =~ m/^\d$/ ? 1 : 0;
      }

      In development this routine was never required to deal with a TWO digit integer, as all the developers used accounts with a client ID number below 10. Oh, and the test code... the guy that wrote it tested all these args: undef, '', 'I am not an integer 42!', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Why ten tests for single-digit integers and no tests for 16, 256, 65535? GOK. Just goes to show that the volume of tests is not the most important thing. Testing all the possible cases is. Even most of the probable cases would have been fine in this case.

      So you can guess what happens. During a live demo client 10 gets created, but client ID 10 is not an integer according to the sub. End of that demo. Much egg on developer and manager faces. And the bug (besides the inadequate test suite)? A single missing +.
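A sketch of what a corrected check and a more boundary-minded set of cases might look like. It assumes "integer" means one or more ASCII digits and that 0 should count as a valid ID; neither assumption is stated in the story above.

```perl
use strict;
use warnings;

# Sketch of a corrected check. Note the switch to "defined": the original
# "return 0 unless $_[0]" also rejects 0, because 0 is false in Perl.
sub is_integer {
    my ($value) = @_;
    return 0 unless defined $value;
    return $value =~ /\A[0-9]+\z/ ? 1 : 0;
}

# Cases chosen by boundary rather than by volume: empty, single digit,
# first two-digit value, larger values, and a non-numeric string.
for my $case ('', 0, 9, 10, 16, 256, 65535, 'I am not an integer 42!') {
    printf "%-26s => %d\n", "'$case'", is_integer($case);
}
```

One two-digit case in the original suite would have caught the missing + long before the demo.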

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      The error is in your assumption of how the code is used. The routine I gave doesn't do any validation at all, it just returns a quoted regex which is used elsewhere in the code to properly untaint incoming data:

      # $value set to incoming parameter earlier
      # $param object initialized earlier
      my $valex = $param->valex;
      $value =~ /($valex)/;
      $param->errors( "Parameter '" . $param->name . "' contained invalid data." )
          unless(( defined $1 ) and ( $value eq $1 ));
      $param->value( $1 );

      Just for the heck of it I added another test, passing a parameter called 'bart' with a value of 'foo123abc'. I defined 'bart' as an integer and watched this test pass:

      is( $valop->value( 'bart' ), 123, 'foo123abc validated correctly' );
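The capture-and-compare approach can be isolated into a standalone function. This is a sketch only: `extract_and_check` is a hypothetical helper, not part of the module being discussed, and it returns a flag where the module records an error on the `$param` object.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the first substring matching $valex and
# report whether the input consisted of nothing else.
sub extract_and_check {
    my ($value, $valex) = @_;
    my ($clean) = $value =~ /($valex)/;    # undef if nothing matched
    my $ok = (defined($clean) && $clean eq $value) ? 1 : 0;
    return ($clean, $ok);
}

my ($clean, $ok) = extract_and_check('foo123abc', qr/\d+/);
print "clean=$clean ok=$ok\n";
```

Capturing into a lexical in list context, rather than reading `$1` afterwards, avoids picking up a stale `$1` from an earlier successful match when the current one fails.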

      Thanks for the reply, though.

      UPDATE: I should note that one of the simplifications I made (not expecting the Spanish Inquisition, as they say) to the code in my original post was changing this line in the module:

      $_ = shift || $self->validator;

      to:

      $_ = shift;

      In the module, &valex serves a dual purpose which I didn't think pertinent to the question I was asking.

A point about the regexes that Regexp::Common doesn't supply
by TheDamian (Vicar) on Sep 17, 2003 at 18:51 UTC

    Regexp::Common was originally conceived as a framework; one that would allow the Perl community to share and reuse commonly needed (and frequently poorly implemented) regular expressions.

    And, indeed, it has been successful in that sense. My original version of the module had very few regexes. Others (principally Abigail-II) have contributed most of the current set it offers.

    But because Regexp::Common is contribution-driven, it's entirely possible the module doesn't have the regex you need. Or that you could improve on one of the regexes it does offer (e.g. $RE{profanity}).

    In that case, I would strongly encourage you not just to roll your own, but to integrate it with Regexp::Common (I tried to make that trivial to do). Then send Abigail-II the code, so that everyone can benefit.

Re: Dependencies, or, How Common is Regexp::Common?
by DrHyde (Prior) on Sep 17, 2003 at 15:09 UTC
    I think it's reasonable to depend on other modules - no matter how unusual - without bundling them with your code. It's not as if it's hard to grab modules and their dependencies using the CPAN module. And if the machine you want it on has no direct access to the outside world it's still not hard. I install modules on just such a machine quite often.

    In fact, I would go so far as to say that, in almost all cases, including some random module in your tarball is a BAD thing. Try searching CPAN for Test::More. It appears in several packages. It's at different versions in those packages too. Someone who's bundled it with their code may very well have bundled a buggy version. Better in my opinion to put it in the list of prerequisites for your module so that the CPAN or CPANPLUS module can fetch it for you.

Re: Dependencies, or, How Common is Regexp::Common?
by Anonymous Monk on Sep 17, 2003 at 03:48 UTC
    Not very common I'd say. Besides, its Perl comment matcher is so simplistic it simply matches *any* # character to a newline.
      You must know more than I do. Are there Perl comments that aren't "start at #, continue to the end of the line"? Or are there any strings that if they follow a #, it's suddenly not a comment anymore?

      Note that $RE{comment}{Perl} matches just that, a Perl comment. It's not a parser that extracts comments from valid Perl code. All the regexes from Regexp::Common are context free. Anything that is context sensitive should be coded by the user.

      Abigail

        I think the poster was talking about something like this:

        print " # This is not a comment\n";

        Or that insane Acme::Comment module, but people who use that in a real program deserve what they get :)
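To make the context-free point concrete, here is a sketch with a deliberately naive "from # to end of line" pattern. The pattern is a stand-in written for this example; it is not the pattern Regexp::Common actually ships.

```perl
use strict;
use warnings;

# A naive context-free comment matcher: a '#' and everything after it
# up to the end of the line. It knows nothing about string literals.
my $naive_comment = qr/#[^\n]*/;

# Single-quoted so the sample line is kept exactly as source text.
my $code = 'print " # This is not a comment\n";';

if ($code =~ /($naive_comment)/) {
    print "matched: $1\n";   # the match starts inside the string literal
}
```

Telling the two cases apart requires knowing that the # sits inside a quoted string, which is the context-sensitive work Abigail says belongs to the caller.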

        ----
        I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
        -- Schemer

        Note: All code is untested, unless otherwise stated
