 
PerlMonks  

Regex to detect file name

by lirc201 (Initiate)
on Jul 05, 2018 at 18:34 UTC (#1217978=perlquestion)

lirc201 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying the following:

if($view_tag =~ m/[^A-Za-z0-9_\-\.]/) {

These are the allowable characters in the file name (input from the command line); however, I do not want the first character to be anything other than A-Za-z0-9. I've tried several combinations but have not been successful.

Thanks!

Replies are listed 'Best First'.
Re: Regex to detect file name
by Corion (Pope) on Jul 05, 2018 at 18:41 UTC

    I would instead specify what is allowed, which makes for a simpler regular expression in your case:

    if($view_tag =~ m/\A[A-Za-z0-9][A-Za-z0-9_\-\.]+\z/) {
        # everything is OK
    } else {
        die "Invalid/disallowed filename '$view_tag'";
    };
      Thank you so much. I actually just changed the =~ to !~ and it works as expected with your regex.
        I actually just changed the =~ to !~ and it works as expected with your regex.

        I don't understand from the OP what you want, and so I don't understand how the logical negation of Corion's regex gives you what you want. Can you elucidate, perhaps with some matching and non-matching example strings? Are you sure you're really matching what you think you're matching? (Please see perlre, perlretut, and perlrequick.)
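As a rough illustration of the kind of matching and non-matching examples being asked for, here is a minimal sketch that runs Corion's pattern, negated with !~, over a handful of strings. The sample strings are hypothetical and were chosen only for illustration; they are not from the OP.

```perl
use strict;
use warnings;

# Corion's allow-list pattern from above, copied verbatim.
my $valid = qr/\A[A-Za-z0-9][A-Za-z0-9_\-\.]+\z/;

# Hypothetical sample inputs (assumptions, not taken from the thread).
my @samples = ('view_1.0', 'a-b', '_hidden', '.rc', 'bad name', 'a');

for my $view_tag (@samples) {
    # !~ is true when the pattern does NOT match,
    # so this branch is taken for disallowed names.
    if ($view_tag !~ $valid) {
        print "'$view_tag' rejected\n";
    }
    else {
        print "'$view_tag' accepted\n";
    }
}
```

Note that the single-character name 'a' is rejected here: the + quantifier after the second character class requires at least one more character after the first.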


        Give a man a fish:  <%-{-{-{-<

Re: Regex to detect file name
by kcott (Bishop) on Jul 06, 2018 at 12:12 UTC

    G'day lirc201,

    Welcome to the Monastery.

    You've missed some information which could be important. Is there a minimum number of characters? Filenames can't start with [_.-], but can they end with all, some or none of those?

    Just this week, I implemented something along these lines for production code. The requirements were: names could be just one character long; the start and end characters (the same character for one-character names) must match [A-Za-z0-9]; the middle characters for names with three or more characters must match [A-Za-z0-9_.-]. The regex for this:

    qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}

    Note that, in a bracketed character class, '.' is not special and '-' is only special when between two characters to form a range: as you can see, you don't actually need to escape any characters.
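The point about escaping can be checked with a small sketch that compares the escaped class from the original question against the unescaped form. The sample strings are assumptions picked for illustration.

```perl
use strict;
use warnings;

my $escaped   = qr/\A[A-Za-z0-9_\-\.]+\z/;   # style used in the original question
my $unescaped = qr/\A[A-Za-z0-9_.-]+\z/;     # '-' placed last, nothing escaped

# Hypothetical inputs: three that use the punctuation, two that must fail.
my @samples = ('a.b', 'a-b', 'a_b', 'a+b', 'a b');

my $disagreements = 0;
for my $s (@samples) {
    my $e = $s =~ $escaped   ? 1 : 0;
    my $u = $s =~ $unescaped ? 1 : 0;
    $disagreements++ if $e != $u;
    printf "%-5s escaped=%d unescaped=%d\n", $s, $e, $u;
}
print "disagreements: $disagreements\n";
```

If '.' were still a wildcard inside the brackets, 'a+b' and 'a b' would match; both patterns reject them.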

    Here's a limited test:

    $ perl -E '
        my @x = (qw{A AA AAA _ __ ___ -A A- A.A A. .A A-A A_A}, "A\n", "A\tA");
        my $re = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z};
        say "|$_| is ", /$re/ ? "OK" : "BAD" for @x
    '
    |A| is OK
    |AA| is OK
    |AAA| is OK
    |_| is BAD
    |__| is BAD
    |___| is BAD
    |-A| is BAD
    |A-| is BAD
    |A.A| is OK
    |A.| is BAD
    |.A| is BAD
    |A-A| is OK
    |A_A| is OK
    |A
    | is BAD
    |A	A| is BAD

    Modify that to suit your own filename specifications. Add some more tests which should probably include digits and lowercase letters.

    — Ken

      Use of POSIX character classes (see perlre, perlrecharclass) and /x can make regexes easier on the eye:
          my $re = qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms;
      is equivalent.


      Give a man a fish:  <%-{-{-{-<

        is equivalent

        No, it really isn't:

        use strict;
        use warnings;
        use Test::More tests => 2;

        my $in = "\N{LATIN SMALL LETTER C WITH CEDILLA}";
        like ($in, qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}, 'kcott');
        like ($in, qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms, 'AnomalousMonk');

        G'day AnomalousMonk,

        With regard to the POSIX character class, ++hippo has already pointed out the problem with that. You can certainly be forgiven for that because the documentation appears to be wrong. From "perlrecharclass: POSIX Character Classes":

        Perl recognizes the following POSIX character classes:

        ...

        2. alnum Any alphanumeric character ("[A-Za-z0-9]").

        I rarely use the POSIX classes and wasn't aware of that discrepancy. Anyway, while possibly "easier on the eye", that's likely to result in a fair amount of frustration for someone attempting to perform debugging and assuming the documentation is correct.
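One possible way to keep the POSIX class while getting the behaviour the documentation describes is the /a modifier (Perl 5.14 or later), which restricts the POSIX classes, \d, \s and \w to the ASCII range. This is only a sketch of that option, not something proposed elsewhere in this thread; code points are written as escapes so the output stays plain ASCII.

```perl
use strict;
use warnings;

my @chars = (
    [ 'LATIN CAPITAL LETTER A',     "A"       ],   # U+0041
    [ 'GREEK CAPITAL LETTER ALPHA', "\x{391}" ],   # U+0391
);

my $posix_unicode = qr/\A[[:alnum:]]\z/;    # Unicode rules apply to U+0391
my $posix_ascii   = qr/\A[[:alnum:]]\z/a;   # /a: ASCII-only semantics
my $explicit      = qr/\A[A-Za-z0-9]\z/;    # explicit ASCII ranges

for my $c (@chars) {
    my ($name, $char) = @$c;
    printf "%-27s posix=%d posix/a=%d explicit=%d\n",
        $name,
        ($char =~ $posix_unicode ? 1 : 0),
        ($char =~ $posix_ascii   ? 1 : 0),
        ($char =~ $explicit      ? 1 : 0);
}
```

With /a, the POSIX class and the explicit ranges agree on both characters; without it, the Greek alpha is accepted.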

        The problem could be further exacerbated when input characters may not appear to be ones that should be failing. While hippo's example using "LATIN SMALL LETTER C WITH CEDILLA" (ç) was fairly obvious, the glyphs for some characters (depending on the font) may be identical or so similar that it's difficult to tell them apart. Consider "LATIN CAPITAL LETTER A" (A) and "GREEK CAPITAL LETTER ALPHA" (Α):

        $ perl -C -E '
            use utf8;
            say "$_ (", ord $_, "): ", /\A[A-Za-z0-9]\z/ ? "✓" : "✗"
                for qw{A Α}
        '
        A (65): ✓
        Α (913): ✗
        
        $ perl -C -E '
            use utf8;
            say "$_ (", ord $_, "): ", /\A[[:alnum:]]\z/ ? "✓" : "✗"
                for qw{A Α}
        '
        A (65): ✓
        Α (913): ✓
        

        As far as the 'x' modifier goes, I don't disagree that it can improve readability; however, where it's felt necessary to use it — either because the regex is particularly complex or it's code that junior developers will need to deal with — spreading the regex across multiple lines and including comments might be even better:

        my $re = qr{
            \A                      # Assert start of string
            [A-Za-z0-9]             # Must start with one of these
            (?:                     # Followed by either
                [A-Za-z0-9_.-]*?    # Zero or more of these
                [A-Za-z0-9]         # But ending with one of these
              |                     # OR
                                    # Nothing
            )
            \z                      # Assert end of string
        }x;

        And, with 5.26 or later, perhaps even clearer as:

        my $re = qr{
            \A                          # Assert start of string
            [A-Z a-z 0-9]               # Must start with one of these
            (?:                         # Followed by either
                [A-Z a-z 0-9 _ . -]*?   # Zero or more of these
                [A-Z a-z 0-9]           # But ending with one of these
              |                         # OR
                                        # Nothing
            )
            \z                          # Assert end of string
        }xx;

        We've already had exhaustive discussions about the 'm' and 's' modifiers. Use them if you want to follow PBP suggestions but understand that they do absolutely nothing here: there's no '^' or '$' assertions that 'm' might affect; there's no '.' (outside a bracketed character class) that 's' might affect.
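A quick way to check the claim about 'm' and 's' is to run the same pattern with and without them over inputs that contain newlines. This is a minimal sketch; the sample strings are assumptions chosen to exercise the cases where those modifiers would normally matter.

```perl
use strict;
use warnings;

my $plain   = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z};
my $with_ms = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}ms;

# Hypothetical inputs, including embedded and trailing newlines.
my @samples = ('A', 'A.A', 'A.', "A\n", "A\nB", "A\tA");

my $differences = 0;
for my $s (@samples) {
    my $p = $s =~ $plain   ? 1 : 0;
    my $m = $s =~ $with_ms ? 1 : 0;
    $differences++ if $p != $m;
    (my $shown = $s) =~ s/\n/\\n/g;   # make newlines visible
    $shown =~ s/\t/\\t/g;             # make tabs visible
    printf "%-5s plain=%d ms=%d\n", $shown, $p, $m;
}
print "differences: $differences\n";
```

Because the pattern uses \A and \z rather than ^ and $, and has no '.' outside a bracketed class, both versions give the same verdict for every input.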

        — Ken

Re: Regex to detect file name
by Marshall (Canon) on Jul 05, 2018 at 19:09 UTC
    The Perl regex "shortcut" for A-Za-z0-9 is "\w".

    Show a few more examples and the Monks will help you write an appropriate regex.

      The Perl regex "shortcut" for A-Za-z0-9 is "\w".
      Yes, this shortcut is probably good enough for the case in point. But, without wanting to nitpick, let me just add that \w also matches the underscore (_). This would be inconsequential in most cases, but it seems that the OP does not want the string to start with an underscore.
        and unicode
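Both caveats (underscore and Unicode) can be seen in a short sketch that tests single characters against \w and against the explicit ranges. The characters are illustrative assumptions; the non-ASCII one is written as an escape and reported by code point.

```perl
use strict;
use warnings;

# 'a' and '7' are plain ASCII alphanumerics; '_' and U+0391
# (GREEK CAPITAL LETTER ALPHA) are the interesting cases.
my @first_chars = ('a', '7', '_', "\x{391}");

for my $c (@first_chars) {
    my $word     = $c =~ /\A\w\z/          ? 1 : 0;
    my $explicit = $c =~ /\A[A-Za-z0-9]\z/ ? 1 : 0;
    printf "U+%04X  \\w=%d  explicit=%d\n", ord($c), $word, $explicit;
}
```

\w accepts all four characters, while [A-Za-z0-9] accepts only the first two.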
Re: Regex to detect file name
by Anonymous Monk on Jul 06, 2018 at 15:00 UTC

    It's probably beating a dead horse at this point, but this looks like what you're trying to do, with explanations:

    # original
    if($view_tag =~ m/[^A-Za-z0-9_\-\.]/) {

    # corrected
    if($view_tag =~ m/^[A-Za-z0-9][\w.-]*/) {

    # to anchor the match at the beginning,
    # the ^ comes immediately after the first /:
    m/^[A-Z ...

    # match the first character
    [A-Za-z0-9]

    # matching subsequent characters:
    [\w.-]*

    # \w is the same as [A-Za-z0-9_]
    # . does not need to be escaped when inside brackets
    # put - last inside brackets, otherwise it may indicate a range
    # the quantifier * matches 0 or more instances of the characters
      # corrected if($view_tag =~ m/^[A-Za-z0-9][\w.-]*/) {
      I think that you need to add an end-of-string anchor at the end of the pattern to prevent matching if there are some unwanted or spurious characters after the last character matching [\w.-].
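The effect of the missing anchor can be shown with a small sketch comparing the pattern as posted, the same pattern ending in $, and the same pattern ending in \z. The sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;

my $unanchored = qr/^[A-Za-z0-9][\w.-]*/;     # as posted: no end anchor
my $dollar     = qr/^[A-Za-z0-9][\w.-]*$/;    # $ still allows one trailing newline
my $strict     = qr/^[A-Za-z0-9][\w.-]*\z/;   # \z: nothing may follow

# Hypothetical inputs: one clean name and three with trailing junk.
my @samples = ('abc', 'abc!!', 'abc def', "abc\n");

for my $s (@samples) {
    (my $shown = $s) =~ s/\n/\\n/g;   # make the newline visible
    printf "%-8s unanchored=%d dollar=%d strict=%d\n",
        $shown,
        ($s =~ $unanchored ? 1 : 0),
        ($s =~ $dollar     ? 1 : 0),
        ($s =~ $strict     ? 1 : 0);
}
```

Without an end anchor every sample matches, because the pattern only has to match a prefix. With $ the string ending in a newline is still accepted; \z rejects everything except 'abc'.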
Re: Regex to detect file name
by Anonymous Monk on Jul 05, 2018 at 20:36 UTC
    it is easy to do with two regexes, such as:
    $f =~ /^[\w\.]+$/ && $f =~ /^\w/
    The first checks what the entire filename consists of, and the second checks the leading letter. There are endless variations on the same idea.
      it is easy to do with two regexes, such as: ...
      Easy, perhaps, but wrong. The suggested two regexes would match a string starting with an underscore.
        DB<1> $f = "_foo";
        DB<2> print "Success" if $f =~ /^[\w\.]+$/ && $f =~ /^\w/;
      Success
        DB<3>
      Update (at 7:50 UTC): sorry, the last sentence below is wrong, I had missed the $ at the end of the first pattern.

      It would also accept garbage characters in the middle.

      That's a waste of resources:
      /^\w[\w\.]*$/
        This is not a game of Name That Tune. There are no brownie-points for doing it in one regex versus two. If it does the job, is clear, and easy to maintain, Just Do It.
Node Type: perlquestion [id://1217978]
Approved by marto
Front-paged by stevieb