http://www.perlmonks.org?node_id=11147275

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I am trying to get rid of unwanted chars from a filename, but I can't get the sub "filter" to work. Is there a better way of doing this?

#!/usr/bin/perl
use strict;
use warnings;

# Check filename
my $file_name = "xTest-1 [ ] 'copy'.png ";

my $input = filter( $file_name) || '';

print "\n*$input* \n";

### Looking to get this: xTest-1copy.png

sub filter {
    my $str = shift || '';
    $str =~ s/[^[:ascii:]]//g; # strip everything but ASCII characters
    #$str =~ s/[^!-~\s]//g;
    return $str;
}

Thanks for looking!

Re: Removing unwanted chars from filename.
by haukex (Archbishop) on Oct 06, 2022 at 17:39 UTC

    I would strongly recommend Corion's Text::CleanFragment.

    As for your regex, note that [:ascii:] is defined as "Any character in the ASCII character set", and the string you've shown here is entirely ASCII, so your code is "working". Perhaps you meant s/[^[:alnum:]]//g or e.g. s/[^[:alnum:]._-]//g instead? (Update: and though tr/A-Za-z0-9._-//cd should be faster, the above module handles Unicode well, so that's why I'd still recommend that)
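
    To make the difference concrete, here is a minimal sketch run against the string from the question. The sub names are invented for illustration, and the /r flag assumes Perl 5.14 or later:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper names; each returns a filtered copy of its argument.
    sub strip_non_ascii  { return $_[0] =~ s/[^[:ascii:]]//gr }     # the original attempt
    sub keep_alnum       { return $_[0] =~ s/[^[:alnum:]]//gr }     # letters and digits only
    sub keep_alnum_punct { return $_[0] =~ s/[^[:alnum:]._-]//gr }  # also keep . _ -

    my $name = "xTest-1 [ ] 'copy'.png ";
    print "<", strip_non_ascii($name),  ">\n";   # <xTest-1 [ ] 'copy'.png >  (all ASCII, so unchanged)
    print "<", keep_alnum($name),       ">\n";   # <xTest1copypng>
    print "<", keep_alnum_punct($name), ">\n";   # <xTest-1copy.png>
    ```

    Only the last variant gives the result asked for in the question.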

      G'day haukex,

      "... (Update: and though tr/A-Za-z0-9._-//cd should be faster, the above module handles Unicode well, so that's why I'd still recommend that)"

      I wasn't aware that transliteration would have a problem with Unicode. Here's a quick test:

      $ perl -Mutf8 -E '
          my $s = " abc \t ©︎ αβ гдж سشص ᚠᚢᚸ ⎈ ☂  .png";
          $s =~ tr/A-Za-z0-9._-//cd;
          say $s;
      '
      abc.png
      

      I'm using Perl v5.36; are there issues with earlier versions?

      I tested with a fair selection of Unicode characters but, obviously, I can't reasonably test them all. Are there problems with Unicode characters I didn't test?

      — Ken

        I was referring to the fact that the tr simply clobbers all Unicode characters, while Text::CleanFragment uses Text::Unidecode to try to turn them into ASCII:

        use warnings;
        use strict;
        use utf8;
        use Text::CleanFragment;
        
        my $s = "Ｈｅｌｌｏ．ｔｘｔ";  # fullwidth Unicode characters, not ASCII
        print clean_fragment($s), "\n";  # prints "Hello.txt"
        $s =~ tr/A-Za-z0-9._-//cd;
        print "<$s>\n";  # prints "<>" !
        

        (I've actually encountered filenames similar to the above in the wild)

Re: Removing unwanted chars from filename.
by hippo (Bishop) on Oct 06, 2022 at 18:40 UTC

    If you are stripping out all characters from a known set then tr is the way to go, for two reasons. Firstly, it's lightning fast. Secondly, you cannot accidentally construct a pattern of more than a single character. Here is a test to demonstrate.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use Test::More tests => 1;

    my $in   = q/xTest-1 [ ] 'copy'.png /;
    my $want = 'xTest-1copy.png';
    my $have = filter ($in);
    is $have, $want;

    sub filter {
        my $str = shift or return '';
        return $str =~ tr/A-Za-z0-9.-//cdr;
    }

    🦛

      sub filter {
          my $str = shift or return '';
          return $str =~ tr/A-Za-z0-9.-//cdr;
      }

      The my $str = shift or return ''; statement will cause a file name of '0' to be converted to the empty string.

      An alternative to avoid this problem is my ($str) = @_ or return '';

      While such a file name seems unlikely to be encountered in the wild, it's best to be prepared. :)
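
      A small sketch of the difference between the two guards, using hypothetical sub names and the same tr as in the parent post (the /r flag assumes Perl 5.14 or later):

      ```perl
      use strict;
      use warnings;

      # Hypothetical names: the same filter with two different argument guards.
      sub filter_shift_or {
          my $str = shift or return '';    # guard fails for undef, '' and '0'
          return $str =~ tr/A-Za-z0-9.-//cdr;
      }

      sub filter_list_assign {
          my ($str) = @_ or return '';     # guard fails only when no argument is passed
          return $str =~ tr/A-Za-z0-9.-//cdr;
      }

      print "<", filter_shift_or('0'),    ">\n";   # <>
      print "<", filter_list_assign('0'), ">\n";   # <0>
      ```

      Note that an explicit undef argument still gets past the list-assignment guard, so tr would then warn about an uninitialized value.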


      Give a man a fish:  <%-{-{-{-<

        Good catch! When composing my test I initially had that line as

        my $str = shift // '';

        but the idea of doing the rest of the processing (however swift) against the empty string rankled so I opted for the short-circuit instead. Should have just left it well alone :-)


        🦛

Re: Removing unwanted chars from filename.
by harangzsolt33 (Chaplain) on Oct 08, 2022 at 00:40 UTC
    But you want to include plain ASCII characters that are legal though. No?

    $FILENAME =~ tr| A-Za-z0-9\!\$\#\%\&\^\`\@\_\-\+\=\~\.\,\;\(\)\[\]\{\}\/\\||cd;

      No. Our anonymous friend clearly wants to remove all the whitespace and the square brackets, both of which your operation keeps.
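
      A quick check of that claim, as a sketch: the whitelist below is copied from the parent post and applied to the file name from the question. The variable names are made up for illustration.

      ```perl
      use strict;
      use warnings;

      my $name = "xTest-1 [ ] 'copy'.png ";

      # Wide whitelist from the parent post; it includes the space and [ ].
      (my $wide = $name) =~ tr| A-Za-z0-9\!\$\#\%\&\^\`\@\_\-\+\=\~\.\,\;\(\)\[\]\{\}\/\\||cd;

      # Narrow whitelist: letters, digits, dot and hyphen only.
      (my $narrow = $name) =~ tr/A-Za-z0-9.-//cd;

      print "<$wide>\n";     # <xTest-1 [ ] copy.png >
      print "<$narrow>\n";   # <xTest-1copy.png>
      ```

      Only the apostrophes are removed by the wide list; the spaces and brackets survive.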

      What's the story with all those backslashes, BTW?


      🦛

        Well, I know that the tr operator expects a list of characters, but some of them have special meaning such as the minus sign. Where perlop.html talks about the tr operator, it says,

        "A character range may be specified with a hyphen, so tr/A-J/0-9/ does the same replacement as tr/ACEGIBDFHJ/0246813579/. For sed devotees, y is provided as a synonym for tr. If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST has its own pair of quotes, which may or may not be bracketing quotes, e.g., tr[A-Z][a-z] or tr(+\-*/)/ABCD/."

        I don't understand what this means. I just know that certain characters have special meanings following the tr operator, so to be on the safe side, I just put a backslash before each.
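
        For reference, a short sketch of which characters actually need a backslash inside tr, going by perlop: the hyphen when it is neither first nor last in the list, the backslash itself, and whatever delimiter is in use. The sample string and variable names are made up for illustration:

        ```perl
        use strict;
        use warnings;

        my $s = 'a-b/c\d+e.f!g';    # single quotes, so the backslash is literal

        # A hyphen at the very end (or start) of the list is a literal hyphen.
        (my $keep_hyphen = $s) =~ tr/a-z-//cd;        # a-bcdefg

        # In the middle it is escaped so that it cannot be read as a range.
        (my $keep_plus_hyphen = $s) =~ tr/+\-.//cd;   # -+.

        # The delimiter and the backslash itself need a backslash...
        (my $keep_slashes = $s) =~ tr/\/\\//cd;       # /\

        # ...but another delimiter removes the need to escape '/'.
        (my $keep_slash_alt = $s) =~ tr{/}{}cd;       # /

        # Ordinary punctuation needs no escaping; tr does not interpolate.
        (my $keep_punct = $s) =~ tr/!$@%//cd;         # !
        ```

        Extra backslashes in front of punctuation are harmless, but only these few cases require them.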