http://www.perlmonks.org?node_id=11147276


in reply to Removing unwanted chars from filename.

I would strongly recommend Corion's Text::CleanFragment.

As for your regex, note that [:ascii:] is defined as "Any character in the ASCII character set", and the string you've shown here is entirely ASCII, so your code is "working". Perhaps you meant s/[^[:alnum:]]//g or e.g. s/[^[:alnum:]._-]//g instead? (Update: and though tr/A-Za-z0-9._-//cd should be faster, the above module handles Unicode well, so that's why I'd still recommend that)
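To make the difference concrete, here is a small sketch comparing the three approaches. The sample filename and the sub names are made up for illustration, and the `[^[:ascii:]]` substitution is only a guess at the shape of the original code, which isn't quoted here.

```perl
use strict;
use warnings;

# Hypothetical sample; like the string in the question, it is pure ASCII.
my $name = 'my file (v2) #final!.txt';

# Assumed shape of the original attempt: it strips only non-ASCII
# characters, so a pure-ASCII string comes back unchanged.
sub strip_non_ascii {
    my ($s) = @_;
    $s =~ s/[^[:ascii:]]//g;
    return $s;
}

# Keep alphanumerics plus dot, underscore and hyphen. The /a modifier
# restricts [:alnum:] to ASCII; without it, Unicode letters are kept too.
sub keep_safe_regex {
    my ($s) = @_;
    $s =~ s/[^[:alnum:]._-]//ga;
    return $s;
}

# Same idea with tr: /c complements the list, /d deletes what matches.
sub keep_safe_tr {
    my ($s) = @_;
    $s =~ tr/A-Za-z0-9._-//cd;
    return $s;
}

print strip_non_ascii($name), "\n";   # my file (v2) #final!.txt
print keep_safe_regex($name), "\n";   # myfilev2final.txt
print keep_safe_tr($name), "\n";      # myfilev2final.txt
```

Note that the /a modifier needs Perl 5.14 or later; drop it if you do want [:alnum:] to keep non-ASCII letters and digits.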

Re^2: Removing unwanted chars from filename.
by kcott (Archbishop) on Oct 06, 2022 at 22:40 UTC

    G'day haukex,

    "... (Update: and though tr/A-Za-z0-9._-//cd should be faster, the above module handles Unicode well, so that's why I'd still recommend that)"

    I wasn't aware that transliteration would have a problem with Unicode. Here's a quick test:

    $ perl -Mutf8 -E '
        my $s = " abc \t ©︎ αβ гдж سشص ᚠᚢᚸ ⎈ ☂  .png";
        $s =~ tr/A-Za-z0-9._-//cd;
        say $s;
    '
    abc.png
    

    I'm using Perl v5.36; are there issues with earlier versions?

    I tested with a fair selection of Unicode characters but, obviously, I can't reasonably test them all. Are there problems with Unicode characters I didn't test?

    — Ken

      I was referring to the fact that the tr simply clobbers all Unicode characters, while Text::CleanFragment uses Text::Unidecode to try to turn them into ASCII:

      use warnings;
      use strict;
      use utf8;
      use Text::CleanFragment;
      
      my $s = "Ｈｅｌｌｏ．ｔｘｔ";  # fullwidth characters, not ASCII
      print clean_fragment($s), "\n";  # prints "Hello.txt"
      $s =~ tr/A-Za-z0-9._-//cd;
      print "<$s>\n";  # prints "<>" !
      

      (I've actually encountered filenames similar to the above in the wild)
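If installing a CPAN module isn't an option, a rough core-only approximation is to normalize before stripping. This is a sketch of a different technique, not what Text::CleanFragment or Text::Unidecode do internally: NFKD from the core Unicode::Normalize module only helps for characters that decompose to ASCII (fullwidth forms, accented Latin letters), so scripts such as Greek or Cyrillic are still dropped. The sub name `fold_then_strip` is made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose compatibility and accented characters
# into their base characters where Unicode defines such a mapping,
# then delete everything outside the allowed ASCII set.
sub fold_then_strip {
    my ($s) = @_;
    $s = NFKD($s);
    $s =~ tr/A-Za-z0-9._-//cd;
    return $s;
}

print fold_then_strip("Ｈｅｌｌｏ．ｔｘｔ"), "\n";  # Hello.txt
print fold_then_strip("Café.txt"), "\n";            # Cafe.txt
print fold_then_strip("αβ.png"), "\n";              # .png (Greek is still lost)
```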

        Thanks for the clarification.

        — Ken