Hello Monks,

We had a recent question by nikolay about Russian encodings, To decode URL-decoded UTF-8 string., that got solved in several ways. After replicating the results and reading that the same could be achieved with Path::Tiny, I thought, "gosh, I could use this to make Cyrillic paths with my html templating system."

I would need to update the cloning script, the main template .pl, and the html template, now html5.pm. For reference, this was my cloning script before I started on this change, and it worked just fine with English in @ARGV. I'll place abridged output and a source listing between readmore tags.

$ ./11.clone.pl 2.med 1.english 1.pop
-------------
making directories
abs to template is /home/bob/1.scripts/pages/1.english/template_stuff
string abs from is /home/bob/1.scripts/pages/2.med/template_stuff
-------------
...
-------------
dollar one is 1
munge is 1.english1.css
name is /home/bob/1.scripts/pages/1.english/template_stuff/1.english1.css
return2 is /home/bob/1.scripts/pages/1.english/template_stuff/1.english1.css
-------------
matched is /home/bob/1.scripts/pages/2.med/2.med1.pl
b is /home/bob/1.scripts/pages/1.english/1.english1.pl
return3 is /home/bob/1.scripts/pages/1.english/1.english1.pl
end of clone
...
$ cat 11.clone.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
use open qw/:std :utf8/;
use Path::Tiny;

#  This script clones the template directory in $1 to $2.
#  Some names need munging.
#  $from is a populated child directory; $to is child dir to be created.  $pop is the folder with the data.
my ( $from, $to, $pop ) = @ARGV;
my $ts = "template_stuff";
my $current = Path::Tiny->cwd;

#say "current is $current";
say "-------------";
say "making directories";

# define the paths within the target directory:
my $abs_to = path( $current, $to, $ts );
$abs_to->mkpath;
say "abs to template is $abs_to";

# $from template directory:
my $abs_from = path( $current, $from, $ts );
say "string abs from is $abs_from";
say "-------------";
say "copying files";
foreach my $child ( $abs_from->children(qr/\.(txt|pm|css|tmpl|pl|sh)$/) ) {
  next unless $child->is_file;
  my $base = $child->basename;

  #syntax is from to to
  my $return = path($child)->copy( $abs_to, $base );
  if ( $base =~ m/\.(pl|sh)$/ ) {
    $return->chmod(0755);
  }
  say "return is $return";
}
say "-------------";

# copy css file to template with munged name
foreach my $child ( $abs_from->children ) {
  my $base = $child->basename;
  if ( $base =~ m/^$from(\d*)\.css$/ ) {

    #say "matching is $base";
    say "dollar one is $1";
    my $munge = $to . "1" . ".css";
    say "munge is $munge";
    my $name = path( $abs_to, $munge );
    say "name is $name";

    #syntax is from to to
    my $return = path( $abs_from, $base )->copy($name);
    say "return2 is $return";
  }
}

## munge and copy executable, change permissions
say "-------------";
my $d = path( $current, $from );

# @matching will be an array of Path::Tiny objects
my @matching = $d->children(qr/$from(\d*)\.pl$/i);
@matching = sort @matching;
say "matched is @matching";
my $winner  = pop @matching;
my $newfile = "${to}1.pl";
my $b = path( $current, $to, $newfile );
print "b is $b\n";

# $winner will already be a Path::Tiny object
my $return3 = $winner->copy("$b");
say "return3 is $return3";
$return3->chmod(0755);
say "end of clone";

my $abs_pop = path( $current, $pop, $ts );
say "string abs pop is $abs_pop";
my $string_pop = "$abs_pop";
foreach my $child ( $abs_pop->children ) {
  next unless $child->is_dir;
  say "e is $child";
  my $base_dir = $child->basename;
  say "base dir is $base_dir";
  my @dirs = path( $current, $to, $ts, $base_dir )->mkpath;
  say "dirs are @dirs";
  my $pop_from = $child;
  next if ( $child =~ m/logs/ );
  foreach my $pchild ( $pop_from->children ) {
    say "default is $pchild\n";
    my $base = $pchild->basename;
    say "base is $base";
    my $to_name = path( @dirs, $base );
    say "to name is $to_name";
    my $return4 = path($pchild)->copy($to_name);
    say "return4 is $return4";
  }
}
my $exec_path = path( $current, $to );
my $return5 = chdir($exec_path);
say "return5 is $return5";
system("pwd ");
system("ls ");

#system ("./$newfile ");
$ 

The paint is barely dry on this from its switch to Path::Tiny. The syntax for https://metacpan.org/pod/Path::Tiny#children was new, too. I was trying to use the module methods as much as possible, if only to learn how it works.

I looked at the effective lines in To decode URL-decoded UTF-8 string. and started applying them to any situations where my values looked like a bunch of dominoes. I went at it like a monkey wires a gfci outlet, enumerating the possibilities, sometimes shocking himself, but what I ended up with was writing:

my ( $from, $to, $pop ) = @ARGV;
say "argv is @ARGV";

as

my $encoded_from = url_encode($from);
say "encoded cyrillic is $encoded_from";
$from = Encode::decode( 'utf8', uri_unescape($encoded_from) );
say "from is $from";

my $encoded_to = url_encode($to);
say "encoded cyrillic is $encoded_to";
$to = Encode::decode( 'utf8', uri_unescape($encoded_to) );
say "to is $to";

my $encoded_pop = url_encode($pop);
say "encoded cyrillic is $encoded_pop";
$pop = Encode::decode( 'utf8', uri_unescape($encoded_pop) );
say "pop is $pop";

I've made quite a bit of progress without even knowing what the foregoing really means. I simply know when I can read the Cyrillic. I chase down the first, then the next, place where the dominoes, the percentages, or the janky U creatures appear.

Q1) What does the above syntax do with Cyrillic inputs? Why does it not foul up when I use English?
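
On Q1, a minimal sketch of what the round trip appears to do, using only core Perl. The helpers percent_encode and percent_decode are hypothetical stand-ins for url_encode and uri_unescape, not those modules' actual implementations, and the sketch assumes the shell hands the script UTF-8 bytes. Under that assumption, escaping and then unescaping returns the very same bytes, so the only step that changes anything is Encode::decode. Plain ASCII such as 1.pop is identical as bytes and as characters, which would explain why English survives untouched.

```perl
use strict;
use warnings;
use utf8;                          # the literals below are written in UTF-8
use Encode qw(decode encode);

# Hypothetical stand-in for url_encode: percent-escape every byte that is
# not an unreserved ASCII character.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Hypothetical stand-in for uri_unescape: turn %XX back into one byte.
sub percent_decode {
    my ($escaped) = @_;
    $escaped =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $escaped;
}

# Assumption: this is what arrives in @ARGV, i.e. UTF-8 encoded bytes.
my $raw = encode( 'UTF-8', '2.дом' );

my $escaped = percent_encode($raw);       # 2.%D0%B4%D0%BE%D0%BC
my $back    = percent_decode($escaped);   # the same bytes as $raw
my $chars   = decode( 'UTF-8', $back );   # five characters: 2 . д о м

binmode STDOUT, ':encoding(UTF-8)';
print "escaped is $escaped\n";
print "round trip returned the original bytes\n" if $back eq $raw;
print "decoded is $chars, length ", length($chars), "\n";
```

If the round trip really is a no-op for these inputs, then Encode::decode( 'utf8', $from ) on its own should give the same result as the three-step version.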

Q2) Without having intended to write the same code 3 times over, what would be a good way to treat @ARGV analogously in one fell swoop?
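
On Q2, one way this could look is a single map over the argument list. This is a sketch under assumptions, not a tested replacement for the script: decode_args is a hypothetical helper name, and @fake_argv simulates what the shell would pass so the example runs without command-line input.

```perl
use strict;
use warnings;
use utf8;                          # the literals below are written in UTF-8
use Encode qw(decode encode);

# Hypothetical helper: decode every argument from UTF-8 bytes into Perl
# character strings in one pass.
sub decode_args {
    my @raw = @_;
    return map { decode( 'UTF-8', $_ ) } @raw;
}

# Simulated command line: the shell passes UTF-8 encoded bytes.
my @fake_argv = map { encode( 'UTF-8', $_ ) } ( '2.дом', '10.жизнь', '1.pop' );

# In the real script this would be decode_args(@ARGV).
my ( $from, $to, $pop ) = decode_args(@fake_argv);

binmode STDOUT, ':encoding(UTF-8)';
print "from is $from\n";
print "to is $to\n";
print "pop is $pop\n";
```

The same helper could be applied to any other byte string that needs decoding, which keeps the decoding in one place instead of repeated per variable.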

14.clone.pl fails to match the .pl executable where it had before. I'm not sure how well this will be represented in the monastery's markup, but I will try. Since it isn't working, it's verbose, so I'll abridge output, list source, then highlight the region that isn't clicking:

$ ./14.clone.pl 2.дом 10.жизнь 1.pop
argv is 2.дом 10.жизнŒ 1.pop
encoded cyrillic is 2.%D0%B4%D0%BE%D0%BC
from is 2.дом
encoded cyrillic is 10.%D0%B6%D0%B8%D0%B7%D0%BD%D1%8C
to is 10.жизнь
encoded cyrillic is 1.pop
pop is 1.pop
encoded cyrillic is %2Fhome%2Fbob%2F1.scripts%2Fpages
current is /home/bob/1.scripts/pages
-------------
making directories
abs to template is /home/bob/1.scripts/pages/10.жизнь/template_stuff
string abs from is /home/bob/1.scripts/pages/2.дом/template_stuff
-------------
copying files
base is 1.cf.pl
return is /home/bob/1.scripts/pages/10.жизнь/template_stuff/1.cf.pl
...
base is code1.tmpl
return is /home/bob/1.scripts/pages/10.жизнь/template_stuff/code1.tmpl
-------------
...
child is /home/bob/1.scripts/pages/2.дом/template_stuff/2.дом1.css
base2 is 2.дом1.css
dollar one is 1
munge is 10.жизнь1.css
name is /home/bob/1.scripts/pages/10.жизнь/template_stuff/10.жизнь1.css
return2 is /home/bob/1.scripts/pages/10.жизнь/template_stuff/10.жизнь1.css
...
--------munge, copy executable
d is /home/bob/1.scripts/pages/2.дом
2.дом1.html  2.дом1.pl.bak  2.дом3.html  2.дом5.html  template_stuff
2.дом1.pl    2.дом2.html    2.дом4.html  2.дом6.html
after match attempt, from is 2.дом
matched is 
b is /home/bob/1.scripts/pages/10.жизнь/10.жизнь1.pl
Can't call method "copy" on an undefined value at ./14.clone.pl line 127.
$ 

# source;

$ cat 14.clone.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
use open qw/:std :utf8/;
use Path::Tiny;
use Encode;
use URI::Escape;
use URL::Encode qw{ url_decode url_encode};
binmode STDOUT, ":utf8";

#  This script clones the template directory in $1 to $2.
#  Some names need munging.
#  $from is a populated child directory; $to is child dir to be created.  $pop is the folder with the data.
######
## enabling cyrillic
my ( $from, $to, $pop ) = @ARGV;

say "argv is @ARGV";

my $encoded_from = url_encode($from);
say "encoded cyrillic is $encoded_from";
$from = Encode::decode( 'utf8', uri_unescape($encoded_from) );
say "from is $from";

my $encoded_to = url_encode($to);
say "encoded cyrillic is $encoded_to";
$to = Encode::decode( 'utf8', uri_unescape($encoded_to) );
say "to is $to";

my $encoded_pop = url_encode($pop);
say "encoded cyrillic is $encoded_pop";
$pop = Encode::decode( 'utf8', uri_unescape($encoded_pop) );
say "pop is $pop";

my $current = Path::Tiny->cwd;

my $encoded_current = url_encode($current);
say "encoded cyrillic is $encoded_current";
$current = Encode::decode( 'utf8', uri_unescape($encoded_current) );
say "current is $current";
say "-------------";
say "making directories";

# define the paths within the target directory:
my $ts = "template_stuff";
my $abs_to = path( $current, $to, $ts );
$abs_to->mkpath;
say "abs to template is $abs_to";

# $from template directory:
my $abs_from = path( $current, $from, $ts );
say "string abs from is $abs_from";
say "-------------";
say "copying files";

foreach my $child ( $abs_from->children(qr/\.(txt|pm|tmpl|pl|sh)$/) ) {
  next unless $child->is_file;
  my $base = $child->basename;
  say "base is $base";

  #syntax is from to to
  my $return = path($child)->copy( $abs_to, $base );

  if ( $base =~ m/\.(pl|sh)$/ ) {
    $return->chmod(0755);
  }

  say "return is $return";

}
say "-------------";

# copy css file to template with munged name
foreach my $child ( $abs_from->children ) {
  say "child is $child";
  my $base = $child->basename;
  ### added to handle cyrillic
  my $base2 = Encode::decode( 'utf8', uri_unescape($base) );
  say "base2 is $base2";

  if ( $base2 =~ m/^$from(\d*)\.css$/ ) {

    #say "matching is $base";
    say "dollar one is $1";
    my $munge = $to . "1" . ".css";
    say "munge is $munge";
    my $name = path( $abs_to, $munge );
    say "name is $name";

    #syntax is from to to
    my $return = path( $abs_from, $base2 )->copy($name);
    say "return2 is $return";

  }
}

say "--------munge, copy executable";
####### current point of failure with cyrillic in $from
## munge and copy executable, change permissions
my $d = path( $current, $from );
say "d is $d";
system("cd $d; ls");

#  tried this and it's illegal:
#my @matching = $d->children(qr/$from(\d*)\.pl$/i {binmode => ":raw"});
# let's try chomping
#my $from2 = chomp($from);

#my $from2 = Encode::decode( 'utf8', uri_unescape($from) );
#say "from2 is $from2";
#my @matching = $d->children(qr/$from2(\d*)\.pl$/i);
#say "after match attempt, from is $from2";
# Wide character at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.

my @matching = $d->children(qr/$from(\d*)\.pl$/i);
say "after match attempt, from is $from";
@matching = sort @matching;
say "matched is @matching";

my $winner  = pop @matching;
my $newfile = "${to}1.pl";

my $b = path( $current, $to, $newfile );
print "b is $b\n";

# $winner will already be a Path::Tiny object
my $return3 = $winner->copy("$b");
say "return3 is $return3";
$return3->chmod(0755);
say "end of clone";

my $abs_pop = path( $current, $pop, $ts );
say "string abs pop is $abs_pop";
my $string_pop = "$abs_pop";

foreach my $child ( $abs_pop->children ) {
  next unless $child->is_dir;

  say "e is $child";
  my $base_dir = $child->basename;
  say "base dir is $base_dir";
  my @dirs = path( $current, $to, $ts, $base_dir )->mkpath;
  say "dirs are @dirs";
  my $pop_from = $child;
  next if ( $child =~ m/logs/ );
  foreach my $pchild ( $pop_from->children ) {
    say "default is $pchild\n";
    my $base = $pchild->basename;
    say "base is $base";
    my $to_name = path( @dirs, $base );
    say "to name is $to_name";
    my $return4 = path($pchild)->copy($to_name);
    say "return4 is $return4";

  }

}
my $exec_path = path( $current, $to );
my $return5 = chdir($exec_path);
say "return5 is $return5";
system("pwd ");
system("ls ");

#system ("./$newfile ");

$ 


That seems to render faithfully: точное представление. It would seem that using the pre tags is the right way to go if you don't want to shred your unicode. Meanwhile, you can use normal cyrillic between p tags. Put it in c tags, and you get the ampersand and hashtag coding, which seems to be yet another encoding:

$ pwd
/home/bob/1.scripts/pages/2.дом
$ ls
2.дом1.html  2.дом1.pl.bak  2.&#1076;ом3.html  2.дом5.html  template_stuff
2.дом1.pl    2.дом2.html    2.&#1076;ом4.html  2.дом6.html
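
The ampersand-and-hash form appears to be decimal numeric character references rather than a separate encoding: each non-ASCII character is written as &#N; where N is its Unicode code point in decimal (д is U+0434, which is 1076). A small sketch of that conversion follows; to_ncr is a hypothetical helper name, and this is only a guess at what the c tags do, not the monastery's actual code.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is written in UTF-8

# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference of the form &#NNNN;.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $name = '2.дом1.pl';
my $ncr  = to_ncr($name);

# The result is pure ASCII, so no output layer is needed to print it.
print "$ncr\n";
```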

I present with the problem of not matching a file. You, the crack diagnosticians at the monastery, are going to wonder whether the file I seem unable to match exists. It does. Cyrillic file names present challenges, in particular if you only have an English keyboard. You will see that this directory contains a perl executable whose name matches the directory's name, and that is by design.

$ pwd
/home/bob/1.scripts/pages/2.дом
$ ls
2.дом1.html  2.дом1.pl.bak  2.дом3.html  2.дом5.html  template_stuff
2.дом1.pl    2.дом2.html    2.дом4.html  2.дом6.html
$ cd template_stuff/
$ ls
...
2.дом1.css      config3.tmpl    html5.pm                 utils2.pm
3.irs.sh        football1.pm    nibley1.pm
$ 

Now is when I would like to pull out a self-contained example, but I'm still working on making an archive of it, and I'm right at the edge of what I can pull off, so I can't share an executable that operates in a filesystem. But I think you can work with the same ideas given that you have a couple words to start with, say, дом and жизнь .

I just got the munged .css file to copy. Its name only wanted to be decoded, and it threw an error if I treated it as I've treated the other scalar values above. What remains is the executable file. Path::Tiny is not finding the preimage, as it does not match the word I'm giving it. The words look the same to me.

--------munge, copy executable
d is /home/bob/1.scripts/pages/2.дом
2.дом1.html  2.дом1.pl.bak  2.дом3.html  2.дом5.html  template_stuff
2.дом1.pl    2.дом2.html    2.дом4.html  2.дом6.html
after match attempt, from is 2.дом
matched is 
b is /home/bob/1.scripts/pages/10.жизнь/10.жизнь1.pl
Can't call method "copy" on an undefined value at ./14.clone.pl line 127.
$ 

Relevant script part:

say "--------munge, copy executable";
####### current point of failure with cyrillic in $from
## munge and copy executable, change permissions
my $d = path( $current, $from );
say "d is $d";
system("cd $d; ls");

#  tried this and it's illegal:
#my @matching = $d->children(qr/$from(\d*)\.pl$/i {binmode => ":raw"});
# let's try chomping
#my $from2 = chomp($from);

#my $from2 = Encode::decode( 'utf8', uri_unescape($from) );
#say "from2 is $from2";
#my @matching = $d->children(qr/$from2(\d*)\.pl$/i);
#say "after match attempt, from is $from2";
# Wide character at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.

my @matching = $d->children(qr/$from(\d*)\.pl$/i);
say "after match attempt, from is $from";
@matching = sort @matching;
say "matched is @matching";

my $winner  = pop @matching;
my $newfile = "${to}1.pl";

I commented out a few of my tries to get this thing to match. If they were strikes, I'd be out. So I'd like to humbly ask for help trying to understand why this works at all, and also why this fails to match as the executable is being made. Thank you for your comment.
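
One possible explanation, offered as a hypothesis rather than a diagnosis: $from has been decoded into characters, while the names that children tests against the regex may still be the raw UTF-8 bytes returned by the directory read. Characters compared against bytes look the same on screen but do not match. The sketch below illustrates that mismatch with core Perl only; @raw_names is a simulated directory listing, not real filesystem output, and whether Path::Tiny behaves this way on a given system is an assumption. It also quotes the name with \Q...\E, since the dot in 2.дом would otherwise match any character.

```perl
use strict;
use warnings;
use utf8;                          # the literals below are written in UTF-8
use Encode qw(decode encode);

# Character string, as produced by decoding the command-line argument.
my $from = '2.дом';

# Simulated directory listing (assumption): raw UTF-8 bytes, undecoded.
my @raw_names =
  map { encode( 'UTF-8', $_ ) } ( '2.дом1.pl', '2.дом1.pl.bak', '2.дом1.html' );

# Characters against bytes: nothing matches.
my @naive = grep { /^\Q$from\E(\d*)\.pl$/ } @raw_names;

# Option 1: encode the pattern text to bytes and match bytes against bytes.
my $from_bytes = encode( 'UTF-8', $from );
my @by_bytes   = grep { /^\Q$from_bytes\E(\d*)\.pl$/ } @raw_names;

# Option 2: decode the names and match characters against characters.
my @by_chars =
  grep { /^\Q$from\E(\d*)\.pl$/ } map { decode( 'UTF-8', $_ ) } @raw_names;

binmode STDOUT, ':encoding(UTF-8)';
print "naive matches: ",    scalar(@naive),    "\n";
print "byte matches: ",     scalar(@by_bytes), "\n";
print "char matches: @by_chars\n";
```

If the hypothesis holds, passing qr/\Q$from_bytes\E(\d*)\.pl$/i to children would be the smaller change, because the returned Path::Tiny objects would still carry the undecoded names the filesystem expects.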


In reply to regex help with unicode and [Path::Tiny] by Aldebaran
