
Re^2: how to unicode filenames?

by perl-diddler (Hermit)
on Jun 28, 2012 at 01:04 UTC


in reply to Re: how to unicode filenames?
in thread how to unicode filenames?

I'll certainly give it a try... if it performs well, I'd prefer it over my hack-around of calling 'find' (as perl doesn't have a problem with the filenames if they come in on STDIN)...

Do I want to 'decode' UTF-8? The utf8 page mentions something about using Encode -- basically I don't need it to be decoded so much as just "relabeled" in place as already being "UTF-8-ified"...
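A minimal sketch of the two options, assuming the name arrives as valid UTF-8 octets. The sample filename and the helper relabel_utf8 are hypothetical; utf8::decode and Encode::decode are the stock functions. utf8::decode is probably the closest thing to "relabeling in place", with the bonus that it reports failure on invalid input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Relabel" a copy of the octets as characters. utf8::decode works in place
# on its argument and returns false when the octets are not valid UTF-8.
sub relabel_utf8 {
    my ($octets) = @_;            # copy, so the caller's string is untouched
    return utf8::decode($octets) ? $octets : undef;
}

# Hypothetical filename as raw UTF-8 octets, as readdir() might return it.
my $raw = "\xCE\xBB.txt";         # the bytes for "λ.txt"

my $via_encode = decode('UTF-8', $raw);   # returns a new character string
my $via_utf8   = relabel_utf8($raw);      # same result, validated

printf "octets=%d chars=%d same=%s\n",
    length $raw, length $via_utf8, ($via_encode eq $via_utf8 ? 'yes' : 'no');
```

Either way the result is a character string, so length and regexes then count characters rather than bytes.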

Thanks again! (Wish I could give more than one positive vote to people who are really helpful -- considering the vast difference when compared to those who are just contrarian)...


Re^3: how to unicode filenames?
by zentara (Archbishop) on Jun 28, 2012 at 08:42 UTC
    Hi, I would like to share my Unicode battles with you, since it seems we are both fighting the same battle. After a few Unicode-related posts, yours being one of them, I decided to try to make a little utility I wrote, named vgrep, Unicode aware. It was quite a hit-or-miss transformation. See Gtk2 Visual Grep

    I had to add the -CS perlrun switch, use the unicode::all module, and even after all that, I still needed to use Encode::decode() in many places to get the desired output.

    Even though my linux filesystem locale is en_US.UTF-8 in my .bashrc, I still needed to run input strings and filenames thru decode. I'm using Perl 5.14.1.

    It works, but it definitely seems to my sensibilities that it should be simpler. I guess the problem comes from having many files and filenames coming in thru the net, and left over from previous Latin-1 linux installations, which are not UTF-8.

    The general rule I seem to be seeing is "treat all input as binary", then decode. My vgrep program still emits some errors when searching thru pdf files, which are detected as being -T text but contain binary images; and I don't understand why File::Find doesn't automatically see unicode filenames, without having to decode $File::Find::name.


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      File::Find does not automatically "see" (or return) unicode filenames, because for Perl there is no way to know that what the file system APIs return is UTF-8-encoded text. If you are certain that this is always the case, I guess you can put your own decode() wrapper around it, but I see it breaking in many situations where different filesystems with different filename encodings come together.
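A rough sketch of such a wrapper, assuming most names are UTF-8 but some may not be. decode_name and find_decoded are hypothetical helper names; File::Find::find, Encode::decode, FB_CROAK and LEAVE_SRC are the stock interfaces. Names that fail strict decoding are handed back as raw octets rather than mangled:

```perl
use strict;
use warnings;
use Encode ();
use File::Find ();

# Decode a filename if it is valid UTF-8; otherwise return the raw octets
# so names in other encodings (e.g. Latin-1 leftovers) are left alone.
sub decode_name {
    my ($octets) = @_;
    my $chars = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? $chars : $octets;
}

# Walk $dir and return every path found, decoded where possible.
sub find_decoded {
    my ($dir) = @_;
    my @paths;
    File::Find::find(
        {
            no_chdir => 1,
            wanted   => sub { push @paths, decode_name($File::Find::name) },
        },
        $dir,
    );
    return @paths;
}

# Only walk the filesystem when a directory is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for find_decoded($ARGV[0]);
}
```

The caller still cannot tell a decoded name from a raw one by looking at it, which is the mixed-encoding problem described above; a real tool would probably want to keep the original octets alongside the decoded form for opening the file.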

        Of course it does... ALL of the linux core utils know how -- Perl is just braindead by choice of its creators. cat, chmod, chown, chroot, cp, cut, dirname (all of the file name routines work with UTF-8); tac, wc, uniq, sort, sed, awk, grep. Perl's "correctness" used to be measured by whether or not it produced the same output as the core utilities that it was based on. Perl derived from those core utils -- and their behavior set the standard for how perl ran. Perl fails randomly and often on compatibility with the utils that it was designed to be a combination of. Simple word count program:
        #!/usr/bin/perl -w
        ## 'pwc'
        use 5.14.0;
        my ($l,$w,$c)=(0,0,0);
        while (<>) {
            ++$l;
            $c += length $_;
            while ( m{^\W*(\w+)(.*)$} ) {
                ++$w;
                $_=$2;
            }
        }
        printf "%d\t%d\t%d\n", $l, $w, $c;

        a text file:

        > file /tmp/txt
        /tmp/txt: UTF-8 Unicode text
        > wc -lwm /tmp/txt
        3 5 38 /tmp/txt
        > pwc /tmp/txt
        3 24 64
        --- (There are 5 words in /tmp/txt, but I can't post it here, as the BB software for PerlMonks, like perl, isn't UTF-8 safe/compatible.)
        It gets closer with an autosplit version:
        (from http://www.catonmat.net/download/perl1line.txt)

        # Find the total number of fields (words) on all lines

        > perl -alne '$t += @F; END { print $t}' /tmp/txt
        4
        (it only was off by 1)...
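A likely explanation for the gap between wc and the pwc script above is that pwc reads undecoded bytes, so length counts bytes and \w is tested against individual bytes of multi-byte characters. Note also that wc -w counts whitespace-separated words, not \w+ runs. A minimal sketch of a count that decodes first, assuming the input is UTF-8; count_lwm and the sample string are hypothetical, and the original /tmp/txt is not available to compare against:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count lines, whitespace-separated words, and characters in a string of
# UTF-8 octets, roughly the way `wc -lwm` reports them.
sub count_lwm {
    my ($octets) = @_;
    my $text  = decode('UTF-8', $octets);   # octets -> characters
    my $lines = () = $text =~ /\n/g;        # number of newlines
    my @words = split ' ', $text;           # split on runs of whitespace
    return ($lines, scalar @words, length $text);
}

# Hypothetical sample: "café naïve\n" as raw UTF-8 bytes.
my $sample = "caf\xC3\xA9 na\xC3\xAFve\n";
printf "%d\t%d\t%d\n", count_lwm($sample);
printf "bytes: %d\n", length $sample;
```

For the sample this reports 1 line, 2 words and 11 characters, against 13 bytes. The same effect on a file can probably be had by opening it with an ':encoding(UTF-8)' layer instead of decoding by hand.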

        I could spend weeks detailing all the broken semantics, but it would be a waste of my time... you just have to learn all the bugs in perl so you can work around them. (As stated in a previous post, people told me that labeling dysfunctional behavior was the sign of a bad craftsman, i.e. one who blames their tools... which is a meaningless statement, considering it is also said that a good craftsman knows their tools, which means 'characterizing their behavior'.)

        So the idea that it is "too hard" for perl to know how to correctly interpret text data is patently, and easily provably, false, as millions of other programs get it right. Perl's algorithms in this area are governed by ideologues who have beliefs about how the world should be run and enforce them on everyone else. There are multiple examples where they reduce choice -- take away choices from the users because the users are presumed to be too stupid to make their own decisions (yet these same people will complain when MS does similar).

        Perl could be a lot more intelligent in a lot of areas than it is -- in some cases it would involve not implementing code, but ***removing*** code that was added to deliberately limit perl's functionality or to cause erroneous behavior.

        But one can spend all their time pointing out the numerous flaws of the language, or attempt to work around them and get work done. The two are not completely mutually exclusive, but to some extent they are, as they draw on the same resource: time.

        Until those in charge allow change, it won't happen. And it is a matter of "allow" -- since one change that was asked for came down to "well, no one who is capable of making the change wants it enough to do it". The proponent of the idea asked, "If someone who was capable of making the change submitted a patch, does that imply there would be no problem adding it into the source base?"

        The conversation was terminated at that point as the question was not answerable with a simple yes/no.

      A quick suggestion: You have -CS, but for a 'find', you might want to evaluate if -CSA would be a better choice for such a program.
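For reference, perlrun documents the A in -CSA as telling perl to expect the elements of @ARGV to be UTF-8, on top of the S (STDIN, STDOUT, STDERR). A rough in-script equivalent, as a sketch only; decode_args is a hypothetical helper, and it should not be combined with -CA or the arguments would be decoded twice:

```perl
use strict;
use warnings;
use Encode ();

# Decode command-line arguments from UTF-8 octets into character strings,
# which is roughly what the "A" in -CSA asks perl to do for @ARGV.
sub decode_args {
    return map { Encode::decode('UTF-8', $_) } @_;
}

# The "S" part: treat STDIN, STDOUT and STDERR as UTF-8.
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;

my @args = decode_args(@ARGV);
printf "%s (%d characters)\n", $_, length $_ for @args;
```

For a find-like program this matters because a starting directory or pattern given on the command line would otherwise be compared as bytes against filenames that have been decoded to characters.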
