PerlMonks  

encoding of file names

by amir_e_a (Hermit)
on Mar 25, 2010 at 19:44 UTC (#830948, perlquestion)
amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with encoding of file names on Ubuntu.

I am using glob to get a list of file names that include a certain string, slurp each file's contents to a variable, remove the file's extension using s///, and then i am trying to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file's name without the extension. The file name and its contents include Hebrew characters; the content is utf8, but i am not sure about the file name.

The content comes out correctly at the target page, but the page title is gibberish. What can i do to make the file name proper utf8, like the file's content?

Here's the relevant code:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use open ':encoding(utf8)';
use utf8;
use English qw(-no_match_vars);
use Carp qw(croak cluck);
use MediaWiki::API;

my $INPUT_EXTENSION = 'wiki.txt';

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";
$mw->login(
    {
        lgname     => 'Amire80',
        lgpassword => 'secret80',    # not really
    }
) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};

my $page_prefix = 'User:Amire80';
my $dirname     = './out.he/';

# in the next line the word 'category' is actually supposed to be
# written in Hebrew characters, but this website doesn't seem
# to like it
my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";

foreach my $filename (@filenames) {
    my $pagename = $filename;
    $pagename =~ s/\A $dirname//xms;
    $pagename =~ s/\.$INPUT_EXTENSION \z//xms;
    $pagename = "$page_prefix/$pagename";
    say $pagename;

    my $ref = $mw->get_page({ title => $pagename });
    if ($ref->{missing}) {
        say "page $pagename is missing, trying to create";
    }
    my $timestamp = $ref->{timestamp};

    local $INPUT_RECORD_SEPARATOR;
    open my $file, '<', $filename
        or croak "Can't open $filename: $OS_ERROR";
    my $text = <$file>;
    close $file;

    $mw->edit(
        {
            action        => 'edit',
            title         => $pagename,
            summary       => 'cat 001',
            basetimestamp => $timestamp,    # to avoid edit conflicts
            text          => $text,
        },
        {
            skip_encoding => 1,
        }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
}

If i just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can i do with $pagename so it will be encoded the same way as $text?

Thanks in advance.

Version: Perl 5.10 on Ubuntu 9.10.

Re: encoding of file names
by almut (Canon) on Mar 25, 2010 at 20:14 UTC

    You could try to Encode::decode() $filename, e.g. from UTF-8, if you suspect that's how the names are stored in the filesystem.

      That's the thing - i would probably try using Encode, but i don't know the from encoding. The to encoding is supposed to be utf8.

      Or do you mean to say that UTF-8 and utf8 are different things?

        but i don't know the from encoding

        I think you can't do much harm by just trying 'UTF-8' as the from encoding :)

        $filename = decode("UTF-8", $filename);

        When your file names are in fact in UTF-8, things will likely work out fine. Otherwise, you'll know they aren't UTF-8 encoded, and you can try some other encoding...
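On the UTF-8 versus utf8 question asked above: in the Encode module the two names are not quite the same. "UTF-8" is the strict encoding and rejects malformed input, while "utf8" is Perl's more permissive variant. A minimal sketch of the "just try it" approach follows, using the strict name together with FB_CROAK so that a wrong guess fails loudly instead of silently producing replacement characters. The helper name decode_filename_utf8 and the sample byte strings are made up for illustration; they stand in for names returned by glob.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not from the original post): returns the decoded
# name, or undef when the bytes are not valid UTF-8.
sub decode_filename_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; eval turns that into undef.
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars;
}

# Stand-ins for file names as the filesystem would hand them over.
my $utf8_bytes   = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";   # a Hebrew word, UTF-8 encoded
my $latin1_bytes = "\xE9t\xE9.txt";                      # Latin-1 bytes, not valid UTF-8

my $decoded = decode_filename_utf8($utf8_bytes);
my $failed  = decode_filename_utf8($latin1_bytes);

print defined $decoded ? "first name decoded as UTF-8\n"  : "first name is not UTF-8\n";
print defined $failed  ? "second name decoded as UTF-8\n" : "second name is not UTF-8\n";
```

If the helper returns undef for real file names, that is the signal to try some other encoding.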

Re: encoding of file names
by ikegami (Pope) on Mar 25, 2010 at 20:17 UTC

    Perl treats file names as opaque strings of bytes*. In unix, they are usually characters encoded using the current locale, which in turn, is usually UTF-8.

    You need to decode the file names, and you'll be all set.

    * — This presents a problem on Windows which stores them as characters, but that's not relevant here.
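A rough sketch of what "decode the file names" could look like for the loop in the question. It assumes the names really are UTF-8 (the usual case described above); the helper pagename_from_filename is hypothetical and only covers the title-building step. The idea is that $text is already a string of characters because it was read through an :encoding(utf8) layer, so decoding the file name puts $pagename in the same form.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of the original script.
sub pagename_from_filename {
    my ($filename_bytes, $dirname, $extension, $page_prefix) = @_;

    # Turn the bytes that came from the filesystem into Perl characters.
    # ASSUMPTION: the names are stored as UTF-8.
    my $pagename = decode('UTF-8', $filename_bytes);

    # \Q...\E keeps the dots in the directory and extension literal.
    $pagename =~ s/\A\Q$dirname\E//;
    $pagename =~ s/\.\Q$extension\E\z//;

    return "$page_prefix/$pagename";
}

# Stand-in for one element of the glob result: a Hebrew name as UTF-8 bytes.
my $bytes    = "./out.he/\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D.wiki.txt";
my $pagename = pagename_from_filename($bytes, './out.he/', 'wiki.txt', 'User:Amire80');

binmode STDOUT, ':encoding(UTF-8)';
print "$pagename\n";
```

Note that the undecoded $filename should still be the one passed to open, since the filesystem expects the original bytes.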

      Can you please tell me how to decode them? I am not quite experienced with Encode. I still don't understand what to specify as the from encoding.

      And since you mention it, i actually am curious about Windows, because i plan to make this program portable and i already had similar problems on Windows in the past.

        I still don't understand what to specify as the from encoding.

        Usually, the file names are text encoded as per the locale's encoding. In fact, I dare say that's the expectation.

        Most users have a UTF-8 locale. You could assume UTF-8, and worry about it when someone complains.

        If you want to actually get the right encoding, your best bet is probably the following undocumented function:

        require encoding; # Or "use encoding ();" with the parens.
        my $locale_encoding = encoding::_get_locale_encoding();

        This is what core module open uses.
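For anyone wary of calling an undocumented function, one possible alternative (a swapped-in technique, not what the reply above uses) is the documented core module I18N::Langinfo, whose langinfo(CODESET) reports the codeset of the current locale. The sketch below assumes the process locale reflects the user's environment, and the helper encoding_for_codeset with its UTF-8 fallback is made up for illustration.

```perl
use 5.010;
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: map a codeset name to an Encode object,
# falling back to strict UTF-8 when Encode does not recognise the name.
sub encoding_for_codeset {
    my ($codeset) = @_;
    my $enc = (defined $codeset && length $codeset)
        ? find_encoding($codeset)
        : undef;
    return $enc // find_encoding('UTF-8');
}

# Codeset of the current locale, e.g. "UTF-8" on most modern Linux setups.
my $codeset  = langinfo(CODESET) // '';
my $encoding = encoding_for_codeset($codeset);

say "locale codeset: '$codeset', decoding file names with ", $encoding->name;

# A file name (bytes) would then be decoded like this:
# my $chars = $encoding->decode($filename_bytes);
```

The fallback follows the advice above: assume UTF-8 and worry about it when someone complains.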

        And since you mention it, i actually am curious about Windows

        It's a real mess. Bad support by builtins and by modules for accessing Windows's wide character interface. Bad support at finding the code page (last time I checked) of the single-byte interface (even though it's easier than locales in unix). Maybe some other time.
