PerlMonks  

Using literal Japanese filenames in legacy CP932 encoding with system(), etc.

by almut (Canon)
on Oct 24, 2006 at 16:13 UTC (#580313=perlquestion)
almut has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I've been put in charge of migrating a Japanese customer site with many Windows machines from an ancient version of Perl (at the moment they're using jperl v5.005_03 MSWin32-x86, SJIS version) to something current like v5.8.8. (As Perl now comes with comprehensive unicode support, IO filters and stuff, the jperl patch is no longer maintained, and, of course, doesn't apply to any recent version of Perl.)

The idea is to make the upgrade as smooth as possible. Over the years, lots of little jperl-specific scripts have accumulated at the site (several hundreds, the admins say...). So, ideally, they would not have to touch any of those, but rather just roll out the new version of Perl (plus some compatibility module), and everything should work as before. At least, that's the plan.

All the old scripts contain the statement "use I18N::Japanese;" (that's how the specific jperl functionality is enabled in the binary -- the .pm file just contains a "1;"), so I thought a new I18N/Japanese.pm would be the ideal place to put my compatibility code...

I figured it would essentially involve saying "use encoding 'cp932';" [1] (the old scripts are written in Microsoft's CP932, roughly equivalent to SJIS), to make Perl parse any literal strings, regexes, etc. in the script correctly and convert them to Perl's internal unicode format. So far, so good. Thing is, they have code like this [2]:

system("mkdir \"C:\\Documents and Settings\\All Users\\\x83\x58\x83\x5E\x81\x5B\x83\x67 \x83\x81\x83\x6A\x83\x85\x81\x5B\\\x83\x76\x83\x8D\x83\x4F\x83\x89\x83\x80\\\x91\xE3\x95\x5C\"");

This doesn't work, because the pathname being passed to system() now is in perl's internal unicode format, instead of the CP932 that the windows side expects. I'm not sure how to handle this best.

What I've come up with so far is to override/wrap Perl's internal system() function, in order to do the required conversion of the arguments explicitly:

use Encode "encode";

*CORE::GLOBAL::system = sub {
    # explicitly convert from Perl's internal unicode format
    # into legacy CP932 encoding
    my @args = map encode("cp932", $_), @_;
    CORE::system(@args);  # call original internal routine
};

Although this does work essentially, I can't help thinking this is way more cumbersome than things typically need to be in Perl. In particular, as I would have to write similar wrappers for all other functions that take a filename argument (mkdir(), chdir(), open(), opendir(), rename(), unlink(), glob() and friends...). This can't be it!? :)
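For illustration, here is a minimal sketch of how the same idea might extend to a few of the simpler filename-taking built-ins. The helper name cp932_args is made up for this example, and the BEGIN block is only needed because the sketch is a standalone script; in a module pulled in via "use", the assignments already run at compile time of the calling script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: convert every argument from Perl's internal
# character string to CP932 octets, as the system() wrapper does.
sub cp932_args { return map { encode('cp932', $_) } @_ }

BEGIN {
    # Overrides must be in place before the code that calls these
    # built-ins is compiled. Each built-in gets its own closure.
    no warnings 'once';
    *CORE::GLOBAL::chdir  = sub { my ($dir) = cp932_args(@_); CORE::chdir($dir) };
    *CORE::GLOBAL::rmdir  = sub { my ($dir) = cp932_args(@_); CORE::rmdir($dir) };
    *CORE::GLOBAL::unlink = sub { my @files = cp932_args(@_); CORE::unlink(@files) };
    *CORE::GLOBAL::rename = sub {
        my ($old, $new) = cp932_args(@_);
        CORE::rename($old, $new);
    };
    *CORE::GLOBAL::mkdir  = sub {
        my ($dir, $mode) = @_;
        ($dir) = cp932_args($dir);      # encode the path only, not the mode
        return defined $mode ? CORE::mkdir($dir, $mode) : CORE::mkdir($dir);
    };
}
```

This sketch does not handle the no-argument forms that default to $_, and open(), opendir() and glob() would need wrappers of their own.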

So, I'm wondering if I'm missing that magic incantation which would somehow convert all filenames to the desired target encoding when passing them to the respective system functions... IOW, what's the best way to emulate the old jperl behaviour with recent versions of Perl? (As I understand things (correct me if I'm wrong), this worked with jperl, because it kept strings internally in SJIS/CP932, and directly operated on this legacy encoding.)

Any suggestions welcome.

Thanks,
Almut

__________

[1] Actually, in this case, I have to write "require encoding; encoding->import('cp932');" (to avoid the implicit BEGIN{} block). In Perl-5.8.8, "use encoding 'cp932'" seems to be lexically scoped (contrary to what the documentation says). So, putting it in a module wouldn't have any effect on the code that's "use"ing that module.

[2] To circumvent the automatic html-entity-ification of the SJIS 8-bit octets (which would render the code unusable for anyone who'd like to play around with this), I wrote them as hex values -- in the real scripts they are of course raw SJIS 8-bit values.

Just in case, here's the same string as unicode codepoints:

system("mkdir \"C:\\Documents and Settings\\All Users\\\x{30B9}\x{30BF}\x{30FC}\x{30C8} \x{30E1}\x{30CB}\x{30E5}\x{30FC}\\\x{30D7}\x{30ED}\x{30B0}\x{30E9}\x{30E0}\\\x{4EE3}\x{8868}\"");
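To make the relation between the two spellings concrete, the following small sketch uses the core Encode module to convert the first path component in both directions. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# CP932 octets for the first path component, as in the hex-escaped example.
my $cp932_bytes = "\x83\x58\x83\x5E\x81\x5B\x83\x67";

# The same four characters written as Unicode code points.
my $characters = "\x{30B9}\x{30BF}\x{30FC}\x{30C8}";

# decode() turns legacy octets into Perl's internal character string ...
my $decoded = decode('cp932', $cp932_bytes);

# ... and encode() produces the octets that have to reach the operating system.
my $encoded = encode('cp932', $characters);

# prints "4 characters, 8 octets"
printf "%d characters, %d octets\n", length($decoded), length($encoded);
```

The character string and the octet string hold the same text but are different values as far as Perl is concerned, which is why the argument needs an explicit encode() before it is handed to system().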

( And, if your browser is able to display the respective unicode entities, that's what the SJIS part looks like:
"...\\スタート メニュー\\プログラム\\代表" )

Replies are listed 'Best First'.
Re: Using literal Japanese filenames in legacy CP932 encoding with system(), etc.
by graff (Chancellor) on Oct 24, 2006 at 22:44 UTC
    I would have to write similar wrappers for all other functions that take a filename argument (mkdir(), chdir(), open(), opendir(), rename(), unlink(), glob() and friends...). This can't be it!? :)

    I'm probably not the right person to answer this (so I hope someone else comes up with a better answer...), but this sort of question has come up before (with a reprise), and it seemed to me that the way to go there was to have a module that provides replacements for the built-in functions that take and return file names.

    Your case takes the issue to a deeper level than the other monk, who just needed to handle windows/greek file names from a cgi script. In order to get a wide assortment of existing scripts to work with Perl 5.8.x, the kind of module you've started with is probably the way I would go -- just expand it to cover the other built-ins as needed.

    Anything "cleaner" would probably take you into hacking the source and building your own patched version of perl, which is probably a less attractive solution overall.

    (But wait... Maybe there's a compiler flag that can be set to handle this? I haven't looked at that, but it might be worth looking at even though it sounds like a long-shot... Still, maybe your site isn't eager to be put into the position of having to build their perl installation from sources with customizations?)

    Presumably, you'll also need to deal with making sure that all file handles are set to ":encoding(cp932)" as well; that would just be included in the replacement function for "open", as well as adding "binmode STDIN,...; binmode STDOUT,...; binmode STDERR,...;" as an executable part of your new version of I18N/Japanese.pm. If/when your people start handling utf8 content in data files, this might get tricky; good luck with that...
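As a rough sketch of the filehandle side, the snippet below writes a string through an ":encoding(cp932)" layer, then reads the file back both raw and through the layer. The helper set_std_encoding is a made-up name showing what a compatibility module might call at load time; it is defined but not called here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper a compatibility module might call once at load time.
sub set_std_encoding {
    my ($enc) = @_;
    binmode $_, ":encoding($enc)" for \*STDIN, \*STDOUT, \*STDERR;
}

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $text = "\x{30E1}\x{30CB}\x{30E5}\x{30FC}";    # "メニュー"

# Writing through the layer converts characters to CP932 octets.
open my $out, '>:encoding(cp932)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Reading the file raw shows the octets actually stored on disk.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $octets = do { local $/; <$raw> };
close $raw;

# Reading through the layer converts them back to characters.
open my $in, '<:encoding(cp932)', $path or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;
```

Note that the layer only affects the data flowing through the handle; the file name passed to open() is a separate matter and still needs its own conversion.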

      ...it seemed to me that the way to go there was to have a module that provides replacements for the built-in functions that take and return file names.

      Thanks for your response, and for the pointer to the archived thread. All in all, it makes me feel reassured I'm not entirely on the wrong track (actually, I was hoping I might be ;) and that there in fact is an easier solution). No big issue though, I'll just get on with implementing the wrappers. Any tips from more experienced monks on what subtle pitfalls to avoid along these lines? I figure I should be on the safe side when simply limiting my encoding conversions to scalar arguments (specifically thinking of built-ins with multiple prototypes, like open()...).
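A minimal sketch of the "scalars only" rule for the three-argument form of open(), assuming a plain wrapper function rather than a CORE::GLOBAL override. The name open_cp932 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper for the three-argument form of open().
# Only a plain scalar target is treated as a file name and encoded;
# a reference (for example an in-memory file) is passed through as is,
# because encode() would stringify it.
sub open_cp932 {
    my ($mode, $target) = @_;
    $target = encode('cp932', $target) if defined $target && !ref $target;
    open(my $fh, $mode, $target) or return;
    return $fh;
}

# In-memory file: the scalar reference must reach open() untouched.
my $buffer = '';
my $mem = open_cp932('>', \$buffer) or die "in-memory open failed: $!";
print {$mem} "hello";
close $mem;
```

The two-argument form, where mode and name share one string, and the list form of pipe opens are not covered by this sketch.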

      Anything "cleaner" would probably take you into hacking the source and building your own patched version of perl, which is probably a less attractive solution overall.

      (But wait... Maybe there's a compiler flag that can be set to handle this? I haven't looked at that, but it might be worth looking at even though it sounds like a long-shot... Still, maybe your site isn't eager to be put into the position of having to build their perl installation from sources with customizations?)

      Right. I have to admit I haven't really looked into this yet, so I'll do some more RTFM and/or take a peek into the sources. If anyone has any promising hints, I'd of course appreciate it. A simple compile-time switch would be absolutely OK (I'll be building / providing their Perl packages anyway). But I'd rather not do any further patching -- just afraid that too many hacks are going to be a nightmare to maintain in the long run... (which might get them into a similar situation to what they're in now, in another five to ten years).

      Presumably, you'll also need to deal with making sure that all file handles are set to ":encoding(cp932)" as well; ...

      I already have the stub code in place for these filehandle related things. On the Perl side everything is working beautifully -- thanks to the nice and clean design of Perl IO layers (Kudos to everyone who's been working on this, BTW). What's giving me more of a headache here is that it seems I have to switch between various encodings, depending on what specific version of Windows I'm on, etc. The site is rather heterogeneous, and I guess I don't have to tell you that "no Windows is like any other" -- at least when you get down to the nitty-gritties (e.g. one dumps the registry in ucs-2le, the other in legacy encoding, ...) Moreover, I'm no Windows expert -- just dabbling in largely unknown territory :)
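One way to cope with dumps that arrive in different encodings is to sniff the input before decoding it. The sketch below assumes only two cases, UTF-16LE marked by a leading byte order mark, and CP932 otherwise; whether the real dumps carry a byte order mark is an assumption that would have to be checked on the machines in question. The name decode_dump is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode a dump that is either UTF-16LE with a
# byte order mark or plain CP932. Assumes those are the only two cases.
sub decode_dump {
    my ($octets) = @_;
    if ($octets =~ s/\A\xFF\xFE//) {          # UTF-16LE byte order mark
        return decode('UTF-16LE', $octets);
    }
    return decode('cp932', $octets);          # fall back to the legacy encoding
}

# Sample inputs for both cases, built from the same two characters.
my $text   = "\x{4EE3}\x{8868}";              # "代表"
my $utf16  = "\xFF\xFE" . encode('UTF-16LE', $text);
my $legacy = encode('cp932', $text);
```

Since 0xFF is not a valid lead byte in CP932, the byte order mark check should not misfire on legacy input.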

      Anyway, thanks again,
      Almut
