Hi All,

I've been put in charge of migrating a Japanese customer site with many Windows machines from an ancient version of Perl (at the moment they're using jperl v5.005_03 MSWin32-x86, SJIS version) to something current like v5.8.8. (As Perl now comes with comprehensive Unicode support, IO filters and so on, the jperl patch is no longer maintained and, of course, doesn't apply to any recent version of Perl.)

The idea is to make the upgrade as smooth as possible. Over the years, lots of little jperl-specific scripts have accumulated at the site (several hundreds, the admins say...). So, ideally, they would not have to touch any of those, but rather just roll out the new version of Perl (plus some compatibility module), and everything should work as before. At least, that's the plan.

All the old scripts contain the statement "use I18N::Japanese;" (that's how the specific jperl functionality is enabled in the binary -- the .pm file just contains a "1;"), so I thought a new I18N/Japanese.pm would be the ideal place to put my compatibility code...

I figured it would essentially involve saying "use encoding 'cp932';"¹ (the old scripts are written in Microsoft's CP932, roughly equivalent to SJIS) to make Perl parse any literal strings, regexes, etc. in the script correctly and convert them to Perl's internal Unicode format. So far, so good. The thing is, they have code like this²:

system("mkdir \"C:\\Documents and Settings\\All Users\\\x83\x58\x83\x5E\x81\x5B\x83\x67 \x83\x81\x83\x6A\x83\x85\x81\x5B\\\x83\x76\x83\x8D\x83\x4F\x83\x89\x83\x80\\\x91\xE3\x95\x5C\"");

This doesn't work, because the pathname passed to system() is now in Perl's internal Unicode format instead of the CP932 that the Windows side expects. I'm not sure how best to handle this.
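To make the mismatch concrete, here is a minimal sketch using only the last path component ("代表", whose octets and code points are both given in this post). It assumes nothing beyond the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "代表" as raw CP932 octets, the way the legacy scripts store it.
my $octets = "\x91\xE3\x95\x5C";

# What Perl holds internally once the literal has been decoded from CP932:
# a string of characters, no longer the original octets.
my $chars = decode('cp932', $octets);

printf "%d octets vs. %d characters\n", length($octets), length($chars);

# Encoding back to CP932 restores the octets that the Windows side expects.
my $restored = encode('cp932', $chars);
print $restored eq $octets ? "round trip ok\n" : "round trip failed\n";
```

The decoded string is what ends up being handed to system(), which is why an explicit encode step is needed somewhere before the call.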

What I've come up with so far is to override/wrap Perl's internal system() function, in order to do the required conversion of the arguments explicitly:

use Encode "encode";

*CORE::GLOBAL::system = sub {
    # explicitly convert from Perl's internal unicode format
    # into legacy CP932 encoding
    my @args = map encode("cp932", $_), @_;
    CORE::system(@args);  # call original internal routine
};

Although this essentially works, I can't help thinking it is far more cumbersome than things typically need to be in Perl. In particular, I would have to write similar wrappers for all the other functions that take a filename argument (mkdir(), chdir(), open(), opendir(), rename(), unlink(), glob() and friends...). This can't be it!? :)
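For what it's worth, the repetition could at least be reduced by generating the wrappers from a table. The following is only a sketch under assumptions: the helper name to_cp932 and the selection of functions are made up for illustration, and open(), opendir() and glob() are left out because their filehandle arguments and calling conventions need extra care.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode each argument from Perl's internal
# format into CP932 octets.
sub to_cp932 { return map { encode('cp932', $_) } @_ }

BEGIN {
    # CORE::GLOBAL overrides only affect code compiled after they are
    # installed, hence the BEGIN block.
    my %wrapper = (
        chdir  => sub { my ($dir) = to_cp932(@_); CORE::chdir($dir) },
        rmdir  => sub { my ($dir) = to_cp932(@_); CORE::rmdir($dir) },
        unlink => sub { CORE::unlink(to_cp932(@_)) },
        rename => sub {
            my ($old, $new) = to_cp932(@_);
            CORE::rename($old, $new);
        },
        mkdir  => sub {
            my ($dir, $mode) = @_;
            ($dir) = to_cp932($dir);    # only the path is encoded, not the mode
            defined $mode ? CORE::mkdir($dir, $mode) : CORE::mkdir($dir);
        },
    );

    no strict 'refs';
    *{"CORE::GLOBAL::$_"} = $wrapper{$_} for keys %wrapper;
}
```

Placed in I18N/Japanese.pm, this would be picked up at compile time by any script that says "use I18N::Japanese;". One caveat: the wrappers carry no prototypes, so calls written without parentheses may parse differently than they do against the built-ins, and that would need checking against the real scripts.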

So, I'm wondering if I'm missing that magic incantation which would somehow convert all filenames to the desired target encoding when passing them to the respective system functions... IOW, what's the best way to emulate the old jperl behaviour with recent versions of Perl? (As I understand things, and correct me if I'm wrong, this worked with jperl because it kept strings internally in SJIS/CP932 and operated directly on this legacy encoding.)

Any suggestions welcome.

Thanks,
Almut

__________

¹ Actually, in this case I have to write "require encoding; encoding->import('cp932');" (to avoid the implicit BEGIN{} block). In Perl 5.8.8, "use encoding 'cp932'" seems to be lexically scoped (contrary to what the documentation says), so putting it in a module wouldn't have any effect on the code that "use"s that module.

² To keep the SJIS 8-bit octets from being automatically converted to HTML entities (which would render the code unusable for anyone who'd like to play around with this), I wrote them as hex values -- in the real scripts they are of course raw SJIS 8-bit values.

Just in case, here's the same string as Unicode code points:

system("mkdir \"C:\\Documents and Settings\\All Users\\\x{30B9}\x{30BF}\x{30FC}\x{30C8} \x{30E1}\x{30CB}\x{30E5}\x{30FC}\\\x{30D7}\x{30ED}\x{30B0}\x{30E9}\x{30E0}\\\x{4EE3}\x{8868}\"");

(And, if your browser is able to display the respective Unicode entities, this is what the SJIS part looks like:
"...\\スタートメニュー\\プログラム\\代表")

In reply to Using literal Japanese filenames in legacy CP932 encoding with system(), etc. by almut
