http://www.perlmonks.org?node_id=1113229

aksjain has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I am pretty new to Perl and, as part of my first Perl script, I need to extract files from a downloaded zip. The important part here is that the files in the zip have names in Chinese or Japanese characters. I tried the Archive::Extract and Archive::Zip modules but could not figure out how to achieve it.

$filename = "somefile.zip";
$dest_dir = "c:/somedir";
my $zip = Archive::Zip->new();
local $Archive::Zip::UNICODE = 1;
unless ( $zip->read($filename) == AZ_OK ) {
    die "Error Reading Zip File !";
}
my @members = $zip->members();
foreach (@members) {
    $zip->extractMemberWithoutPaths( $_, "$dest_dir\\$_" );
}

I also tried Archive::Extract, with the following code.

my $ae = Archive::Extract->new(archive => $filename);
my $ok = $ae->extract(to => $dest_dir) or die $ae->error;

Everything works perfectly as long as my zip file has files named with English characters, but things get worse when the names are Chinese or Japanese. Any help with this would be really appreciated.

Re: Seeking help with Extracting files from zip
by jonadab (Parson) on Jan 14, 2015 at 14:16 UTC
    Everything works perfectly as long as my zip file has files named with Latin characters, but things get worse when the names are Chinese or Japanese.

    If you can answer a couple of questions, it may give us the information that would allow us to actually help you...

    • When you say "things get worse", what does that mean, exactly? Do you get an error message? Does extractMemberWithoutPaths return an error code? Are the files written at all? Do their filenames get mangled? What do the resulting mangled filenames look like? The phrase "get worse" is kind of vague, so I'm not really sure what's going wrong, and without knowing what's going wrong, it's hard to know how to fix it.
    • Can you, via some other means (say, using a file manager, or on the command line) create files in the location where you are trying to extract these, with CJK characters in their filenames? Not all filesystems support such things, and so without knowing what kind of filesystem your storage device is formatted with, we can't know for sure that it's even theoretically possible for such filenames to be created. Do you know what kind of filesystem it is? ext3? NTFS? HFS+? FAT32? Something else? (If you don't know the answer to this, just telling us what operating system you're using and whether you're saving on your computer's main hard drive or to a USB flash drive or some other location could provide clues.) Update: I just noticed the "c:/somedir" in your code, which I suspect narrows things down a little. NTFS *ought* to be able to handle CJK filenames, I think, although depending on what version of Windows you have it might require that the relevant language options be installed, in the Language thingydoo in the control panel. If you're using a really old Windows (95/98/Me) or for some other reason are using FAT32, then I'm less sure.

    Oh, one other thing: the following code works for me (Perl 5.10.1, debian oldstable amd64):

    nathan@warthog:~/test2/extract$ ls
    somefile.zip
    nathan@warthog:~/test2/extract$ perl -e '
    $filename = "somefile.zip";
    $dest_dir = "/home/nathan/test2/extract";
    use Archive::Zip;
    my $zip = Archive::Zip->new();
    local $Archive::Zip::UNICODE = 1;
    unless ( $zip->read($filename) == AZ_OK ) {
        die "Error Reading Zip File !";
    }
    foreach my $m ($zip->members()) {
        print "Member $m:\n ";
        my $err = $zip->extractMemberWithoutPaths( $m, "$dest_dir/" . $m->fileName);
        print "Error: $err" if $err;
        print $/;
    }'
    Member Archive::Zip::ZipFileMember=HASH(0xdfdd30):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe2b8):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe5a0):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe888):
    Member Archive::Zip::ZipFileMember=HASH(0xdfeb98):
    Member Archive::Zip::ZipFileMember=HASH(0xdfee80):
    nathan@warthog:~/test2/extract$ ls
    한국어  somefile.zip  ગુજરાતી  ಕನ್ನಡ  বাংলা  中文  日本語
    nathan@warthog:~/test2/extract$
    (Perlmonks seems unable or perhaps unwilling to handle most of those characters -- and if unwilling I can't blame them; this is by design an English-language venue -- but they display just fine on my terminal when I do the ls. Of course, I created my somefile.zip using the zip program that comes with Debian; yours may have been created using different software...)

      Thanks for your reply. By worse I mean the filename characters get mangled. I tried extracting the same zip file using Windows tools like WinRAR, and it extracts the files with the proper names, as it should. I am using Windows 7 and have the Japanese and Chinese language packs installed on the machine. Below is a link to an image that shows the difference in the name of the folder.

      http://s4.postimg.org/bnphbww59/Japanese.png

        Ok, so I assume the katakana filename there is what it's supposed to look like, and the gibberish filename with roughly twice as many characters, most of which look like they came from the miscellaneous-symbols-and-accented-characters section of an eight-bit character set, is the result of running your code?

        This definitely looks like a charset translation issue. The Archive::Zip documentation indicates that setting UNICODE causes the filenames in the archive to be treated as UTF8. Perhaps they're not? Maybe they're UTF16 or UTF32 or some other Unicode encoding (or, heaven help you, some pre-Unicode Asian encoding like Shift-JIS or whatnot)? If you can figure out what fiddling needs to be done to preserve the encoding, you can pass the correct filename to extractMemberWithoutPaths and that should probably work, I think...
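To make the "fiddling" above a bit more concrete, here is a minimal sketch of one way to guess the encoding of a raw member name: try strict UTF-8 first and, if that fails, fall back to a legacy code page. The helper name `decode_member_name` and the default of cp932 (the Shift-JIS variant used on Japanese Windows) are assumptions for illustration, not something Archive::Zip provides. It is only a heuristic, since some legacy byte sequences happen to be valid UTF-8 too.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of Archive::Zip): turn the raw bytes of a
# zip member name into a Perl character string.
sub decode_member_name {
    my ($raw, $fallback) = @_;
    $fallback = 'cp932' unless defined $fallback;   # assumed legacy code page

    # Strict UTF-8 first: FB_CROAK makes decode() die on malformed input,
    # LEAVE_SRC keeps $raw untouched.
    my $name = eval { decode('UTF-8', $raw, FB_CROAK | LEAVE_SRC) };
    return $name if defined $name;

    # Not valid UTF-8, so assume the legacy code page.
    return decode($fallback, $raw);
}

# The same three-character name stored as cp932 bytes and as UTF-8 bytes
my $sjis     = "\x93\xFA\x96\x7B\x8C\xEA";
my $utf8     = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
my $expected = "\x{65E5}\x{672C}\x{8A9E}";

print decode_member_name($sjis) eq $expected ? "cp932 ok\n" : "cp932 mismatch\n";
print decode_member_name($utf8) eq $expected ? "utf8 ok\n"  : "utf8 mismatch\n";
```

With $Archive::Zip::UNICODE left off, the bytes returned by `$m->fileName` could be fed to such a helper. The decoded name would still have to be written out in whatever form the filesystem API expects, which on Windows is a separate problem.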

        Unfortunately, I don't know that much about the details of the character sets involved, but maybe someone else will come along now and be able to recognize what's going on. (Even just being able to recognize which encoding is being erroneously treated as though it were some other encoding would go a long way toward figuring out the problem.) That image you provided should help.

      Thanks a lot, jonadab. The solution you suggested worked seamlessly. Thanks a lot.

      Can you please help me with another question asked on http://www.perlmonks.org/?node_id=1113737 ??

Re: Seeking help with Extracting files from zip
by pmqs (Friar) on Jan 14, 2015 at 16:02 UTC

    Were the zip files created using Windows 7 Compressed Folders?

    If they were, the filenames will not be stored in Unicode in the zip files (see this link for the gory details). I *think* they will use whatever code page is active on your Windows setup.

    To tell for sure can you run the zipdetails script that comes with perl against the zip file you used for the screenshot & post the output?

      Hahaha, I just noticed this ... 7zip handles/makes the names utf8, whereas windows does codepage nonsense
Re: Seeking help with Extracting files from zip ( Win32::Unicode )
by Anonymous Monk on Jan 14, 2015 at 23:36 UTC
    If you want any kind of unicode filenames on windows, you need Win32::Unicode
    { use Win32::Unicode qw/ -native /; open my($fh), '>:raw', $unicodename; ... }

      Seems to work for me. Naturally it doesn't chmod/umask, is not thoroughly tested, assumes utf8 (no easy flag I could see that signals utf8), and it's a monkeypatch: unzipwin32unicode.pl

        How I created the test directory, later zipped with 7zip, makekebabs.pl, its meat
        kebabing the ћевап.txt
        kebabing the ranjić.txt
        kebabing the ćevap.txt
        kebabing the кебапче.txt
        kebabing the kebab.txt
        
Re: Seeking help with Extracting files from zip
by Anonymous Monk on Jan 14, 2015 at 18:09 UTC
    What do members look like? (when you print them out to a file)
      Here is an example of the output I get with a zip file that uses Unicode properly. The important thing to look for is the presence of the "Language Encoding" bit.
      0000 LOCAL HEADER #1       04034B50
      0004 Extract Zip Spec      0A '1.0'
      0005 Extract OS            00 'MS-DOS'
      0006 General Purpose Flag  0800
           [Bit 11]              1 'Language Encoding'
      0008 Compression Method    0000 'Stored'
      000A Last Mod Time         3EB3B54C 'Thu May 19 22:42:24 2011'
      000E CRC                   00000000
      0012 Compressed Length     00000000
      0016 Uncompressed Length   00000000
      001A Filename Length       0009
      001C Extra Length          001C
      001E Filename              'tmp/PĀé'
      0027 Extra ID #0001        5455 'UT: Extended Timestamp'
      0029   Length              0009
      002B   Flags               '03 mod access'
      002C   Mod Time            4DD58EBF 'Thu May 19 22:42:23 2011'
      0030   Access Time         4DD59079 'Thu May 19 22:49:45 2011'
      0034 Extra ID #0002        7875 'ux: Unix Extra Type 3'
      0036   Length              000B
      0038   Version             01
      0039   UID Size            04
      003A   UID                 000003E8
      003E   GID Size            04
      003F   GID                 000003E8

      0043 CENTRAL HEADER #1     02014B50
      0047 Created Zip Spec      1E '3.0'
      0048 Created OS            03 'Unix'
      0049 Extract Zip Spec      0A '1.0'
      004A Extract OS            00 'MS-DOS'
      004B General Purpose Flag  0800
           [Bit 11]              1 'Language Encoding'
      004D Compression Method    0000 'Stored'
      004F Last Mod Time         3EB3B54C 'Thu May 19 22:42:24 2011'
      0053 CRC                   00000000
      0057 Compressed Length     00000000
      005B Uncompressed Length   00000000
      005F Filename Length       0009
      0061 Extra Length          0018
      0063 Comment Length        0000
      0065 Disk Start            0000
      0067 Int File Attributes   0000
           [Bit 0]               0 'Binary Data'
      0069 Ext File Attributes   81A40000
      006D Local Header Offset   00000000
      0071 Filename              'tmp/PĀé'
      007A Extra ID #0001        5455 'UT: Extended Timestamp'
      007C   Length              0005
      007E   Flags               '03 mod access'
      007F   Mod Time            4DD58EBF 'Thu May 19 22:42:23 2011'
      0083 Extra ID #0002        7875 'ux: Unix Extra Type 3'
      0085   Length              000B
      0087   Version             01
      0088   UID Size            04
      0089   UID                 000003E8
      008D   GID Size            04
      008E   GID                 000003E8

      0092 END CENTRAL HEADER    06054B50
      0096 Number of this disk   0000
      0098 Central Dir Disk no   0000
      009A Entries in this disk  0001
      009C Total Entries         0001
      009E Size of Central Dir   0000004F
      00A2 Offset to Central Dir 00000043
      00A6 Comment Length        0000
      Done
        Actually, if you run zipdetails in verbose mode you get a hex dump of what is actually stored in the zip file. The "-v" option enables verbose mode, as below.
        $ zipdetails -v abc.zip
        0000 0004 50 4B 03 04 LOCAL HEADER #1       04034B50
        0004 0001 0A          Extract Zip Spec      0A '1.0'
        0005 0001 00          Extract OS            00 'MS-DOS'
        0006 0002 00 08       General Purpose Flag  0800
                              [Bit 11]              1 'Language Encoding'
        0008 0002 00 00       Compression Method    0000 'Stored'
        000A 0004 4C B5 B3 3E Last Mod Time         3EB3B54C 'Thu May 19 22:42:24 2011'
        000E 0004 00 00 00 00 CRC                   00000000
        0012 0004 00 00 00 00 Compressed Length     00000000
        0016 0004 00 00 00 00 Uncompressed Length   00000000
        001A 0002 09 00       Filename Length       0009
        001C 0002 1C 00       Extra Length          001C
        001E 0009 74 6D 70 2F Filename              'tmp/PĀé'
                  50 C4 80 C3
                  A9
        0027 0002 55 54       Extra ID #0001        5455 'UT: Extended Timestamp'
        0029 0002 09 00         Length              0009
        002B 0001 03            Flags               '03 mod access'
        002C 0004 BF 8E D5 4D   Mod Time            4DD58EBF 'Thu May 19 22:42:23 2011'
        0030 0004 79 90 D5 4D   Access Time         4DD59079 'Thu May 19 22:49:45 2011'
        0034 0002 75 78       Extra ID #0002        7875 'ux: Unix Extra Type 3'
        0036 0002 0B 00         Length              000B
        0038 0001 01            Version             01
        0039 0001 04            UID Size            04
        003A 0004 E8 03 00 00   UID                 000003E8
        003E 0001 04            GID Size            04
        003F 0004 E8 03 00 00   GID                 000003E8

        0043 0004 50 4B 01 02 CENTRAL HEADER #1     02014B50
        0047 0001 1E          Created Zip Spec      1E '3.0'
        0048 0001 03          Created OS            03 'Unix'
        0049 0001 0A          Extract Zip Spec      0A '1.0'
        004A 0001 00          Extract OS            00 'MS-DOS'
        004B 0002 00 08       General Purpose Flag  0800
                              [Bit 11]              1 'Language Encoding'
        004D 0002 00 00       Compression Method    0000 'Stored'
        004F 0004 4C B5 B3 3E Last Mod Time         3EB3B54C 'Thu May 19 22:42:24 2011'
        0053 0004 00 00 00 00 CRC                   00000000
        0057 0004 00 00 00 00 Compressed Length     00000000
        005B 0004 00 00 00 00 Uncompressed Length   00000000
        005F 0002 09 00       Filename Length       0009
        0061 0002 18 00       Extra Length          0018
        0063 0002 00 00       Comment Length        0000
        0065 0002 00 00       Disk Start            0000
        0067 0002 00 00       Int File Attributes   0000
                              [Bit 0]               0 'Binary Data'
        0069 0004 00 00 A4 81 Ext File Attributes   81A40000
        006D 0004 00 00 00 00 Local Header Offset   00000000
        0071 0009 74 6D 70 2F Filename              'tmp/PĀé'
                  50 C4 80 C3
                  A9
        007A 0002 55 54       Extra ID #0001        5455 'UT: Extended Timestamp'
        007C 0002 05 00         Length              0005
        007E 0001 03            Flags               '03 mod access'
        007F 0004 BF 8E D5 4D   Mod Time            4DD58EBF 'Thu May 19 22:42:23 2011'
        0083 0002 75 78       Extra ID #0002        7875 'ux: Unix Extra Type 3'
        0085 0002 0B 00         Length              000B
        0087 0001 01            Version             01
        0088 0001 04            UID Size            04
        0089 0004 E8 03 00 00   UID                 000003E8
        008D 0001 04            GID Size            04
        008E 0004 E8 03 00 00   GID                 000003E8

        0092 0004 50 4B 05 06 END CENTRAL HEADER    06054B50
        0096 0002 00 00       Number of this disk   0000
        0098 0002 00 00       Central Dir Disk no   0000
        009A 0002 01 00       Entries in this disk  0001
        009C 0002 01 00       Total Entries         0001
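For anyone who wants to check the "Language Encoding" bit without reading a dump by eye, here is a rough sketch of the test in Perl. It only looks at the first eight bytes of a local file header, using the byte values shown at offset 0000 in the verbose dump. The helper name `header_has_utf8_flag` is made up for this example, and the second sample header (with the flag word zeroed) is invented purely for contrast.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first 8 bytes of a local file header,
# report whether general purpose bit 11 ("Language Encoding") is set.
sub header_has_utf8_flag {
    my ($bytes) = @_;

    # V = 32-bit little-endian signature, v = 16-bit little-endian fields
    my ($sig, $version, $flags) = unpack 'V v v', $bytes;
    die "not a local file header\n"
        unless defined $sig && $sig == 0x04034B50;

    # Bit 11 corresponds to the mask 0x0800
    return ($flags & 0x0800) ? 1 : 0;
}

# Bytes from the start of the dump: 50 4B 03 04 0A 00 00 08
my $with_flag    = pack 'C*', 0x50, 0x4B, 0x03, 0x04, 0x0A, 0x00, 0x00, 0x08;
# Same header with the flag word zeroed (invented for contrast)
my $without_flag = pack 'C*', 0x50, 0x4B, 0x03, 0x04, 0x0A, 0x00, 0x00, 0x00;

print "flag set: ",   header_has_utf8_flag($with_flag),    "\n";
print "flag clear: ", header_has_utf8_flag($without_flag), "\n";
```

If I remember the Archive::Zip documentation correctly, each member also has a bitFlag() accessor that should return the same flag word, so the `& 0x0800` test could be applied to that instead of reading bytes by hand. Treat that as an assumption to verify against the module's documentation.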