PerlMonks  

Re: Seeking help with Extracting files from zip

by jonadab (Parson)
on Jan 14, 2015 at 14:16 UTC ( #1113230=note )


in reply to Seeking help with Extracting files from zip

Everything works perfectly as long as my zip file has files named with Latin characters, but things get worse when the names are Chinese or Japanese.

If you can answer a couple of questions, it may give us the information that would allow us to actually help you...

  • When you say "things get worse", what does that mean, exactly? Do you get an error message? Does extractMemberWithoutPaths return an error code? Are the files written at all? Do their filenames get mangled? What do the resulting mangled filenames look like? The phrase "get worse" is kind of vague, so I'm not really sure what's going wrong, and without knowing what's going wrong, it's hard to know how to fix it.
  • Can you, via some other means (say, using a file manager, or on the command line) create files in the location where you are trying to extract these, with CJK characters in their filenames? Not all filesystems support such things, and so without knowing what kind of filesystem your storage device is formatted with, we can't know for sure that it's even theoretically possible for such filenames to be created. Do you know what kind of filesystem it is? ext3? NTFS? HFS+? FAT32? Something else? (If you don't know the answer to this, just telling us what operating system you're using and whether you're saving on your computer's main hard drive or to a USB flash drive or some other location could provide clues.) Update: I just noticed the "c:/somedir" in your code, which I suspect narrows things down a little. NTFS *ought* to be able to handle CJK filenames, I think, although depending on what version of Windows you have it might require that the relevant language options be installed, in the Language thingydoo in the control panel. If you're using a really old Windows (95/98/Me) or for some other reason are using FAT32, then I'm less sure.
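The second question can also be checked from Perl itself. What follows is only a minimal sketch, assuming a filesystem that accepts UTF-8 encoded bytes as filenames (typical on Linux). The helper name can_create_name is made up for illustration. On Windows, Perl's built-in open goes through the byte-oriented file API as far as I know, so a failure there may say more about Perl than about NTFS.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: try to create a file with the given (character) name
# in $dir, then confirm that a directory listing returns the same name.
sub can_create_name {
    my ($dir, $name) = @_;
    my $bytes = encode('UTF-8', $name);   # assumption: the filesystem takes UTF-8 bytes
    my $path  = "$dir/$bytes";

    open(my $fh, '>', $path) or return 0;
    close $fh;

    opendir(my $dh, $dir) or return 0;
    my @found = grep { $_ eq $bytes } readdir $dh;
    closedir $dh;

    unlink $path;
    return scalar @found;                 # 1 if the name survived the round trip
}

my $dir  = tempdir(CLEANUP => 1);         # scratch directory, removed at exit
my $name = "\x{65E5}\x{672C}\x{8A9E}.txt";   # 日本語.txt
print can_create_name($dir, $name) ? "created and listed\n" : "failed\n";
```

To test the real target location, pass that directory instead of the temporary one.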

Oh, one other thing: the following code works for me (Perl 5.10.1, debian oldstable amd64):

    nathan@warthog:~/test2/extract$ ls
    somefile.zip
    nathan@warthog:~/test2/extract$ perl -e '
      $filename = "somefile.zip";
      $dest_dir = "/home/nathan/test2/extract";
      use Archive::Zip qw(:ERROR_CODES);   # exports AZ_OK
      my $zip = Archive::Zip->new();
      local $Archive::Zip::UNICODE = 1;    # treat member names as UTF-8
      unless ( $zip->read($filename) == AZ_OK ) {
        die "Error Reading Zip File !";
      }
      foreach my $m ($zip->members()) {
        print "Member $m:\n ";
        # write each member into $dest_dir under its own file name
        my $err = $zip->extractMemberWithoutPaths(
          $m, "$dest_dir/" . $m->fileName);
        print "Error: $err" if $err;
        print $/;
      }'
    Member Archive::Zip::ZipFileMember=HASH(0xdfdd30):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe2b8):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe5a0):
    Member Archive::Zip::ZipFileMember=HASH(0xdfe888):
    Member Archive::Zip::ZipFileMember=HASH(0xdfeb98):
    Member Archive::Zip::ZipFileMember=HASH(0xdfee80):
    nathan@warthog:~/test2/extract$ ls
    한국어  somefile.zip  ગુજરાતી  ಕನ್ನಡ  বাংলা  中文  日本語
    nathan@warthog:~/test2/extract$
(Perlmonks seems unable or perhaps unwilling to handle most of those characters -- and if unwilling I can't blame them; this is by design an English-language venue -- but they display just fine on my terminal when I do the ls. Of course, I created my somefile.zip using the zip program that comes with Debian; yours may have been created using different software...)

Re^2: Seeking help with Extracting files from zip
by aksjain (Acolyte) on Jan 14, 2015 at 14:37 UTC

    Thanks for your reply. By "worse" I mean the filename characters get mangled. I tried extracting the same zip file using Windows tools like WinRAR, and it extracts the files with the proper names, as it should. I am using Windows 7 and have the Japanese and Chinese language packs installed on the machine. Below is a link to an image which shows the difference in the name of the folder.

    http://s4.postimg.org/bnphbww59/Japanese.png

      Ok, so I assume the katakana filename there is what it's supposed to look like, and the gibberish filename with more than twice as many characters, most of which look like they came from the miscellaneous-symbols-and-accented-characters section of an eight-bit character set, is the result of running your code?

      This definitely looks like a charset translation issue. The Archive::Zip documentation indicates that setting $Archive::Zip::UNICODE causes the filenames in the archive to be treated as UTF-8. Perhaps they're not? Maybe they're UTF-16 or UTF-32 or some other Unicode encoding (or, heaven help you, some pre-Unicode Asian encoding like Shift-JIS or whatnot)? If you can figure out what conversion is needed to recover the real filename, you can pass the correct filename to extractMemberWithoutPaths, and that should probably work, I think...
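One way to approach the "figure out the encoding" step is to try a few candidate encodings strictly and keep the first one that decodes without error. This is only a sketch: decode_member_name is a made-up helper, not part of Archive::Zip, and the candidate list (UTF-8, cp932 for Shift-JIS, cp936 for GBK) is an assumption about what a Japanese or Chinese archive might contain. It also assumes you can get at the raw, undecoded name bytes, for example from fileName with $Archive::Zip::UNICODE left unset.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return (decoded_name, encoding) for the first
# candidate encoding that decodes $raw without any malformed sequences,
# or an empty list if none of them does.
sub decode_member_name {
    my ($raw, @candidates) = @_;
    @candidates = ('UTF-8', 'cp932', 'cp936') unless @candidates;   # assumed guesses

    for my $enc (@candidates) {
        my $name = eval {
            # FB_CROAK makes decode die on invalid input;
            # LEAVE_SRC keeps $raw intact for the next attempt.
            Encode::decode($enc, $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        return ($name, $enc) if defined $name;
    }
    return;
}

# Demo with bytes built by hand: "テスト.txt" stored as cp932 (Shift-JIS).
my $raw = Encode::encode('cp932', "\x{30C6}\x{30B9}\x{30C8}.txt");
my ($name, $enc) = decode_member_name($raw);

binmode(STDOUT, ':encoding(UTF-8)');
print defined $name ? "decoded as $enc: $name\n" : "no candidate matched\n";
```

Order matters here, since byte strings that are valid in one double-byte encoding are often valid in another, so treat the result as a guess to check by eye. The decoded name would then still need to be encoded the way your filesystem expects before it goes into the second argument of extractMemberWithoutPaths.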

      Unfortunately, I don't know that much about the details of the character sets involved, but maybe someone else will come along now and be able to recognize what's going on. (Even just being able to recognize which encoding is being erroneously treated as though it were some other encoding would go a long way toward figuring out the problem.) That image you provided should help.
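If it helps with the recognizing part, the garbling can be reproduced on purpose and compared against the screenshot. A rough sketch, assuming the mangled name came from bytes in one encoding being displayed as a single-byte Western code page (cp1252 here is a guess, and misread is a made-up helper name):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: what does $text look like when its $actual-encoded
# bytes are wrongly decoded as $assumed?
sub misread {
    my ($text, $actual, $assumed) = @_;
    return decode($assumed, encode($actual, $text));
}

binmode(STDOUT, ':encoding(UTF-8)');
my $name = "\x{30C6}\x{30B9}\x{30C8}";   # three katakana: テスト

for my $pair (['UTF-8', 'cp1252'], ['cp932', 'cp1252']) {
    my ($actual, $assumed) = @$pair;
    my $garbled = misread($name, $actual, $assumed);
    printf "%-5s bytes read as %s: %d chars: %s\n",
        $actual, $assumed, length($garbled), $garbled;
}
```

The length ratio can be a hint: katakana take three bytes each in UTF-8 and two in Shift-JIS, so a garbled name about three times as long as the real one points toward UTF-8 bytes being shown through a single-byte code page, and about twice as long points toward a double-byte encoding.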

Re^2: Seeking help with Extracting files from zip
by aksjain (Acolyte) on Jan 19, 2015 at 11:25 UTC

    Thanks a lot, jonadab. The solution you suggested worked seamlessly. Thanks a lot.

    Can you please help me with another question, asked at http://www.perlmonks.org/?node_id=1113737?
