http://www.perlmonks.org?node_id=526169

I had a Win32 Perl coding challenge to search a directory(WinXP, NTFS) that may contain files with unicode / wide filenames in addition to ANSI filenames. Through trial, error, and searches against Perl mailing list archives, it seems apparent that Win32 Perl's builtin directory functions do not support return Win32 unicode / wide filenames. More specifically, the builtin functions return filenames like "??????_HostID_2006-01-19_213218.xls" when the filename contains unicode / wide characters (same behavior as the DOS "dir" command). The problem is that you can't pass these filenames to something like stat or via OLE to ask Excel to open it. After much trial and error, I found a solution using Win32::API, the FindFirstFileW, FindNextFileW, and FindClose Win32 API calls -- with an "unpack" workaround suggested by $Bill Luebkert.
use strict; use Win32::API; use Unicode::String; use English; $OUTPUT_AUTOFLUSH=1; $Win32::API::DEBUG = 0; binmode(STDOUT, ":utf8"); use constant ERROR_NO_MORE_FILES => 18; use constant INVALID_HANDLE_VALUE => -1; Win32::API::Struct-> typedef('FILETIME', qw( DWORD dwLowDateTime; DWORD dwHighDateTime; )); # 8 bytes use constant FILE_ATTRIBUTE_READONLY => 0x00000001; use constant FILE_ATTRIBUTE_HIDDEN => 0x00000002; use constant FILE_ATTRIBUTE_SYSTEM => 0x00000004; use constant FILE_ATTRIBUTE_DIRECTORY => 0x00000010; use constant FILE_ATTRIBUTE_ARCHIVE => 0x00000020; use constant FILE_ATTRIBUTE_NORMAL => 0x00000080; use constant FILE_ATTRIBUTE_TEMPORARY => 0x00000100; use constant FILE_ATTRIBUTE_COMPRESSED => 0x00000800; use constant MAX_PATH => 260; Win32::API::Struct-> typedef('WIN32_FIND_DATAW', qw( DWORD dwFileAttributes; FILETIME ftCreationTime; FILETIME ftLastAccessTime; FILETIME ftLastWriteTime; DWORD nFileSizeHigh; DWORD nFileSizeLow; DWORD dwReserved0; DWORD dwReserved1; WCHAR cFileName[520]; WCHAR cAlternateFileName[28]; )); # 4 + 8 x 3 + 4 x 4 + 520 + 28 = 592 bytes # Note: Win32::API::Struct->Typedef should seemingly # allocate 2 bytes per WCHAR, the sizeof('WCHAR'), but it # actually allocates only 1 byte -- same as TCHAR my $FindFirstFile = Win32::API->new('kernel32.dll', 'FindFirstFileW', +'PS', 'N') or die "FindFirstFile: $^E"; my $FindNextFile = Win32::API->new('kernel32.dll', 'FindNextFileW', ' +NS', 'I') or die "FindNextFile $^E"; my $FindClose = Win32::API->new('kernel32.dll', 'FindClose', 'N', +'I') or die "FileClose $^E"; # set your own value here... my $FileSpec = "//?/C:/My Documents/Tool/*.xls"; my $FileInfo = Win32::API::Struct-> new('WIN32_FIND_DATAW'); my $uFileSpec = Unicode::String->new; $uFileSpec->utf8($FileSpec); print "FileSpec = ", $uFileSpec->as_string, "\n"; my $handle = $FindFirstFile-> Call($uFileSpec->utf16le, $FileInfo); if ($handle == INVALID_HANDLE_VALUE) { printf "Error is %d - %s\n", Win32::GetLastError (), Win32::FormatMessage (Win32::GetLastError ()); exit(1); } else { print "FindFirstFile worked\n"; my $count = 1; my $ufn = Unicode::String->new; my $ualtfn = Unicode::String->new; # to get the filename in unicode UTF-16LE format, must # unpack the $FileInfo 'buffer' hash element because # Win32 API doesn't recognize the cFileName hash # element as an UTF-16LE format character string # It treats it as an ASCII Z / null terminated string my ($cFileName, $cAlternateFileName) = unpack( "x44A520A28", $File +Info->{buffer} ); # if length is odd, pad filenames with null byte as utf16le char s +tring byte length must be even # occurs when last UTF-16LE char of filename is ANSI (e.g., "S") a +nd unpack A template strips off following null byte # e.g., "S\0" if (length($cFileName) & 1) { $cFileName .= "\x00"; } if (length($cAlternateFileName) & 1) { $cAlternateFileName .= "\x00"; } $ufn->utf16le($cFileName); $ualtfn->utf16le($cAlternateFileName); print "($count) filename = ", $ufn->as_string, "\n"; print "\talt = ", $ualtfn->as_string, "\n"; while (my $result = $FindNextFile->Call($handle,$FileInfo)) { $count++; ($cFileName, $cAlternateFileName) = unpack( "x44A520A28", $Fil +eInfo->{buffer} ); if (length($cFileName) & 1) { $cFileName .= "\x00"; } if (length($cAlternateFileName) & 1) { $cAlternateFileName .= "\x00"; } $ufn->utf16le($cFileName); $ualtfn->utf16le($cAlternateFileName); print "($count) filename = ", $ufn->as_string, "\n"; print "\talt = ", $ualtfn->as_string, "\n"; } } $FindClose->Call($handle) or die "FindClose $^E"; exit(0);

Replies are listed 'Best First'.
Re: Win32 API directory searches that return wide / unicode filenames
by rhesa (Vicar) on Jan 28, 2006 at 15:50 UTC
    I was under the impression that you could make perl use the wide system calls by supplying the -C switch. However, perlrun.pod says:
    (In Perls earlier than 5.8.1 the -C switch was a Win32-only switch that enabled the use of Unicode-aware "wide system call" Win32 APIs. This feature was practically unused, however, and the command line switch was therefore "recycled".)
    I actually had been using that switch in the past. Pity.
Re: Win32 API directory searches that return wide / unicode filenames
by dsully (Acolyte) on Mar 03, 2006 at 17:01 UTC
    Last December, I brought this problem up on p5p:

    http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2005-12/msg00259.html

    No one had any good solution for it.

    Ideally, perl/win32.* will be updated by someone (or by me) to support the Wide functionality. It's rather a huge shortcoming, when standard tools like File::Find and opendir() won't work at all. It's hard to write cross platform code that way when Win32 _Perl_ is fundamentally broken wrt Unicode filenames.

      Ideally, perl/win32.* will be updated by someone (or by me) to support the Wide functionality.

      (Some/All?) the code to support this option is still there from when the -C command line option enabled the use of the wide apis on win32.

      # Win32.c:775-782 /* do the FindFirstFile call */ if (USING_WIDE()) { A2WHELPER(scanname, wbuffer, sizeof(wbuffer)); fh = FindFirstFileW(PerlDir_mapW(wbuffer), &wFindData); } else { fh = FindFirstFileA(PerlDir_mapA(scanname), &aFindData); } # Win32.c:858-870 if (USING_WIDE()) { res = FindNextFileW(dirp->handle, &wFindData); if (res) { W2AHELPER(wFindData.cFileName, buffer, sizeof(buffer)); ptr = buffer; } } else { res = FindNextFileA(dirp->handle, &aFindData); if (res) ptr = aFindData.cFileName; }

      All that is needed is the addition of code to enable this at runtime in perl.c/S_parse_body() where it presumably used to live. I can't see any signs of it there now.

      Of course, it might be much harder to persuade the powers that be to allocate a new switch letter for the purpose. Maybe additional values could be added to the new interpretation of the -C switch (See perlrun).

      Say 'W'/128 for Wide calls. Ie. perl -CW script.pl would do what -C used to do?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        FYI - all the if (USING_WIDE()) calls were removed by a patch from Jan back in November in bleadperl.

        So that's really not the route to go.

        I'm not entirely sure why the A functions are the default instead of W.

Re: Win32 API directory searches that return wide / unicode filenames
by perlslicker (Initiate) on Apr 01, 2006 at 03:46 UTC
    How do I use this code in conjunction with MP3::Info to get the tags of mp3 files with unicode filenames?