Handling windoze filenames with odd characters

by cormanaz (Deacon)
on Feb 20, 2011 at 17:51 UTC (id://889206)

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Bros. I am working on a script to rename some pdf files on a windoze machine. Someone has generated a spreadsheet with the name of a subfolder the file is in, the new filename it's supposed to get, and the existing filename.

I started by writing a script to stat the existing files like so

use strict;
my $workdir = "C:/work";
open(IN,"pdffix.txt") or die "Can't open input: $!\n";
while(<IN>) {
    chomp;
    my ($folder,$newfn,$pdfn) = split(/\t/);
    $pdfn =~ s/^\"|\"$//g;
    my $filename = "$workdir/$folder/$pdfn";
    unless (-e $filename) {
        print "can't find $folder $pdfn\n";
    }
}

This works for most of the pdfs, but others have filenames with odd characters (commas, hyphens, colons, parens, underscores; part of the reason they need renaming), and aren't being recognized by stat even though they exist in the directory.

Presumably I won't be able to rename a file I can't stat, so how do I properly specify these odd filenames?



Replies are listed 'Best First'.
Re: Handling windoze filenames with odd characters
by ciderpunx (Vicar) on Feb 20, 2011 at 18:16 UTC
    Hey Steve. So your spreadsheet is a TSV file? Can you print out the filename just before you test for its existence, to show us exactly what the script is checking for? I.e.:
    print "Looking for : '$filename'\n";
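
A slightly more revealing variant of that debug print, as a rough sketch: it makes spaces, tabs and non-ASCII characters visible, so a trailing blank or a substituted character shows up in the output. The helper name show_chars and the sample path are made up for illustration; in the original script the value would be $filename.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string so that invisible or unexpected
# characters stand out. Printable ASCII other than space is kept as-is;
# everything else (space, control characters, non-ASCII) becomes <XX> hex.
sub show_chars {
    my ($s) = @_;
    return join '', map {
        my $c = ord $_;
        ($c > 0x20 && $c < 0x7F) ? $_ : sprintf('<%02X>', $c);
    } split //, $s;
}

# Illustrative value only; note the trailing blank.
my $filename = "C:/work/reports/summary.pdf ";
print "Looking for : '", show_chars($filename), "'\n";
```

Every space is shown as <20>, which is noisy for names that legitimately contain blanks, but it makes a stray one at the end hard to miss.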

Re: Handling windoze filenames with odd characters
by chrestomanci (Priest) on Feb 20, 2011 at 18:29 UTC

    I would be very surprised if stat or rename has problems with files that contain odd characters. I think it is much more likely that you have a bug in your script, so the filename you are passing to stat is not what you think.

    Have you checked that the problem files actually exist? It is possible that the person who prepared the spreadsheet typed in the filenames by hand and made mistakes. It is also possible that MS Excel's autocorrect feature changed some characters, for example by changing a plain hyphen (ASCII 0x2D: -) into an em dash (Unicode U+2014: —).
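
If autocorrect substitutions like that are suspected, one possible check is to map the usual "smart" punctuation back to plain ASCII and see whether the result differs from what the spreadsheet gave. This is only a sketch: the helper name ascii_punct is invented here, the list of characters is a guess at common substitutions rather than anything Excel documents, and it assumes the input has already been decoded into Perl characters (for example by reading the file through an :encoding(...) layer).

```perl
use strict;
use warnings;

# Hypothetical helper: replace common typographic substitutions with
# their plain ASCII counterparts. Assumes $s holds decoded characters,
# not raw UTF-8 bytes.
sub ascii_punct {
    my ($s) = @_;
    $s =~ s/[\x{2013}\x{2014}]/-/g;   # en dash, em dash -> hyphen-minus
    $s =~ s/[\x{2018}\x{2019}]/'/g;   # curly single quotes -> '
    $s =~ s/[\x{201C}\x{201D}]/"/g;   # curly double quotes -> "
    $s =~ s/\x{00A0}/ /g;             # no-break space -> ordinary space
    return $s;
}

# Illustrative value only: a name containing an em dash.
my $from_sheet = "Report \x{2014} final.pdf";
my $candidate  = ascii_punct($from_sheet);
print "spreadsheet name was altered; try: $candidate\n"
    if $candidate ne $from_sheet;
```

If the ASCII version exists on disk and the original does not, the spreadsheet is the likely culprit rather than stat.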

    Also when you write the substitution: $pdfn =~ s/^\"|\"$//g; I presume that you are looking to remove quotes from the start or the end of the string. I think you need to enclose the ^\" and \"$ clauses in round brackets in order to use the alternation operator, as otherwise it might ignore the anchors on the start and end of the string. In other words the regexp engine will treat that substitution as: /^((\")|(\"))$/ and remove quotes from any part of the $pdfn string. I would write the substitution as: $pdfn =~ s/^\"?(.*)\"?$/$1/g;

        OK. Colour me surprised.

        When I wrote that I would be surprised if stat or rename had problems with files that contain odd characters, I was actually thinking of characters in the ASCII character set, not Unicode. However, I am surprised and disappointed that Perl cannot transparently handle Unicode in filenames.

        Perl has for many years transparently handled Unicode in string variables. There are of course many pitfalls in constructing those strings from data external to the script, but in this case that should not be the programmer's problem. Perl's readdir should just make the appropriate Windows system calls to get the Unicode filename, and store that filename, complete with any Unicode, in an internal string.

        The programmer should then be able to read and write files with those names without worrying whether they contain Unicode or not. Obviously, if the programmer is transforming filenames then they need to be careful, but in many cases that is not an issue. It is far more common to open and read a file than it is to rename one.

        I think that it is a mistake in 2011 for Perl to deliberately use the old Win9x API to get an ASCII filename for backwards-compatibility reasons, when the last Win9x OS was retired many years ago.

Re: Handling windoze filenames with odd characters
by cormanaz (Deacon) on Feb 20, 2011 at 21:17 UTC
    Good advice. When I printed out the wins as well as the fails I found that the wins have odd characters too. So it's not that. Also was able to see some cases where there were trailing blanks after the filenames on the spreadsheet and fixing that helped a little. In other cases it's just manual error typing the spreadsheet, mostly random extra blanks.
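
For the stray and trailing blanks described above, a sketch along these lines might help find the intended file automatically. The helper names squeeze and near_matches are invented for this example, and the list of existing names is hard-coded here; in a real run it would presumably come from opendir/readdir on the subfolder. Comparing in lower case is an assumption based on Windows filenames being case-insensitive.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading/trailing whitespace and collapse
# internal runs of whitespace to a single space.
sub squeeze {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    $s =~ s/\s+/ /g;
    return $s;
}

# Hypothetical helper: return the existing names that equal the wanted
# name once both are squeezed and lower-cased.
sub near_matches {
    my ($wanted, @existing) = @_;
    my $key = lc squeeze($wanted);
    return grep { lc squeeze($_) eq $key } @existing;
}

# Illustrative data only: the spreadsheet entry has a doubled blank
# and a trailing blank.
my @on_disk = ('Annual Report (2010).pdf', 'notes.pdf');
my @hits    = near_matches('Annual  Report (2010).pdf ', @on_disk);
print "possible match: $_\n" for @hits;
```

Anything with exactly one hit could be renamed using the name as it appears on disk; zero or several hits still need a human look.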

    chrestomanci, I see your point about the alternation operator, but it seems to work as intended the way it is.
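
The two substitutions discussed in this thread can be compared side by side with a small test. In Perl regexes, alternation binds more loosely than anchors and literals, so ^\"|\"$ reads as "a quote at the start, or a quote at the end", which matches the observation that it works as intended. The wrapper names strip_alt and strip_capture and the sample strings are made up for this sketch.

```perl
use strict;
use warnings;

# The substitution from the original script.
sub strip_alt {
    my ($s) = @_;
    $s =~ s/^\"|\"$//g;
    return $s;
}

# The alternative proposed in the reply.
sub strip_capture {
    my ($s) = @_;
    $s =~ s/^\"?(.*)\"?$/$1/g;
    return $s;
}

# Illustrative inputs: one quoted name with inner quotes, one plain name.
for my $in ('"a "quoted" name.pdf"', 'plain.pdf') {
    printf "%-24s alt: %-22s capture: %s\n",
        $in, strip_alt($in), strip_capture($in);
}
```

On the quoted input, the original form removes only the outer quotes and leaves the inner ones alone. The capture form appears to leave the trailing quote in place, because the greedy (.*) swallows it before the optional \"? gets a chance to match.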

