PerlMonks  

tr{}{} doesn't wanna work.. what am I doing wrong?

by ultranerds (Pilgrim)
on Feb 24, 2012 at 13:05 UTC ( #955912=perlquestion )
ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have this code in an upload script:
    print STDERR "Filename 1: $filename \n";
    my $test = $filename =~ tr{&[]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûù?!;«»()" íóñÑáéóúÁÉÍÓÚ}{___aaaaaacceeeeeeeeiiiiiioooooouuuuuu_________ionnaeouaeiou};
    print STDERR "FOO REPLACE: $test \n";
    print STDERR "Filename 2: $filename \n";

When I run this in upload.cgi, it comes up as:

    Filename 1: regular-expressions-cheat-sheet -v2.pdf
    FOO REPLACE: 1
    Filename 2: regular-expressions-cheat-sheet_-v2.pdf

Notice the space got replaced with _, but that's it. Yet my test script, run with Strawberry Perl on my PC:

    my $test = "tegular-expressions-cheat-sheet_-v2.pdf";
    $test =~ tr{&[]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûù?!;«»()" íóñÑáéóúÁÉÍÓÚ}{___aaaaaacceeeeeeeeiiiiiioooooouuuuuu_________ionnaeouaeiou};
    print "now: $test \n";

...gives me (as I would expect):

    C:\Users\Andy>perl test2.pl
    now: tegular-expressions-cheat-sheet_aie-v2.pdf

Can anyone suggest what I may be doing wrong? It's driving me up the wall!

TIA

Andy

Re: tr{}{} doesn't wanna work.. what am I doing wrong?
by JavaFan (Canon) on Feb 24, 2012 at 13:20 UTC
    Encoding? How's your $filename encoded, and how's the encoding of your source code? Try upgrading and/or downgrading your $filename.
      Hi,

      Thanks, that's what I was thinking... here is what I have at the moment:

          my $uploaddir = $CFG->{build_static_path} . '/ajax_upload/tmp_uploads';

          use Links::Plugins;
          my $PCFG = Links::Plugins::get_plugin_user_cfg ('ImageUpload');

          my $maxFileSize = $PCFG->{max_file_size} * 1024 * 1024; # 1/2mb max file size...
          my $num_max_files = $PCFG->{MaxNumberImages};
          my $image_field = $PCFG->{ImageField};

          use CGI;
          my $cgi = new CGI;
          my $queueID = $cgi->param('queueID');

          use Data::Dumper;

          print $IN->header;

          # Where the data gets passed too...
          my $handle = $cgi->param('Filedata');
          my $temp_id = $cgi->param('temp_id');
          my $filename = $cgi->param('Filename');
          my $file = $cgi->param('Filedata');

          my $count = $DB->table("AJAXFileUploads")->count( { temp_id => $temp_id } ) || 0;

          # print "TempID : $temp_id ";
          if ($DB->table("AJAXFileUploads")->count( { temp_id => $temp_id, upload_filename => $filename } ) > 0) {
              print $IN->header;
              print qq|This file already seems to have been uploaded...|;
              exit;
          }

          print STDERR qq|$count > $num_max_files \n|;
          if ($count >= $num_max_files) {
              print $IN->header;
              print qq|Sorry, you can only upload $num_max_files ...|;
              exit;
          }

          print STDERR "Filename: $filename \n";
          print STDERR "Temp ID: $temp_id \n";

          print STDERR "Filename 1: $filename \n";
          my $test = $filename =~ tr{&[]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûù?!;«»()" íóñÑáéóúÁÉÍÓÚ}{___aaaaaacceeeeeeeeiiiiiioooooouuuuuu_________ionnaeouaeiou};
          print STDERR "FOO REPLACE: $test \n";
          print STDERR "Filename 2: $filename \n";

          my $count = $DB->table("AJAXFileUploads")->count( { temp_id => $temp_id } ) || 0;

          if ($DB->table("AJAXFileUploads")->count( { temp_id => $temp_id, upload_filename => $filename } ) > 0) {
              print $IN->header;
              print qq|This file already seems to have been uploaded...|;
              exit;
          }

          print STDERR qq|$count > $num_max_files \n|;
          if ($count >= $num_max_files) {
              print $IN->header;
              print qq|Sorry, you can only upload $num_max_files ...|;
              exit;
          }

          my @tmp = split /\./, $filename;
          if ($tmp[$#tmp] !~ /docx?|ppt|pps|pptx|ppsx|pdf/i) {
              print qq|Invalid file type...|;
              print STDERR "file has been NOT been uploaded... \n";
              exit;
          }

          print STDERR "Making dir: $uploaddir/$temp_id \n";
          mkdir("$uploaddir/$temp_id");

          open(WRITEIT, ">$uploaddir/$temp_id/$filename") or die "Cant write to $uploaddir/$filename. Reason: $!";
          while (<$handle>) {
              print WRITEIT $_;
          }
          close(WRITEIT);

          my $check_size = -s "$uploaddir/$temp_id/$filename";

          print STDERR qq|Main filesize: $check_size Max Filesize: $maxFileSize \n\n|;

          if ($check_size < 1) {
              print $IN->header();
              print STDERR "ooops, its empty - gonna get rid of it!\n";
              print qq|File is empty...|;
              print STDERR "file has been NOT been uploaded... \n";
          } elsif ($check_size > $maxFileSize) {
              print $IN->header();
              print STDERR "ooops, its too large - gonna get rid of it!\n";
              print qq|File is too large...|;
              print STDERR "file has been NOT been uploaded... \n";
          } else {
              print $IN->header();
              print "1";
              print STDERR "file has been successfully uploaded... thank you.\n";
              $DB->table("AJAXFileUploads")->add( {
                  temp_id => $temp_id,
                  upload_filename => $filename,
                  timestamp => time(),
                  rand_id => $queueID
              } ) || die $GT::SQL::error;
          }

      Not too sure what you mean by upgrading/downgrading?

      TIA!

      Andy
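
On the upgrading/downgrading question: as far as I can tell, this refers to utf8::upgrade and utf8::downgrade, which change how Perl stores a string internally without changing the characters in it. A minimal sketch, using a made-up file name (the variable names are just for illustration):

```perl
use strict;
use warnings;

# Hypothetical file name: "cafe" with an e-acute, stored as the single character U+00E9.
my $name = "caf\xE9";

# Upgrading switches Perl's internal storage to UTF-8; the characters stay the same.
my $upgraded = $name;
utf8::upgrade($upgraded);

# Downgrading switches back to one byte per character
# (only possible while every character is below 0x100).
my $downgraded = $upgraded;
utf8::downgrade($downgraded);

printf "upgraded:   flag=%d length=%d equal=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded), ($upgraded eq $name ? 1 : 0);
printf "downgraded: flag=%d length=%d equal=%d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0), length($downgraded), ($downgraded eq $name ? 1 : 0);
```

Both copies compare equal to the original and have length 4; only the internal flag differs. If tr behaved differently on the two copies, that would point at the internal representation. If it behaves the same, the more likely culprit is input that was never decoded.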
Re: tr{}{} doesn't wanna work.. what am I doing wrong?
by moritz (Cardinal) on Feb 24, 2012 at 13:31 UTC
    In my experience, tr and Unicode don't mix well. Here's my approach (source code stored in UTF-8 encoding):
        use strict;
        use warnings;
        use utf8;
        use 5.010;
        use Unicode::Normalize qw/NFKD/;
        binmode STDOUT, ':encoding(UTF-8)';

        sub frob {
            my $str = NFKD(shift);
            $str =~ s/\pM//g;
            $str =~ s/[^a-z-A-z0-9]/_/g;
            $str;
        }

        my $test = '&[]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûù?!;«»()" íóñÑáéóúÁÉÍÓÚ';
        say frob $test;
        __END__
        []AAAaaaCcEEEEeeeeIIIiiiOOOoooUUUuuu_________ionNaeouAEIOU

    Update: Since several people misunderstood me, I feel I should clarify. I wrote that in my experience, Unicode and tr/// don't mix. Which is to say that tr/// isn't buggy, but I haven't encountered any code in the wild that correctly handles Unicode strings with tr///, because tr wasn't designed with Unicode in mind.
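
To make the NFKD step a bit more concrete, here is a small sketch of what happens to a single character. The helper name strip_marks is made up for this example; the two calls it relies on (NFKD from Unicode::Normalize and the \pM property) are the same ones used in the code above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose, then drop all combining marks.
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFKD($str);
    $decomposed =~ s/\pM//g;
    return $decomposed;
}

my $e_acute  = "\x{E9}";   # e with acute accent
my $o_stroke = "\x{D8}";   # O with stroke

# NFKD splits the accented letter into a base letter plus U+0301 (combining acute).
printf "NFKD length of e-acute: %d\n", length(NFKD($e_acute));
printf "e-acute stripped: %s\n", strip_marks($e_acute);

# O with stroke has no decomposition, so this step leaves it untouched.
printf "O-stroke unchanged: %s\n", (strip_marks($o_stroke) eq $o_stroke ? "yes" : "no");
```

This is why the approach handles accents that nobody thought to list, and also why characters without a decomposition still need the final "replace everything else" substitution.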

      Hi, Thanks for your suggestion :) For some reason it gives me a weird output, compared to yours?
          C:\Users\Andy>perl test2.pl
          Malformed UTF-8 character (unexpected non-continuation byte 0xc2, immediately after start byte 0xc0) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc4, immediately after start byte 0xc2) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe0, immediately after start byte 0xc4) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe2, immediately after start byte 0xe0) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe4, immediately after start byte 0xe2) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc7, immediately after start byte 0xe4) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe7, immediately after start byte 0xc7) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc9, immediately after start byte 0xe7) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xca, immediately after start byte 0xc9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc8, immediately after start byte 0xca) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xcb, immediately after start byte 0xc8) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xcb) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xea, immediately after start byte 0xe9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe8, immediately after start byte 0xea) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xeb, immediately after start byte 0xe8) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xcf, immediately after start byte 0xeb) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xcc, immediately after start byte 0xcf) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xce, immediately after start byte 0xcc) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xef, immediately after start byte 0xce) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xec, immediately after start byte 0xef) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xee, immediately after start byte 0xec) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xee) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd4, immediately after start byte 0xd6) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd2, immediately after start byte 0xd4) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf6, immediately after start byte 0xd2) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf6) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf2, immediately after start byte 0xf4) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xf2) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xdb, immediately after start byte 0xdc) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd9, immediately after start byte 0xdb) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xfc, immediately after start byte 0xd9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xfb, immediately after start byte 0xfc) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf9, immediately after start byte 0xfb) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0x3f, immediately after start byte 0xf9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected continuation byte 0xab, with no preceding start byte) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected continuation byte 0xbb, with no preceding start byte) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf3, immediately after start byte 0xed) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf1, immediately after start byte 0xf3) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd1, immediately after start byte 0xf1) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe1, immediately after start byte 0xd1) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xe1) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xf3, immediately after start byte 0xe9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xfa, immediately after start byte 0xf3) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc1, immediately after start byte 0xfa) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xc9, immediately after start byte 0xc1) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xcd, immediately after start byte 0xc9) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0xd3, immediately after start byte 0xcd) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (unexpected non-continuation byte 0x0a, immediately after start byte 0xd3) in subroutine entry at test2.pl line 10.
          Malformed UTF-8 character (1 byte, need 2, after start byte 0xda) in subroutine entry at test2.pl line 10.
          _[]__________________________________________________________

          C:\Users\Andy>
      Any ideas? TIA! Andy

        Check the encoding of your input data. Decode all data (to unicode) before operating on it in any way.
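
A rough sketch of why decoding first matters for tr. The file name and its encoding are assumptions for the sake of the example (a browser sending "café.pdf" as UTF-8 octets); the real upload may arrive in a different encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical upload name "cafe.pdf" with an e-acute, as UTF-8 octets.
my $octets = "caf\xC3\xA9.pdf";

# Undecoded: the string holds the two octets C3 A9, so the character U+00E9 never matches.
(my $undecoded = $octets) =~ tr/\x{E9}/e/;
my $unchanged = $undecoded eq $octets ? 1 : 0;

# Decoded: the e-acute is now the single character U+00E9 and tr can see it.
my $chars = decode('UTF-8', $octets);
(my $decoded = $chars) =~ tr/\x{E9}/e/;

print "undecoded string left unchanged: ", ($unchanged ? "yes" : "no"), "\n";
print "decoded result: $decoded\n";
```

The same tr does nothing on the raw octets and works on the decoded string, which would match the symptom in the original question where only the ASCII space got replaced.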

      tr and Unicode don't mix well

      In what way?  Seems to work fine for me.  Could you provide an example that fails? (just curious)

          use Devel::Peek;
          my $test = "\x{2345}\x{3456}";
          Dump $test;
          $test =~ tr/\x{2345}\x{3456}/XY/;
          Dump $test;
          __END__
          SV = PV(0x768bc8) at 0x7907d8
            REFCNT = 1
            FLAGS = (PADMY,POK,pPOK,UTF8)
            PV = 0x782630 "\342\215\205\343\221\226"\0 [UTF8 "\x{2345}\x{3456}"]
            CUR = 6
            LEN = 8
          SV = PV(0x768bc8) at 0x7907d8
            REFCNT = 1
            FLAGS = (PADMY,POK,pPOK,UTF8)
            PV = 0x782630 "XY"\0 [UTF8 "XY"]
            CUR = 2
            LEN = 8

      (I'm only using \x{...} here because PM code sections don't support Unicode — it works the same way with a UTF-8 encoded source file when using "use utf8;")

        In what way?

        By not supporting Unicode-aware character classes, and listing all Unicode characters in a certain category is usually a moot endeavor.

        The OP is the best example: it doesn't list all accented characters that could be ASCIIfied.

      I'd recommend this approach because no matter what you put into your list, you're going to miss a character, especially as new characters are added (who cared about € 14 years ago?) What happens if someone inserts Å? á? Ø? Æ? ð ? þ ? — ? „ ? Kanji? Mathematical symbols?

      You're *much* better off listing the characters that you want to keep, and removing all others, if only because it's less work to maintain in the long run, as you don't have to worry about adding new characters, or if someone's messed up the encoding of the script.
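
A minimal sketch of the keep-list idea, assuming the allowed set is ASCII letters, digits, dot, underscore and hyphen (adjust to taste). The function name sanitize_filename and the sample file name are made up for illustration. Note that the range is written A-Z: a range of A-z would also let through [ \ ] ^ _ and the backtick, which sit between Z and a in ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the listed characters,
# turn every other character into an underscore.
sub sanitize_filename {
    my ($name) = @_;
    (my $clean = $name) =~ s/[^A-Za-z0-9._-]/_/g;
    return $clean;
}

# Accented letters, the space and the brackets are all outside the keep-list.
print sanitize_filename("r\x{E9}sum\x{E9} [v2].pdf"), "\n";   # r_sum___v2_.pdf
```

Nothing here needs updating when a new character shows up in a file name, since anything not explicitly allowed is replaced.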

Re: tr{}{} doesn't wanna work.. what am I doing wrong?
by runrig (Abbot) on Feb 24, 2012 at 15:41 UTC
    It seems like Text::Unidecode would do what you want (and a better job of it).
Re: tr{}{} doesn't wanna work.. what am I doing wrong?
by tchrist (Pilgrim) on Feb 25, 2012 at 00:23 UTC
    Why do you wish to commit irreversible injury to your data?

    Where is the eye of the needle in your data-processing that will allow nothing but the very most primitive of pre-electric typewriter keys through its tiny aperture?

    Why aren't you using UAX#44, UAX#15, UTS#10, and UTS#35 to guide you in this?

    Are you completely certain that you have no choice but to turn back the calendar by fifty years and more?

    Because if you are, then you are going about this wrong. And if you aren't, you shouldn't be doing it at all.
