http://www.perlmonks.org?node_id=1197846

In 2002 I wanted to collect a bunch of code together to administer a gaming system. The main loop of the program accepted user input and invoked a routine to process the request. Eventually I decided that the idea of having a generic shell to which user-defined commands could be added would be useful and a crude version of a command shell was introduced in 2004. Over the intervening years the script and support routines have grown to over 8000 lines of code. Today I released version 3.0.

Even if you don't find the concept of a command shell useful, there is a large collection of helpful functions in cs_fn.pl. For example, you will never get a "Wide character in print" error if you pass your strings to safeString.

print safeString("\x{263A}\n");

Home Page: http://www.exelana.com/techie/perl/cs.html

Documentation: http://www.exelana.com/techie/perl/CommandShell.pdf

Download: http://www.exelana.com/techie/perl/cs.tgz

Let me know what you think!

Re: Command Shell
by Your Mother (Chancellor) on Aug 26, 2017 at 03:28 UTC

    This is only an encoding output layer / character encoding mismatch. It should be solved with proper layers, not hacks.

    moo@cow[6]~>perl -le 'print "\x{263A}"'
    Wide character in print at -e line 1.
    ☺
    moo@cow[7]~>perl -CSD -le 'print "\x{263A}"'
    ☺
    # OR
    moo@cow[8]~>perl -MEncode -le 'print encode_utf8("\x{263A}")'
    ☺
    # OR
    moo@cow[9]~>perl -le 'binmode STDOUT, ":utf8"; print "\x{263A}"'
    ☺
    # OR
    moo@cow[10]~>perl -Mopen=:std,":encoding(UTF-8)" -le 'print "\x{263A}"'
    ☺
    

    This is in a UTF-8 aware terminal. These things RFC:MUST line up at every level to be correct: input, processing, output. There are no shortcuts and there is no need for hacks or special functions if it’s done properly.
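
    A minimal sketch of lining up those three levels explicitly, assuming the input bytes are UTF-8 and STDOUT has no encoding layer. It uses only the core Encode module; the variable names are illustrative and not taken from cs_fn.pl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: raw bytes as they would arrive from a file or socket.
# "\xE2\x98\xBA" is the UTF-8 encoding of U+263A (WHITE SMILING FACE).
my $bytes_in = "\xE2\x98\xBA";

# Decode at the input boundary: bytes -> characters.
my $chars = decode('UTF-8', $bytes_in);

# Processing: work on characters, not bytes.
my $char_count = length $chars;       # counts characters
my $byte_count = length $bytes_in;    # counts bytes

# Encode at the output boundary: characters -> bytes.
my $bytes_out = encode('UTF-8', "$chars x $char_count\n");

# The string is already encoded bytes, so printing it to a layer-less
# STDOUT does not trigger "Wide character in print".
print $bytes_out;
```

    The same effect comes from putting an :encoding(UTF-8) layer on the handle, as in the one-liners above, and printing the character string directly. Doing both at once would double-encode the output.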

    Unicode and character encoding issues are difficult at first and messy but they must be understood or else it’s generally an accident or luck when things appear to work. Related reading–

      I wish I had seen Joel's article in 2003 as it would have probably saved me dozens of hours on Google reading various explanations for why I was running into some particular weirdness trying to process the contents of someone's webpage. It was an interesting read and filled in some bits of historical context that I either never knew or had forgotten. A lot has changed on computers since I wrote my first program in 1975 and some of it I remember vividly and some of it I've relegated to my internal bit bucket. (By the way, I wrote programs on punched cards back in the day and know what a real bit bucket is.) At any rate, I didn't learn anything new about encodings from Joel's article. But, I will quote him:

      Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

      If everyone would put the charset in their Content-Type header (or META tag) and actually follow the encoding rules for that encoding, life would be grand, but I can assure you after parsing the contents of thousands of web pages they don't! And since I have zero control over what people put on their websites, I can either throw up my hands and ignore the content or I can hack together some code that takes whatever they send me and try to make sense of it. For what it's worth, here's the current version of my code to try to figure out the encoding for a webpage so that I can pass its contents to Encode::decode.

      ####################################################################
      # Look at the first bytes of the content
      #   FE FF    -> utf-16be
      #   FF FE    -> utf-16le
      #   EF BB BF -> utf-8
      return "utf-16be" if $resp->content =~ m/^\xFE\xFF/;
      return "utf-16le" if $resp->content =~ m/^\xFF\xFE/;
      return "utf-8"    if $resp->content =~ m/^\xEF\xBB\xBF/;

      ####################################################################
      # Use the header
      #   content-type: text/html; charset=XXX
      my $ct = $resp->{_headers}->{'content-type'} || "";
      my ($cs) = lc($ct) =~ m/\s+charset\s*=\s*["']*([^\s'";]*)/;
      $cs =~ s/utf=/utf\-/g if defined($cs);
      return $cs if defined($cs) && $cs ne "";

      ####################################################################
      # Use the META tag
      #   <meta http-equiv="content-type" content="text/html; charset=XXX">
      ($ct) = $resp->content =~ m/<meta http-equiv=["']?content-type["']? content=["']([^"']*)["']/i;
      ($cs) = $ct =~ m/\s+charset\s*=\s*([^\s']*)/ if defined($ct) && $ct ne "";
      $cs =~ s/utf=8/utf\-8/ig if defined($cs);
      return $cs if defined($cs) && $cs ne "";

      ####################################################################
      # Default character set based on encoding of characters
      return "utf-8" if $resp->content =~ m/[^\x00-\x7F]/;
      return "iso-8859-1";

      And even after all that effort to determine the correct encoding, I still get content that needs more tender-loving care. So after I've done everything I can to determine the encoding and passed the result to Encode::decode, I still have to run it through safeString to be sure that I don't encounter any further weirdness. You asked in another thread if I had example webpages for the problems I'm trying to solve. Perhaps I should have saved the URLs or the content of those pages so that I could look back and marvel at how the content editor ever expected it to work, but I didn't. I tweaked my code to handle the weirdness and moved on.

        I just noticed that I handle "utf=8" differently in the Content-Type block than in the META block. That's probably because I encountered that particular bit of stupidity in both contexts at different times.
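
        A charset label recovered this way can still be one that Encode does not know, or one that does not match the actual bytes. The following is a rough sketch of a defensive decode step under those assumptions. decode_guessed is a hypothetical helper name; it is not part of cs_fn.pl and is not a description of what safeString does.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: decode $bytes using a guessed charset label,
# falling back when the label is unknown or the bytes do not match it.
sub decode_guessed {
    my ($bytes, $label) = @_;

    # find_encoding returns undef for labels Encode does not recognise.
    my $enc = defined $label ? find_encoding($label) : undef;

    if ($enc) {
        # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
        # $bytes intact so the fallback below can still use it.
        my $text = eval {
            $enc->decode($bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return $text if defined $text;
    }

    # Last resort: every byte sequence is valid ISO-8859-1.
    return decode('iso-8859-1', $bytes);
}

# Well-formed UTF-8 is decoded with the guessed label.
my $smiley = decode_guessed("\xE2\x98\xBA", 'utf-8');

# A lone \xE9 is not valid UTF-8, so the Latin-1 fallback is used.
my $e_acute = decode_guessed("\xE9", 'utf-8');
```

        The Latin-1 fallback never fails, which also means it can silently produce mojibake; whether that is acceptable depends on what happens to the text afterwards.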
Re: Command Shell
by tdlewis77 (Sexton) on Aug 23, 2017 at 06:36 UTC

    Apparently I didn't have a good enough test case as the following still gets a "Wide character in print" error:

    print safeString("\x{26C4}\n");

    First bug to fix in 3.1. :-)

      This is a fun test case:

      map { $s = sprintf('\x{26%02x}',$_); print "$s: ",safeString(eval("\"$s\"")),"\n" } 0..255