
converting smart quotes

by slugger415 (Monk)
on Mar 19, 2012 at 21:59 UTC ( [id://960478]=perlquestion )

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I've looked at this topic. I'm trying to convert smart quotes and other "dumb" characters to standard ASCII characters; nothing I've found, including that topic, has worked.

demoroniser seems to recognize the single smart quote as three separate characters, and makes the 3rd one a <SUP> element.

If you look at this page you'll see the problematic single quote in the What's New string. I can't seem to search on it with a regex or \x92 or one of those... I would like to find and replace all such miscreant characters. Running Tidy converts it to three character entities... I haven't seen a bit of code that works.

Any suggestions would be most appreciated, as always.

UPDATE: also tried HTML::Entities:

encode_entities($b);

result:

What&acirc;&#128;&#153;s

and ord() might work if I could properly search on it...

Scott

Replies are listed 'Best First'.
Re: converting smart quotes
by tobyink (Canon) on Mar 19, 2012 at 22:26 UTC

    You have a utf8-encoded string. You need to convert it to Perl's native Unicode string format (which also happens to be utf8-encoded internally, but marked with a special flag such that multibyte sequences are treated as single characters).

    You can do this like:

    utf8::decode($string);

    The utf8::decode function works in-place (like chomp), so you can just call it in a void context.

    That said, you won't find a \x92 character on the page you linked to, because there is none. There's a \x{2019} character though.
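To make the bytes-versus-characters point concrete, here is a minimal sketch. The sample string is hypothetical: it assumes the page sends the curly apostrophe as the UTF-8 bytes E2 80 99, which matches the three entities in the original question.

```perl
use strict;
use warnings;

# Hypothetical sample: the raw UTF-8 bytes for "What’s" as fetched from the web.
my $string = "What\xE2\x80\x99s";

my $before = length($string);        # counts bytes before decoding
my $ok     = utf8::decode($string);  # decodes in place; false if not valid UTF-8
my $after  = length($string);        # counts characters after decoding

printf "%d bytes -> %d characters, 5th is U+%04X\n",
    $before, $after, ord(substr($string, 4, 1));
```

Once decoded, the apostrophe is one character, so a pattern such as \x{2019} can match it.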

     

    The following takes the page content, and makes ASCII control characters and non-ASCII characters visible.

    use 5.010001;
    use LWP::UserAgent;

    my $url = 'http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml';
    my $content = LWP::UserAgent->new->get($url)->content;
    utf8::decode($content);

    # Replace control characters and non-ASCII characters with a visible [U+XXXX] tag.
    $content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex;
    print $content;
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      my $content = LWP::UserAgent->new->get($url)->content; utf8::decode($content);
      can be replaced with
      my $content = LWP::UserAgent->new->get($url)->decoded_content;

        In this case, yes, but decoded_content does a lot of other stuff besides. Also I wanted the OP to have a better idea of what's going on - that when Perl gets data from the outside world it's often in bytes, which need decoding into Perl's native Unicode representation.

        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: converting smart quotes
by ww (Archbishop) on Mar 19, 2012 at 23:11 UTC
    There are some problems in the way you posted which make it very hard to know just how to help.

    The chief issue is your link to an IBM page (where "this page" eq http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml) which leads me to a TOC where the phrase "What's New" is NOT found. As I'm sure you can imagine, even the Monks most generous with their time may consider that a war-stopper.

    The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes." At least in the context of M$ Word, the phrase refers to four very specific characters:

    1. (&#145; or 0x91)
    2. (&#146;)
    3. (&#147;)
    4. (&#148;)

    That leaves me completely at sixes and sevens as to what you mean by your second paragraph. It would probably be better to actually insert the chars inside quotes or somesuch so we can see what's giving you grief... and thus be more likely able to help. (It would also be helpful were you to post a compilable snippet of your code [the bare minimum to show us how you're trying to deal with the non-ASCII chars].)

    Third, demoronizer is probably not quite up to the job, unless you make the same patch to your copy (assuming versions are the same) that derby provided in a reply in the thread you cited.

    And, fourth, please use tags from the PM variant of HTML; especially, please use the [id://485212] method of creating links. If you link with a full a href..., your link will result in some significant fraction of the Monks who follow it finding themselves logged out. For further reference, see What shortcuts can I use for linking to other information?.

      The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes."

      MS smart quotes are 91 (‘) and 92 (’) in cp1252. They are U+2018 and U+2019, so they are actually written as &#x2018; and &#x2019; in HTML.

      &#x91; and &#x92; refer to other characters that aren't even present in cp1252.

      In UTF-8, U+2018 and U+2019 are encoded as the bytes E2 80 98 and E2 80 99, which is the three-byte sequence the OP is seeing, so the OP is indeed referring to smart quotes.
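The byte values above can be checked with the core Encode module. This is only a sketch of the arithmetic; the final step assumes, as a guess rather than anything confirmed in the thread, that a tool misreading the UTF-8 bytes as cp1252 is what yields three separate characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 0x92 interpreted as cp1252 is the right single quote.
my $char = decode('cp1252', "\x92");
printf "cp1252 0x92 -> U+%04X\n", ord($char);

# The same character written out as UTF-8 takes three bytes.
my $utf8_hex = join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $char);
print "U+2019 as UTF-8 -> $utf8_hex\n";

# Misreading those three bytes as cp1252 gives three unrelated characters.
my @garbled = map { sprintf 'U+%04X', ord($_) } split //, decode('cp1252', encode('UTF-8', $char));
print "UTF-8 bytes misread as cp1252 -> @garbled\n";
```

The third misread character is U+2122, the trademark sign, which would be consistent with a converter turning it into a superscript element.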

      Hi all, thank you so much for your comments and suggestions, and many apologies for my bad linking and explanations. Some responses:

      First off, I'm not sure why (ww) you don't see the What's new string. Perhaps these screen grabs will help describe what I'm talking about, from the above URL (and I hope I'm not breaking a rule here):

      pic 1

      pic 2

      2nd, I believe your example #2 is the smart quote I'm discussing, though it appears slightly differently in my text editor than it does in my browser. Here's a paste of the text here:

      What’s new

      As for my specific Perl code:

      my $browser = LWP::UserAgent->new;
      my $response = $browser->get( "http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml" );
      my $content = $$response{_content}; ## yes inefficient coding, but it works
      open(OUT, ">content.html");
      print OUT $content;
      close(OUT);

      Adding utf8::decode to that, as suggested:

      utf8::decode($content);
      $content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex;

      produces this:

      What[U+2019]s new

      At least it's finding it! But I confess I don't follow the regex there (I'm still learning...), and is there some shortcut in the code I'm missing?

      Sorry if I'm asking dumb questions here or just not getting it. And I would like to better understand that regex -- is there some place to learn more about that?

      Thank you all once again.

        tobyink's regular expression is making the character (and others if present) visible.
        To convert the specific character you mention to a normal ASCII single quote:
        $content =~ s/\x{2019}/'/g;
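If the other curly quotes need the same treatment, one option is a single tr/// pass. This is a sketch only: straighten_quotes is a hypothetical helper name, the choice of four characters is an assumption, and the input must already be decoded to characters.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: maps the four curly quotes to ASCII.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018/U+2019 become ', U+201C/U+201D become "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

my $sample = "\x{201C}What\x{2019}s new\x{201D}";
print straighten_quotes($sample), "\n";
```

Because tr/// maps character for character, it suits one-to-one replacements like these; anything that expands to several characters (an ellipsis, say) would still need s///.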

        My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

        The regex is using a "character class" to match any single instance of a character in the range \x00 through \x08, or \x0B, \x0C, \x0E through \x1F, or ...

        ... well, at that point, I'm thoroughly puzzled. The curly bracket notation in the last element is usually used to specify ('quantify') the number of instances of a preceding character, but in this case, my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s in \x{1FFFFF}.

        As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.
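On the curly-bracket question: in Perl, \x{...} is the hex escape for a single code point, with the braces allowing more than two hex digits, so \x80-\x{1FFFFF} is one range inside the character class rather than a quantifier. A small sketch follows; show_hidden is a hypothetical wrapper around the regex from earlier in the thread, and its upper bound of U+10FFFF (the last Unicode code point) is this sketch's assumption, not the original value.

```perl
use strict;
use warnings;

# \x{2019} is one character, not "\x" followed by a quantifier.
my $quote = "\x{2019}";
printf "length %d, code point U+%04X\n", length($quote), ord($quote);

# Hypothetical wrapper: tags control and non-ASCII characters as [U+XXXX].
# Tab (\x09), newline (\x0A) and carriage return (\x0D) are deliberately left alone.
sub show_hidden {
    my ($text) = @_;
    $text =~ s{ ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{10FFFF}]) }
              { sprintf('[U+%04X]', ord($1)) }gex;
    return $text;
}

print show_hidden("What\x{2019}s new"), "\n";
```

The /x flag lets the pattern contain whitespace for readability, /e evaluates the replacement as Perl code, and /g applies it to every match.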
