converting smart quotes

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: converting smart quotes by tobyink (Canon) on Mar 19, 2012 at 22:26 UTC
You have a utf8-encoded string. You need to convert it to Perl's native Unicode string format (which also happens to be utf8-encoded internally, but marked with a special flag such that multibyte sequences are treated as single characters). You can do this like: `utf8::decode($string);` The `utf8::decode` function works in-place (like `chomp`), so you can just call it in a void context. That said, you won't find a \x92 character on the page you linked to, because there is none. There's a \x{2019} character though. The following takes the page content, and makes ASCII control characters and non-ASCII characters visible. `use 5.010001; use LWP::UserAgent; my $url = 'http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml'; my $content = LWP::UserAgent->new->get($url)->content; utf8::decode($content); $content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex; print $content;` `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
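A minimal offline sketch of the decoding step described above. It assumes a hypothetical sample string in place of the downloaded page, and it narrows the upper bound of the character class to U+10FFFF (the top of the Unicode range); both choices are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: the UTF-8 bytes for "What's new" written with a
# right single quotation mark (U+2019), standing in for the page content.
my $content = "What\xE2\x80\x99s new";

# Before decoding, Perl sees one element per byte.
my $bytes = length $content;

# Decode in place: the three-byte sequence E2 80 99 becomes one character.
utf8::decode($content);
my $chars = length $content;

# Same idea as the substitution in the post: replace control characters and
# non-ASCII characters with a visible [U+XXXX] marker.
(my $visible = $content) =~ s{
    ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{10FFFF}])
}{
    sprintf('[U+%04X]', ord($1))
}gex;

print "$bytes bytes, $chars characters\n";
print "$visible\n";
```

The length drops from 12 to 10 because the smart quote stops counting as three separate bytes once the string is decoded.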
Re^2: converting smart quotes by ikegami (Patriarch) on Mar 19, 2012 at 23:05 UTC
`my $content = LWP::UserAgent->new->get($url)->content; utf8::decode($content);` can be replaced with `my $content = LWP::UserAgent->new->get($url)->decoded_content;`
Re^3: converting smart quotes by tobyink (Canon) on Mar 20, 2012 at 00:51 UTC
In this case, yes, but `decoded_content` does a lot of other stuff besides. Also I wanted the OP to have a better idea of what's going on - that when Perl gets data from the outside world it's often in bytes, which need decoding into Perl's native Unicode representation.
Re^4: converting smart quotes by ikegami (Patriarch) on Mar 20, 2012 at 03:08 UTC
Re: converting smart quotes by ww (Archbishop) on Mar 19, 2012 at 23:11 UTC
There are some problems in the way you posted which make it very hard to know just how to help. The chief issue is your link to an IBM page (where "this page" eq `http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml`) which leads me to a TOC where the phrase "What's New" is NOT found. As I'm sure you can imagine, even the Monks most generous with their time may consider that a war-stopper. The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes." At least in the context of M$ Word, that phrase refers to four very specific characters: ‘ (`&#145;` or `0x91`), ’ (`&#146;`), “ (`&#147;`) and ” (`&#148;`). That leaves me completely at sixes and nines as to what you mean by your second paragraph. It would probably be better to actually insert the chars inside quotes or somesuch so we can see what's giving you grief... and thus be more likely able to help. (It would also be helpful were you to post a compilable snippet of your code [the bare minimum to show us how you're trying to deal with the non-ascii chars]). Third, demoronizer is probably not quite up to the job, unless you make the same patch to your copy (assuming versions are the same) that derby provided in a reply in the thread you cited. And, fourth, please use tags from the PM variant of HTML; especially, please use the `[id://485212]` method of creating links. If you link with a full `a href...`, your link will result in some significant fraction of the Monks who follow it finding themselves logged out. For further reference, see What shortcuts can I use for linking to other information?.
Re^2: converting smart quotes by ikegami (Patriarch) on Mar 20, 2012 at 03:23 UTC
The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes." MS smart quotes are `91` (‘) and `92` (’) in cp1252. They are U+2018 and U+2019, so they are actually written as `&#x2018;` and `&#x2019;` in HTML. `&#145;` and `&#146;` refer to other characters that aren't even present in cp1252. U+2018 and U+2019 are `E2 80 98` and `E2 80 99` in UTF-8, so the OP is indeed referring to smart quotes.
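The byte values quoted above can be checked with the core `Encode` module. This is a small sketch under the assumption that `Encode` is available (it ships with Perl); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# In cp1252 the single byte 0x92 is the right single quotation mark.
my $cp1252_bytes = "\x92";
my $from_cp1252  = decode('cp1252', $cp1252_bytes);

# In UTF-8 the same character is the three bytes E2 80 99.
my $utf8_bytes = "\xE2\x80\x99";
my $from_utf8  = decode('UTF-8', $utf8_bytes);

# Both byte strings decode to the same code point.
my $cp_a = sprintf 'U+%04X', ord $from_cp1252;
my $cp_b = sprintf 'U+%04X', ord $from_utf8;

# Going the other way: encode U+2018 and U+2019 as UTF-8 and show the bytes
# as space-separated hex pairs.
my $hex_2018 = uc join ' ', unpack('(H2)*', encode('UTF-8', "\x{2018}"));
my $hex_2019 = uc join ' ', unpack('(H2)*', encode('UTF-8', "\x{2019}"));

print "cp1252 0x92 -> $cp_a, UTF-8 E2 80 99 -> $cp_b\n";
print "U+2018 -> $hex_2018, U+2019 -> $hex_2019\n";
```

So which bytes represent a smart quote depends entirely on the encoding the data arrived in, which is why the decoding step has to match the source.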
Re^2: converting smart quotes by slugger415 (Monk) on Mar 20, 2012 at 04:18 UTC
Hi all, thank you so much for your comments and suggestions, and many apologies for my bad linking and explanations. Some responses: First off, I'm not sure why (ww) you don't see the What's new string. Perhaps these screen grabs will help describe what I'm talking about, from the above URL (and I hope I'm not breaking a rule here): pic 1, pic 2. 2nd, I believe your example #2 is the smart quote I'm discussing, though it appears slightly differently in my text editor than it does in my browser. Here's a paste of the text here: `What�s new` As for my specific Perl code: `my $browser = LWP::UserAgent->new; my $response = $browser->get( "http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml" ); my $content = $$response{_content}; ## yes inefficient coding, but it works open(OUT, ">content.html"); print OUT $content; close(OUT);` Adding utf8::decode to that, as suggested: `utf8::decode($content); $content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex;` produces this: `What[U+2019]s new` At least it's finding it! But I confess I don't follow the regex there (I'm still learning...), and is there some shortcut in the code I'm missing? Sorry if I'm asking dumb questions here or just not getting it. And I would like to better understand that regex -- is there some place to learn more about that? Thank you all once again.
Re^3: converting smart quotes by tangent (Parson) on Mar 20, 2012 at 12:05 UTC
tobyink's regular expression is making the character (and others if present) visible. To convert the specific character you mention to a normal ASCII single quote: `$content =~ s/\x{2019}/'/g;`
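If the other smart quotes mentioned earlier in the thread also need converting, one possible extension of the substitution above is a single `tr///`. This is only a sketch: the sample string is hypothetical, and it assumes the text has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical decoded sample containing all four smart quotes, written with
# \x{...} escapes so this source file stays plain ASCII.
my $content = "\x{201C}What\x{2019}s new\x{201D} and \x{2018}old\x{2019}";

# Map the left and right single quotes to ' and the left and right double
# quotes to ", one character for one character.
(my $ascii = $content) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;

print "$ascii\n";
```

Because `tr///` maps characters one for one, it only works on a decoded string; applied to raw UTF-8 bytes it would not see U+2019 at all.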
Re^3: converting smart quotes by ww (Archbishop) on Mar 20, 2012 at 12:14 UTC
My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that. The regex is using a "character class" to match any single instance of a character in the range `\x00` through `\x08`, or `\x0b`, `\x0c`, or `\x0e` through `\x1f`, or ... ... well, at that point, I'm thoroughly puzzled. The curly bracket notation in the last element is usually used to specify ('quantify') the number of instances of a preceding character, but in this case, my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of `{}`s around the `1FFFFF` in `\x{1FFFFF}`. As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.
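On the curly brackets: in Perl, `\x{...}` is the brace form of a hexadecimal character escape, which names one character by its code point, so the braces there are not a quantifier. A short sketch of the difference follows; the upper bound U+10FFFF is chosen here for illustration instead of the `1FFFFF` used in the thread.

```perl
use strict;
use warnings;

# \x{2019} names one character by its hexadecimal code point.
my $quote = "\x{2019}";

# Inside a character class, \x80-\x{10FFFF} is a range from U+0080 up to
# U+10FFFF, so it matches any single non-ASCII character.
my $quote_in_class = $quote =~ /^[\x80-\x{10FFFF}]$/ ? 1 : 0;

# Plain ASCII falls outside that range.
my $ascii_in_class = 'a' =~ /^[\x80-\x{10FFFF}]$/ ? 1 : 0;

# By contrast, braces that follow an atom are a quantifier: x{3} means "xxx".
my $quantifier_ok = 'xxx' =~ /^x{3}$/ ? 1 : 0;

print "smart quote in class: $quote_in_class\n";
print "ASCII letter in class: $ascii_in_class\n";
print "x{3} matches 'xxx': $quantifier_ok\n";
```

The two-digit form such as `\x80` needs no braces; the braces are required once the code point has more than two hex digits.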
Re^4: converting smart quotes by tobyink (Canon) on Mar 20, 2012 at 13:14 UTC
Re^5: converting smart quotes by slugger415 (Monk) on Mar 20, 2012 at 14:49 UTC
Re^5: converting smart quotes by slugger415 (Monk) on Mar 20, 2012 at 14:30 UTC
Some notes below your chosen depth have not been shown here

