Re: converting smart quotes

by tobyink (Canon)
on Mar 19, 2012 at 22:26 UTC


in reply to converting smart quotes

You have a utf8-encoded string. You need to convert it to Perl's native Unicode string format (which also happens to be utf8-encoded internally, but marked with a special flag so that multibyte sequences are treated as single characters).

You can do this with:

utf8::decode($string);

The utf8::decode function works in-place (like chomp), so you can just call it in a void context.
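
To make the in-place behaviour concrete, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the OP's page); the bytes E2 80 99 are the UTF-8 encoding of U+2019.

```perl
use strict;
use warnings;

# Hypothetical sample: "it", then the three UTF-8 bytes of U+2019, then "s".
my $string = "it\xE2\x80\x99s";

my $before = length $string;        # counts bytes: the quote is 3 of them
my $ok     = utf8::decode($string); # modifies $string in place, true on success
my $after  = length $string;        # counts characters: the quote is now 1

printf "%d bytes -> %d characters, third character is U+%04X\n",
    $before, $after, ord substr($string, 2, 1);
```

The return value is worth checking when the input might not be valid UTF-8, since utf8::decode returns false in that case.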

That said, you won't find a \x92 character on the page you linked to, because there is none. There's a \x{2019} character though.
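
One possible source of the confusion, and this is only a guess about the OP's situation: byte 0x92 is the right single quote in Windows-1252, and that byte decodes to the same U+2019 character. The sketch below shows that mapping and one way to turn the usual smart quotes into ASCII once the string has been decoded. The sample text and the choice of which four characters to replace are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# In Windows-1252 the byte 0x92 is the right single quotation mark.
my $from_cp1252 = decode('cp1252', "\x92");
printf "cp1252 0x92 decodes to U+%04X\n", ord $from_cp1252;

# Hypothetical, already-decoded text containing curly quotes.
my $text = "\x{201C}It\x{2019}s\x{201D}";

# Map left/right single and double curly quotes to their ASCII equivalents.
(my $ascii = $text) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
print "$ascii\n";
```

The tr only sees these as single characters after decoding; run on the raw bytes it would not match anything.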

The following takes the page content, and makes ASCII control characters and non-ASCII characters visible.

use 5.010001;
use LWP::UserAgent;

my $url = 'http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml';
my $content = LWP::UserAgent->new->get($url)->content;
utf8::decode($content);

$content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex;

print $content;
perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Replies are listed 'Best First'.
Re^2: converting smart quotes
by ikegami (Patriarch) on Mar 19, 2012 at 23:05 UTC
    my $content = LWP::UserAgent->new->get($url)->content; utf8::decode($content);
    can be replaced with
    my $content = LWP::UserAgent->new->get($url)->decoded_content;

      In this case, yes, but decoded_content does a lot of other stuff besides. Also I wanted the OP to have a better idea of what's going on - that when Perl gets data from the outside world it's often in bytes, which need decoding into Perl's native Unicode representation.

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

        In this case, yes, but decoded_content does a lot of other stuff besides

        Stuff that must be done, such as removing compression when used.
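
The difference between the two approaches can be sketched offline. The response below is built by hand rather than fetched, so the headers and body are assumptions for illustration; it imitates a server that sends gzip-compressed UTF-8 text.

```perl
use strict;
use warnings;
use Encode qw(encode);
use IO::Compress::Gzip qw(gzip $GzipError);
use HTTP::Response;

# Hypothetical payload: text with U+2019, encoded as UTF-8, then gzipped.
my $text  = "It\x{2019}s compressed";
my $bytes = encode('UTF-8', $text);
gzip(\$bytes => \my $gzipped) or die "gzip failed: $GzipError";

# A hand-built response standing in for what LWP would return.
my $res = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'     => 'text/plain; charset=UTF-8',
        'Content-Encoding' => 'gzip',
    ],
    $gzipped,
);

# content() returns the body as sent: still compressed, still bytes.
my $raw  = $res->content;
my $copy = $raw;
my $ok   = utf8::decode($copy);   # compressed data is not expected to be valid UTF-8
print "utf8::decode on the raw body: ", ($ok ? "succeeded" : "failed"), "\n";

# decoded_content() undoes the Content-Encoding, then decodes the charset.
my $decoded = $res->decoded_content;
(my $visible = $decoded) =~ s/([^\x00-\x7F])/sprintf('[U+%04X]', ord $1)/ge;
print "decoded_content: $visible\n";
```

When the server sends plain uncompressed UTF-8, as in the example earlier in the thread, both routes should end up with the same character string.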
