
Re^2: converting smart quotes

by slugger415 (Scribe)
on Mar 20, 2012 at 04:18 UTC

in reply to Re: converting smart quotes
in thread converting smart quotes

Hi all, thank you so much for your comments and suggestions, and many apologies for my bad linking and explanations. Some responses:

First off, I'm not sure why (ww) you don't see the What's new string. Perhaps these screen grabs will help describe what I'm talking about, from the above URL (and I hope I'm not breaking a rule here):

pic 1

pic 2

2nd, I believe your example #2 is the smart quote I'm discussing, though it appears slightly differently in my text editor than it does in my browser. Here's a paste of the text here:

What’s new

As for my specific Perl code:

my $browser = LWP::UserAgent->new;
my $response = $browser->get( " +r/brjrules/v7r0m3/basic/tocView.jsp?toc=/ +s.doc/toc.xml" );
my $content = $$response{_content}; ## yes inefficient coding, but it works
open(OUT, ">content.html");
print OUT $content;
close(OUT);

Adding utf8::decode to that, as suggested:

utf8::decode($content);
$content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) } { sprintf('[U+%04X]', ord($1)) }gex;

produces this:

What[U+2019]s new

At least it's finding it! But I confess I don't follow the regex there (I'm still learning...), and is there some shortcut in the code I'm missing?

Sorry if I'm asking dumb questions here or just not getting it. And I would like to better understand that regex -- is there some place to learn more about that?

Thank you all once again.

Replies are listed 'Best First'.
Re^3: converting smart quotes
by tangent (Priest) on Mar 20, 2012 at 12:05 UTC
    tobyink's regular expression is making the character (and others if present) visible.
    To convert the specific character you mention to a normal ASCII single quote:
    $content =~ s/\x{2019}/'/g;
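If other typographic quotes turn up alongside U+2019, the same idea can be extended with tr///. This is only a sketch: ascii_quotes is a hypothetical helper name, the choice of four code points is an assumption about what the page contains, and the string is assumed to be decoded already (for example with utf8::decode).

```perl
use strict;
use warnings;

# Hypothetical helper: map common typographic quotes to ASCII equivalents.
# Expects a character string that has already been decoded.
sub ascii_quotes {
    my ($text) = @_;
    # U+2018/U+2019 (single quotes) become ', U+201C/U+201D (double quotes) become "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

my $sample = "What\x{2019}s \x{201C}new\x{201D}";
print ascii_quotes($sample), "\n";   # prints: What's "new"
```

tr/// maps characters one-to-one, so it suits this kind of fixed substitution table; s/// is still the tool when the replacement has to be computed.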
Re^3: converting smart quotes
by ww (Bishop) on Mar 20, 2012 at 12:14 UTC

    My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

    The regex is using a "character class" to match any single instance of a character in the range \x00 through \x08, or \x0B, \x0C, \x0E through \x1F, or ...

    ... well, at that point, I'm thoroughly puzzled. The curly bracket notation in the last element is usually used to specify ('quantify') the number of instances of a preceding character, but in this case, my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s around the \x{1FFFFF}.

    As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.

      In a regular expression, the "\xNN" escape takes at most two hexadecimal digits, so it can only match characters in the range "\x00" to "\xFF". Adding braces like "\x{1FFFFF}" allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it - search it for "long hex char".

      Escapes like this also work in interpolated strings. e.g.

      perl -Mutf8::all -E'say qq(\x{263a})'
      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
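The difference between the braced and unbraced forms can be checked directly. The sketch below is illustrative only; the variable names are made up, and the upper bound in the character class is written as U+10FFFF (the highest Unicode code point) rather than the 1FFFFF used earlier in the thread.

```perl
use strict;
use warnings;

# Without braces, \x reads at most two hex digits; the rest is literal text.
my $short = "\x2019";      # chr(0x20) (a space) followed by the characters "19"
my $long  = "\x{2019}";    # the single character U+2019

printf "%d %d\n", length($short), length($long);   # prints: 3 1
printf "U+%04X\n", ord($long);                     # prints: U+2019

# The braced form works the same way inside a regex character class.
my $count = () = "caf\x{E9} \x{263A}" =~ /[\x80-\x{10FFFF}]/g;
print "$count\n";                                  # prints: 2
```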

        Ok I think I've gotten this working in my own way (that my novice brain can understand):

        utf8::decode($content);
        while ($content =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/) {
            my $char = '&#' . ord($1) . ';';
            $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/$char/;
        }

        I'm sure there's a more efficient way to do this, but the resulting &#xxxx; structure seems to work. And the important thing is the regex should be finding all those weirdo characters. Thank you!
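One possibly more efficient variant does the whole job in a single substitution: the /g flag handles every match, and /e evaluates the replacement as Perl code for each one. This is a sketch rather than a drop-in fix; to_entities is a hypothetical name, the input is assumed to be decoded already, and the upper bound of the class is written as U+10FFFF here.

```perl
use strict;
use warnings;

# Hypothetical helper: replace control and non-ASCII characters with
# decimal numeric character references in one pass over the string.
sub to_entities {
    my ($text) = @_;
    # /g repeats for every match; /e runs the replacement as an expression.
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{10FFFF}])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print to_entities("What\x{2019}s new"), "\n";   # prints: What&#8217;s new
```

Because each character is matched once and replaced on the spot, the string is not rescanned from the start for every replacement, as it is in the while loop.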

        Thanks -- so what does sprintf('[U+%04X]', ...) do, and why is it coming out as What[U+2019]s new?
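A short sketch of what that sprintf call computes, taken on its own: ord returns the character's code point as a number, and the %04X format prints that number in uppercase hexadecimal, zero-padded to at least four digits. The surrounding "[U+" and "]" are literal text in the format string. In the substitution, the /e flag makes Perl run this expression for each matched character and use the result as the replacement, which is why the smart quote shows up as [U+2019]. The variable names below are made up for illustration.

```perl
use strict;
use warnings;

my $char = "\x{2019}";      # the right single quotation mark from "What's new"
my $code = ord($char);      # 8217 in decimal, 2019 in hexadecimal

# %X formats a number as uppercase hex; 04 pads it with zeros to four digits.
my $label = sprintf('[U+%04X]', $code);
print "$label\n";                               # prints: [U+2019]

# A code point with fewer hex digits shows the zero padding.
print sprintf('[U+%04X]', ord("\t")), "\n";     # prints: [U+0009]
```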
