PerlMonks  

Re^2: converting smart quotes

by slugger415 (Scribe)
on Mar 20, 2012 at 04:18 UTC


in reply to Re: converting smart quotes
in thread converting smart quotes

Hi all, thank you so much for your comments and suggestions, and many apologies for my bad linking and explanations. Some responses:

First off, I'm not sure why (ww) you don't see the What's new string. Perhaps these screen grabs will help describe what I'm talking about, from the above URL (and I hope I'm not breaking a rule here):

pic 1

pic 2

2nd, I believe your example #2 is the smart quote I'm discussing, though it appears slightly differently in my text editor than it does in my browser. Here's a paste of the text here:

What’s new

As for my specific Perl code:

my $browser = LWP::UserAgent->new;
my $response = $browser->get( "http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml" );
my $content = $$response{_content}; ## yes inefficient coding, but it works
open(OUT, ">content.html");
print OUT $content;
close(OUT);

Adding utf8::decode to that, as suggested:

utf8::decode($content);
$content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) }
              { sprintf('[U+%04X]', ord($1)) }gex;

produces this:

What[U+2019]s new

At least it's finding it! But I confess I don't follow the regex there (I'm still learning...), and is there some shortcut in the code I'm missing?

Sorry if I'm asking dumb questions here or just not getting it. And I would like to better understand that regex -- is there some place to learn more about that?

Thank you all once again.


Re^3: converting smart quotes
by tangent (Curate) on Mar 20, 2012 at 12:05 UTC
    tobyink's regular expression is making the character (and others if present) visible.
    To convert the specific character you mention to a normal ASCII single quote:
    $content =~ s/\x{2019}/'/g;
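If the page also contains the other common smart quotes, the same idea can be extended with tr///. This is only a sketch: straighten_quotes is a hypothetical helper name, the choice of four quote characters is an assumption, and it presumes $content has already been decoded (for example with utf8::decode, as above).

```perl
use strict;
use warnings;

# Hypothetical helper: map the four common "smart" quotes to ASCII.
# Assumes the string already holds decoded characters, not raw UTF-8 bytes.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018/U+2019 become ', U+201C/U+201D become "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

print straighten_quotes("What\x{2019}s new"), "\n";   # prints: What's new
```

tr/// works one character at a time, so it only suits one-to-one replacements like these; anything that expands to several characters still needs s///.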
Re^3: converting smart quotes
by ww (Bishop) on Mar 20, 2012 at 12:14 UTC

    My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

    The regex is using a "character class" to match any single instance of a character in the range \x00 through \x08, or \x0b, \x0c, \x0e through \x1f or ...

    ... well, at that point, I'm thoroughly puzzled. The curly bracket notation in the last element is usually used to specify ('quantify') the number of instances of a preceding character, but in this case, my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s around the hex digits in \x{1FFFFF}.

    As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.

      In a regular expression, the "\xNN" escape takes at most two hexadecimal digits, so can only match characters in the range "\x00" to "\xFF". Adding braces like "\x{1FFFFF}" allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it - search it for "long hex char".

      Escapes like this also work in interpolated strings. e.g.

      perl -Mutf8::all -E'say qq(\x{263a})'
      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
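A quick way to see the difference between the two escape forms is to compare what each one produces. This is a small sketch with made-up variable names, not anything from the posts above.

```perl
use strict;
use warnings;

my $braced   = "\x{2019}";   # one character: RIGHT SINGLE QUOTATION MARK
my $unbraced = "\x2019";     # three characters: chr(0x20), then "1" and "9"

printf "braced:   length %d, first code point 0x%04X\n",
    length($braced), ord($braced);
printf "unbraced: length %d, first code point 0x%04X\n",
    length($unbraced), ord($unbraced);

# The same rule holds inside a character class, so the upper bound
# of the range in the thread has to be written with braces.
my $in_class = $braced =~ /[\x80-\x{1FFFFF}]/ ? 1 : 0;
print "U+2019 falls in the class: $in_class\n";
```

Without the braces, \x stops after two hex digits and the remaining digits are taken as ordinary characters.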

        Thanks -- so what does sprintf('[U+%04X]', ord($1)) do, and why is it coming out as What[U+2019]s new?
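One way to see what the replacement does is to pull it out into a named sub and run it on a short string. This is a sketch only: visible is a hypothetical helper name, and the sample string stands in for the decoded page content.

```perl
use strict;
use warnings;

# ord() gives the character's code point as a number; sprintf's %04X
# prints that number in uppercase hex, zero-padded to at least 4 digits.
sub visible {
    my ($char) = @_;
    return sprintf('[U+%04X]', ord($char));
}

my $content = "What\x{2019}s new";

# /e runs the replacement as Perl code, /g repeats it for every match,
# and /x lets the pattern contain whitespace for readability.
(my $shown = $content) =~
    s{ ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) }{ visible($1) }gex;

print "$shown\n";   # prints: What[U+2019]s new
```

So the smart quote is code point 0x2019, and the substitution swaps the character itself for a readable label containing that number.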

        Ok I think I've gotten this working in my own way (that my novice brain can understand):

        utf8::decode($content);
        while($content =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/){
            my $char = '&#' . ord($1) . ';';
            $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/$char/;
        }

        I'm sure there's a more efficient way to do this, but the resulting &#xxxx; structure seems to work. And the important thing is the regex should be finding all those weirdo characters. Thank you!
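The loop above rescans the string from the start for every character it replaces. A single substitution with /g and /e should give the same result in one pass. This is a sketch under that assumption; to_entities is a hypothetical helper name.

```perl
use strict;
use warnings;

# Replace each matched character with a decimal numeric character
# reference in one pass. Assumes the text is already decoded.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print to_entities("What\x{2019}s new"), "\n";   # prints: What&#8217;s new
```

The /e flag evaluates the replacement as an expression for each match, so there is no need for the separate while test or the temporary $char variable.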
