
Re^2: Faster utf8 escaping.

by kyle (Abbot)
on Apr 08, 2009 at 01:05 UTC (#756200)


in reply to Re: Faster utf8 escaping.
in thread Faster utf8 escaping.

Thank you! I like this solution. You're right, core modules can often get you most of the way to where you want to go.

That said, it needed some modification to work with the larger battery of tests I have locally. The first thing I noticed is that Encode doesn't always produce a four-digit hex entity. I took care of that by allowing the regular expression to match two or four characters. Then came the string that encoded as "Mar&#xed;a; F", so I made the "characters" match only hex digits.

Now the replacement looks like this:

$s =~ s/&#x([a-f0-9]{2})?([a-f0-9]{2});/'\\u' . ($1||'00') . $2/ieg;
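For comparison, here is a minimal sketch of a variant that accepts one to four hex digits and pads with sprintf rather than splitting the match into two groups. This is an assumption about what the entities can look like (an unpadded three-digit one such as &#x142; would slip past the two-or-four pattern), and the name entities_to_u_escapes is made up for illustration. It does nothing about the literal-ampersand problem, and it leaves references wider than four digits untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite &#xH; .. &#xHHHH; references as \uHHHH escapes.
# References with more than four hex digits are left as they are.
sub entities_to_u_escapes {
    my ($s) = @_;
    # hex() parses the captured digits; sprintf pads to four lowercase digits.
    $s =~ s/&#x([0-9a-f]{1,4});/sprintf '\\u%04x', hex $1/ieg;
    return $s;
}

print entities_to_u_escapes("Mar&#xed;a; F"), "\n";
```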

That still doesn't take care of the problem that ikegami raises. I figure I can preprocess with s/&/&x/g and then reverse it with s/&x/&/g when it's done (or something), but then we're back up to five full scans over the input string. It might still be faster, but I haven't tested that yet.
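A rough sketch of that round trip, untested for speed. It assumes the entities come from Encode::encode with the FB_XMLCREF fallback, reuses the replacement shown earlier in this post, and wraps everything in a made-up sub named escape_non_ascii. The point is only to show that a literal "&#xed;" in the input survives, because the protected form "&x#xed;" no longer matches the pattern.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper around the protect / encode / replace / restore steps.
sub escape_non_ascii {
    my ($s) = @_;
    $s =~ s/&/&x/g;    # protect ampersands already present in the input
    # Non-ASCII characters become &#x...; references (assumed fallback).
    $s = Encode::encode('ascii', $s, Encode::FB_XMLCREF);
    # The replacement from this post: two or four hex digits only.
    $s =~ s/&#x([a-f0-9]{2})?([a-f0-9]{2});/'\\u' . ($1||'00') . $2/ieg;
    $s =~ s/&x/&/g;    # restore the protected ampersands
    return $s;
}

print escape_non_ascii("AT&T Mar\x{ed}a &#xed;"), "\n";
```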

Thanks again.
