http://www.perlmonks.org?node_id=462670

newrisedesigns has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have a script that gets data from Google AdSense. The data is in Unicode (UTF-16, I believe). When I try to pattern match on the data, only single-character patterns match. Any pattern that looks for two or more characters in sequence fails.

A typical line looks like:

5/18/05     184     7       3.8%    6.14    1.13

Matching \d works, but attempting to match \d{2}, \d+\/ or anything else that catches two characters in sequence fails. I take it this is because UTF-16 uses at least two bytes per character, so the digits are not adjacent in the raw data.
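The symptom can be reproduced with a minimal sketch. It assumes the data is UTF-16LE (I have not confirmed the byte order) and builds a stand-in string with the core Encode module rather than using the real AdSense download:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the raw data: the first field of the sample line,
# encoded as UTF-16LE (the byte order is an assumption)
my $raw = encode('UTF-16LE', "5/18/05");

# Show the bytes with NULs made visible; each ASCII character
# is followed by a NUL byte, so no two digits sit side by side
(my $visible = $raw) =~ s/\0/\\0/g;
print "$visible\n";

print "a single digit matches\n"    if     $raw =~ /\d/;
print "two digits in a row do not\n" unless $raw =~ /\d{2}/;
```

The patterns are being applied to undecoded bytes here, which is what I suspect my script is doing.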

I'm only extracting data from this Unicode text, and do not need to output Unicode. Why don't the regexps work? If they're not supposed to work, how can I convert the text to ISO-8859-1/Latin1? I tried converting it with iconv (using -f UTF-16 -t UTF-8), but to no avail: the output was still UTF-16 regardless of the arguments.
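One approach that might avoid iconv entirely is to let Perl decode the file as it is read. The following is only a sketch under assumptions: the temporary file stands in for the real report, it is written as UTF-16 with a byte order mark, and the fields are taken to be whitespace-separated as in the sample line.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Stand-in for the downloaded report: one line written as UTF-16 with a BOM.
# The real file name and its exact layout are not known here.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} encode('UTF-16', "5/18/05\t184\t7\t3.8%\t6.14\t1.13\n");
close $tmp or die "close failed: $!";

# The :encoding layer turns the bytes into Perl characters while reading,
# so patterns see "18" as two adjacent digits
open my $fh, '<:encoding(UTF-16)', $path or die "Cannot open $path: $!";
my @rows;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ m{^\d+/\d+/\d+};   # multi-character pattern
    push @rows, [ split ' ', $line ];       # split on runs of whitespace
}
close $fh;

print join('|', @{ $rows[0] }), "\n";
```

Since the extracted values are plain ASCII digits and punctuation, no further conversion to Latin1 should be needed once the text has been decoded.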

Thanks in advance for your help.