PerlMonks  

A Regex for no-break space Unicode Entities

by kettle (Beadle)
on Sep 13, 2006 at 07:44 UTC (#572690=perlquestion)
kettle has asked for the wisdom of the Perl Monks concerning the following question:

I have just spent an extremely trying 30 minutes or so attempting to regex out no-break space Unicode entities, which appear in my very large raw text file as \302\240. I was about to post a request for help when I figured out a solution. It may not be the best one, but I could not find anything concise on the web that solved my problem, and I suppose others have, or will have, the same problem. So here's a very short, very simple solution:

    #!/usr/bin/perl -w
    use warnings;
    use strict;

    binmode(STDIN,":bytes");
    binmode(STDOUT,":bytes");

    while(<>){
        chomp;
        s/\302\240//g;
        s/\s+/ /g;
        print $_."\n";
    }


This completely solved my problem. If it is incomplete, or not a very clever thing to do, please improve it. If it solves somebody else's problem as well - GREAT! joe

2006-09-14 Retitled by planetscape, as per Monastery guidelines


Original title: 'Annoying Problem: solved'

Re: A Regex for no-break space Unicode Entities
by bart (Canon) on Sep 13, 2006 at 09:51 UTC
    If your file contains "\302\240" for chr(160), that means to me that the file is in UTF8. So if you'd binmode the source file as ":utf8", then you could just scan for "\240". In theory, it's a better (= fewer possible nasty surprises) solution.

    Somehow, I don't think replacing every nbsp with nothing is such a good idea. I'd leave a space in its place. Otherwise, you'd end up joining words into one, that should remain separate.
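As a rough illustration of the two suggestions above (decode the input as UTF-8, and leave a space where the nbsp was), here is a minimal sketch. The helper name `normalize_spaces` and the sample string are made up for this example and do not come from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 bytes, returns cleaned UTF-8 bytes.
sub normalize_spaces {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ tr/\x{A0}/ /;                # each nbsp becomes an ordinary space
    $text =~ s/[ \t]+/ /g;                # collapse runs of spaces and tabs
    return encode('UTF-8', $text);        # characters -> bytes
}

# "foo", nbsp, "bar", two spaces, "baz"
print normalize_spaces("foo\302\240bar  baz\n");   # prints "foo bar baz"
```

Deleting the nbsp instead of replacing it would have produced "foobar baz", which is the word-joining problem mentioned above.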

      In theory, it's a better (= fewer possible nasty surprises) solution.
      I see more possible nasty surprises. Can you elaborate?
        I see more possible nasty surprises. Can you elaborate?
        Huh? Can you elaborate?

        The theoretical danger is that by matching individual bytes instead of characters, you might inadvertently match bytes that actually belong to other characters. And by changing just a few bytes instead of the whole sequence making up a character, you might even be creating invalid UTF8.

        Of course, one of the reasons for the popularity of UTF8 (as opposed to Windows native "2 bytes for each character") is that it's resyncing: it's always possible to recognize start and continuation bytes for multibyte characters, so this problem isn't as serious as it could have been using other multibyte character representations.

        In the Latin-1 range, nbsp (160) is the only almost-whitespace character with a character code of 128 or above that I know of. (Unicode as a whole does have others, such as U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE.) So for this particular application, you're probably in the clear.

        Still, there's danger lurking in treating byte sequences in a different manner than intended — thus, treating UTF8 as a byte sequence.
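To make the byte-versus-character danger concrete, here is a small sketch. It assumes a sample string in which the byte \240 occurs twice: once as the second byte of "à" (\303\240) and once as the second byte of nbsp (\302\240). The helper `is_valid_utf8` is a hypothetical name introduced only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "voilà", nbsp, "tout" as UTF-8 bytes
my $bytes = "voil\303\240\302\240tout";

# Byte-level edit that strips every \240 byte, whatever character it belongs to.
(my $broken = $bytes) =~ tr/\240//d;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;   # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Character-level edit: decode, change the character, encode again.
my $text = decode('UTF-8', $bytes);
$text =~ tr/\x{A0}/ /;
my $fixed = encode('UTF-8', $text);

print is_valid_utf8($bytes),  "\n";   # 1
print is_valid_utf8($broken), "\n";   # 0: the start bytes \303 and \302 are left dangling
print is_valid_utf8($fixed),  "\n";   # 1
```

Matching the full pair \302\240, as the original script does, avoids this particular trap; the sketch only shows what can go wrong when a single byte is targeted.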

Re: A Regex for no-break space Unicode Entities
by graff (Chancellor) on Sep 13, 2006 at 13:00 UTC
    bart is right -- this is a cleaner, safer way:
    #!/usr/bin/perl -w
    use warnings;
    use strict;

    binmode(STDIN,":utf8");
    binmode(STDOUT,":utf8");

    while(<>) {
        # if you just want to get rid of non-breaking spaces, do this:
        tr/\xA0/ /;

        # if you really want to change every kind of whitespace and every string
        # of two or more whitespace to a single space, do this instead:
        s/\s+/ /g;  # in utf8 strings, \s matches non-breaking space
        s/ $/\n/;   # (puts back the \n at the end of the line)
        print;
    }
    (updated to remove incorrect use of "g" modifier on tr///)
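Whether \s matches a non-breaking space depends on whether Perl sees the string as decoded characters or as raw bytes. The sketch below assumes a plain script with no `use locale`, no `/u` modifier, and no `unicode_strings` feature; `count_ws` is a hypothetical helper written only for this comparison.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw_line = "foo\302\240bar";              # UTF-8 bytes, not decoded
my $decoded  = decode('UTF-8', $raw_line);    # character string containing U+00A0

# Hypothetical helper: count how many times \s matches in a string.
sub count_ws {
    my ($s) = @_;
    my $n = () = $s =~ /\s/g;
    return $n;
}

print count_ws($raw_line), "\n";   # 0: the bytes \302 and \240 are not whitespace
print count_ws($decoded),  "\n";   # 1: U+00A0 matches \s in a decoded string
```

So if the input never actually gets decoded, s/\s+/ /g will silently leave the \302\240 bytes alone.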
      # of two or more whitespace to a single space, do this instead:
      s/\s+/ /g; # in utf8 strings, \s matches non-breaking space

      I read this on a webpage somewhere, but for one reason or another, it did not produce the desired results. The binmode utf8 thing did not work either. Though more unpredictable, and for reasons I cannot completely explain, the byte mode solution was the only one I could get to produce the desired results.
        but for one reason or another, it did not produce the desired results.

        It would be neat if you could show a minimal self-contained example to demonstrate this. It could be you were still missing something simple, like you did binmode STDIN, ":utf8"; but then actually read your input from some other file handle (e.g. ARGV), instead of actually piping or redirecting data to the script. And seeing what the results actually were could help as well.
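One way to check the file-handle theory: when a file name is given on the command line, `<>` reads through ARGV, so a binmode on STDIN never touches that data. The sketch below sidesteps the issue by opening the file explicitly with an encoding layer. The temporary file and the helper `first_line_length` are assumptions made for this demonstration, not part of anyone's actual script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 test file containing a no-break space (bytes \302\240).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} "foo\302\240bar\n";
close $tmp_fh or die "close: $!";

# Hypothetical helper: read the first line of a file through the given layer
# and return its length as Perl sees it.
sub first_line_length {
    my ($path, $layer) = @_;
    open my $fh, "<$layer", $path or die "open $path: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return length $line;
}

print first_line_length($tmp_name, ':raw'), "\n";              # 8 (bytes)
print first_line_length($tmp_name, ':encoding(UTF-8)'), "\n";  # 7 (characters)
```

If the second number came out as 8 in a real script, the input would not be getting decoded, and tr/\xA0/ / would have nothing to match.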

Re: A Regex for no-break space Unicode Entities
by kettle (Beadle) on Sep 13, 2006 at 13:38 UTC
    Both comments: thanks, and very true. It is definitely better to change it to an ordinary whitespace character first, then do the subsequent reduction. For my data this didn't happen to be a problem *thankfully* but in general that is definitely a better practice, and what I would have implemented had it occurred to me at the time. Thanks!
