Mysterious Whitespaces between each character in a file

by 1wax (Novice)
on Oct 08, 2009 at 13:35 UTC ( #799963=perlquestion )
1wax has asked for the wisdom of the Perl Monks concerning the following question:

My script reads lines from an MOF file and prints them to STDOUT, but the output has whitespace between each character.
Using s/\s+//g does not remove the spaces.

Because of the spaces it's difficult to match any of the lines against values within the script.
Does anybody know how the file can be read and printed without the extra spaces?
Running the type command on the file does not show anything untoward.

Perl version is 5.6.1 on Windows 2k3.

#pragma namespace("\\\\.\\Root\\HewlettPackard\\openview\\data")
instance of OV_NodeGroup
Caption = "xxxxx";
Description = "xxxxx";
GraphCategory = "";
. . .

Re: Mysterious Whitespaces between each character in a file
by Anonymous Monk on Oct 08, 2009 at 13:41 UTC
    It is probably a UCS-2 file.
Re: Mysterious Whitespaces between each character in a file
by Unforgiven (Hermit) on Oct 08, 2009 at 14:07 UTC
    Have you tried printing out the file and looking at it with a hex editor? Maybe it'll help knowing exactly what that character is, then you could try tracking down where it's coming from (or just regex it out).

      Agreed. One possibility is that it contains a non-breaking space (byte 0xA0 in Latin-1; it is not part of ASCII). On a plain byte string, /\s/ does not match this. Looking at the data with a hex editor will tell you if this is so.
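      If no hex editor is at hand, a few lines of Perl can show the first bytes of the file. This is only a sketch: the helper name hex_bytes and the script name are made up for illustration, and while it uses nothing beyond core features, it is untested on 5.6.1. A file starting with "ff fe" followed by alternating text bytes and "00" would point to UTF-16LE/UCS-2le with a byte order mark.

```perl
use strict;
use warnings;

# Return the first $count bytes of a byte string as space-separated hex pairs.
# (hex_bytes is a hypothetical helper name, not part of any module.)
sub hex_bytes {
    my ($bytes, $count) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', substr($bytes, 0, $count);
}

# Usage: perl hexpeek.pl file.mof   (hexpeek.pl is a hypothetical script name)
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $fh, '<', $file) or die "Can't open $file: $!";
    binmode($fh);                     # no CRLF translation on Windows
    read($fh, my $buf, 32);           # the first 32 bytes are enough to see the pattern
    close($fh);
    print hex_bytes($buf, 32), "\n";
}
```

      For a UTF-16LE file that begins with "#p", the output would start with ff fe 23 00 70 00.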

Re: Mysterious Whitespaces between each character in a file
by ikegami (Pope) on Oct 08, 2009 at 14:07 UTC
    The file is encoded using UCS-2le:
    open(my $fh, '<:encoding(UCS2-le)', $fn) or die "Can't open $fn: $!";

    You'll need 5.8 or higher for the above command. Perl 5.6 didn't support Unicode and encodings well. Keep in mind that 5.6.1 is 8.5 years old, 5.8 is no longer maintained and 5.10.1 is out. Sorry, I can't help you with a 5.6 solution.

    Update: Added last paragraph
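
    For readers on 5.8 or later, here is a slightly fuller sketch of the encoding-layer approach. Assumptions, none of which come from the original post: the helper name read_utf16le_lines is invented, UTF-16LE is used as a superset of UCS-2le, the file may start with a byte order mark, and the :raw ... :crlf stacking is a commonly used idiom to keep Windows line-ending translation from running on the undecoded bytes.

```perl
use strict;
use warnings;

# Read a UTF-16LE text file and return its lines as decoded strings.
# (read_utf16le_lines is a hypothetical helper name.)
sub read_utf16le_lines {
    my ($path) = @_;

    # :raw first, so the decoder sees untouched bytes;
    # :crlf last, so "\r\n" is translated on the decoded text.
    open(my $fh, '<:raw:encoding(UTF-16LE):crlf', $path)
        or die "Can't open $path: $!";
    my @lines = <$fh>;
    close($fh);

    chomp @lines;
    $lines[0] =~ s/^\x{FEFF}// if @lines;   # drop a leading byte order mark, if any
    return @lines;
}

# Usage: perl readmof.pl file.mof   (hypothetical script name)
# Plain printing is fine for ASCII data; other characters would need
# an output encoding set on STDOUT.
if (@ARGV) {
    print "$_\n" for read_utf16le_lines($ARGV[0]);
}
```

    Once decoded this way, the lines are ordinary strings, so matches such as /^instance of/ behave as expected.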

Re: Mysterious Whitespaces between each character in a file (hack for 5.6.x)
by almut (Canon) on Oct 08, 2009 at 15:57 UTC

    (Presuming the file actually is in UCS-2le or UTF-16le encoding (which is likely) ...)

    If you need/want to stick with 5.6.1, you could use the following crude hack:

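    $/ = "\n\0";    # a line feed as it appears in UTF-16le
    while (my $line = <>) {
        # read 16-bit little-endian values, keep only the low byte of each
        print pack("C*", map $_ & 0xff, unpack("v*", $line));
    }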

    This simply drops the high byte of each 16-bit character. The high bytes are what appear as extra "spaces": they are actually zero bytes for all chars with ordinal value <= 0xff. As the sample text you've shown only seems to contain plain ASCII characters, this approach should work pretty well.
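
    To make the effect concrete, here is a small sketch that applies the same pack/unpack step to literal byte strings. The helper name low_bytes and the sample data are invented for illustration. It also shows the limit of the hack: a character above 0xff loses its high byte and comes out as an unrelated one.

```perl
use strict;
use warnings;

# Keep only the low byte of each 16-bit little-endian value.
# (low_bytes is a hypothetical helper name.)
sub low_bytes {
    my ($utf16le) = @_;
    return pack 'C*', map { $_ & 0xff } unpack 'v*', $utf16le;
}

my $ascii = "H\0i\0\n\0";    # "Hi\n" encoded as UTF-16LE
my $euro  = "\xAC\x20";      # U+20AC EURO SIGN encoded as UTF-16LE

print low_bytes($ascii);                        # prints "Hi" and a newline
printf "%02x\n", ord( low_bytes($euro) );       # prints "ac": the 0x20 high byte is lost
```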

    Another option with 5.6.1 would be the module Unicode::String:

    use Unicode::String qw(utf16le);

    $/ = "\n\0";
    while (my $line = <>) {
        print utf16le($line)->latin1();
        # or, if you want UTF-8 output:
        # print utf16le($line)->utf8();
    }

    The problem with Unicode::String is that it doesn't ship with 5.6.1 by default, so you'd somehow have to get hold of it (for v5.6.1!), or build it yourself. OTOH, as Unicode::String is an XS module that needs a working compiler environment set up, etc., I would not recommend the latter (unless you're familiar with the procedure...). It's most likely easier to use the crude hack...

    (I tried both approaches with an old perl-5.6.0, so I'm pretty sure they should work with 5.6.1, too)

      $/="\n\0"; will fail if the file contains character U+0Axx followed by U+yy00 (for any values "xx" and "yy").

      Also, you should replace characters outside iso-latin-1 with some fixed character (such as "?") rather than some random character.

      This fixes both problems:

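      local $/ = "\x0A\x00";
      my $line = '';
      while ( defined( my $chunk = <> ) ) {
          $line .= $chunk;
          # Odd length: the separator matched across a character boundary,
          # so keep appending until the record ends on a real line feed.
          next if length($line) % 2 != 0;
          print pack 'C*', map { $_ <= 0xFF ? $_ : ord('?') } unpack 'v*', $line;
          # -or-
          # print utf16le($line)->latin1();
          $line = '';
      }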

      (Assumes each file in @ARGV is properly formed, i.e. contains an even number of bytes.)

        $/="\n\0"; will fail if the file contains character U+0Axx followed by U+yy00

        This is correct (the same holds for "\x0A\x00", btw).  However, as U+0Axx is Gurmukhi/Gujarati, this is rather unlikely to happen in the OP's case... (Also, the nature of a "crude hack", as I called it, is that it works in most practical cases but isn't theoretically failsafe.)
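
        The false match can be seen on a tiny constructed example. This is a sketch with made-up data and a made-up helper name (separator_offsets): it lists every byte offset where "\x0A\x00" occurs and uses the parity of the offset to tell a real line feed (even, aligned to a 16-bit character) from a match that straddles two characters (odd).

```perl
use strict;
use warnings;

# Byte offsets at which "\x0A\x00" occurs in a UTF-16LE byte string.
# (separator_offsets is a hypothetical helper name.)
sub separator_offsets {
    my ($bytes) = @_;
    my @offsets;
    my $pos = 0;
    while ( ( my $i = index($bytes, "\x0A\x00", $pos) ) >= 0 ) {
        push @offsets, $i;
        $pos = $i + 1;
    }
    return @offsets;
}

# U+0A05 (a Gurmukhi letter), then U+4100, then a real line feed U+000A
my $data = "\x05\x0A" . "\x00\x41" . "\x0A\x00";

for my $i ( separator_offsets($data) ) {
    printf "offset %d: %s\n", $i,
        $i % 2 ? 'false match (odd)' : 'real line feed (even)';
}
```

        This prints "offset 1: false match (odd)" followed by "offset 4: real line feed (even)". A reader with $/ set to "\x0A\x00" would cut the record after offset 1, in the middle of a character.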
