PerlMonks  

How to tell if a stream is already in UTF8 mode?

by perl-diddler (Hermit)
on Jan 02, 2014 at 18:07 UTC (#1068999, perlquestion)
perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

I have a lower-level routine that can be passed a file handle to do output on.

How can my lower-level routine know what mode the stream is in (binary or Unicode text)? If the stream is in Unicode mode, the routine can output Unicode chars as '>8-bit values'; if it is in 'binary' mode, the routine can encode such values as UTF-8 itself, so that perl sees only bytes on the stream (and does not complain about 'wide chars in output' and then do the conversion for me, which seems to be its current behavior).

I.e., if my lower-level routine is about to print 'pi' and the stream is a Unicode stream, I'd print "\x{3c0}", but if it is a binary stream, I'd print "\xcf\x80", so the output would show 'π' in either case, without warnings.
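
To make the two cases concrete, here is a minimal sketch of how the character string and its UTF-8 byte form relate. It assumes the core Encode module; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $pi_chars = "\x{3c0}";                    # one character, code point U+03C0
my $pi_bytes = encode('UTF-8', $pi_chars);   # the same character as UTF-8 octets

# Report how many characters and how many octets the two forms contain.
printf "characters: %d, bytes: %d\n", length($pi_chars), length($pi_bytes);
print "byte form matches \\xcf\\x80\n" if $pi_bytes eq "\xcf\x80";
```

The first form is what a Unicode-mode stream wants to receive; the second is what a binary stream needs in order to show the same glyph.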

I don't see a documented way of determining the current mode of a stream -- I need a "query-format" of the binmode directive so my 'blind' subroutine can _try_ to generate correct output in the face of unknown or random (user-based) input.

Thanks! ;-)

Re: How to tell if a stream is already in UTF8 mode?
by VincentK (Beadle) on Jan 02, 2014 at 20:49 UTC
Re: How to tell if a stream is already in UTF8 mode?
by aitap (Deacon) on Jan 02, 2014 at 21:53 UTC

    To me, this looks like an XY Problem. It doesn't sound like a good thing to make a "lower level routine" distinguish between different kinds of file handles with different IOLayers applied to them. Where does your routine get filehandles from? Why are they opened in different modes? Wouldn't it be better to binmode the handle with an appropriate IOLayer, such as :utf8 or :encoding(...), at a higher level of subroutines?

    Anyway, you can use PerlIO::get_layers($fh) to get the list of layer names on a filehandle (see the PerlIO perldoc page). Checking whether a filehandle is in UTF-8 mode then reduces to grepping that list for the string "utf8" (usually the last element).
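
    A minimal sketch of that check follows. The helper name handle_is_utf8 is hypothetical, and the two in-memory handles are there only so the example needs no real files. It asks for the output side of the handle, since the routine in question writes to it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PerlIO): true if the handle's output
# side reports the "utf8" pseudo-layer.
sub handle_is_utf8 {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);
    return scalar(grep { $_ eq 'utf8' } @layers);
}

# Two in-memory handles, used here only to keep the sketch self-contained.
open(my $raw, '>', \my $raw_buf) or die "open: $!";
binmode($raw);                       # plain byte stream
open(my $txt, '>', \my $txt_buf) or die "open: $!";
binmode($txt, ':utf8');              # character stream

# Show the verdict and the layer list for each handle.
for my $pair (['raw', $raw], ['txt', $txt]) {
    my ($name, $fh) = @$pair;
    printf "%s: %s (layers: %s)\n", $name,
        handle_is_utf8($fh) ? 'utf8' : 'bytes',
        join(' ', PerlIO::get_layers($fh, output => 1));
}
```

    The exact layer names printed depend on the platform and on how the handle was opened, so only the presence or absence of "utf8" should be relied on.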

      Where does your routine get filehandles from? Why are they opened in different modes?

      It's a lower-level library formatting routine. Think of asking, about "printf FH ...": "Where does printf get its file handles from? Why would printf get FHs opened in different modes?"

      It gets the FH from user programs, where the FH may be STDOUT, STDERR, or any other opened destination. By the time printf gets it, it doesn't know if the FH was set for unicode or binary. The lower level layers 'know', and will emit a warning if they detect chars > 255 on a stream NOT marked as UTF8, AND will not encode chars between 128 - 255, as UTF8 unless the stream was previously marked as UTF8.

      It doesn't sound like a good thing to make a "lower level routine" distinguish between different kinds of file handles with different IOLayers applied to them.

      The problem isn't that it is a lower-level routine, but that it isn't "low enough": the lower I/O layers know whether binmode has been called on the stream.

      Just guessing now, but get_layers is likely the way, combined with a loop over the layer names. Matching on the first character to eliminate candidates early, then checking whether the name (UTF-8 or utf8) is in a hash, might give the cheapest check; the result could then be cached as the state for that stream.
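
      One way the cached check could look, as a sketch only: out_is_utf8, emit and %utf8_cache are hypothetical names, and a plain grep stands in for the first-character and hash matching described above. The cache assumes nobody changes the handle's layers after the first lookup.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Scalar::Util qw(refaddr);

my %utf8_cache;   # hypothetical per-handle cache, keyed by reference address

sub out_is_utf8 {
    my ($fh) = @_;
    my $key = refaddr($fh) // "$fh";
    # Caveat: the cached answer goes stale if the caller later changes
    # layers with binmode, and addresses can be reused once a handle is freed.
    $utf8_cache{$key} //=
        (grep { $_ eq 'utf8' } PerlIO::get_layers($fh, output => 1)) ? 1 : 0;
    return $utf8_cache{$key};
}

# Hypothetical print wrapper: characters for utf8 handles, octets otherwise.
sub emit {
    my ($fh, $text) = @_;
    return print {$fh} out_is_utf8($fh) ? $text : encode('UTF-8', $text);
}

# In-memory handles stand in for whatever the caller passes.
open(my $bin, '>', \my $bin_buf) or die "open: $!";
binmode($bin);
open(my $uni, '>', \my $uni_buf) or die "open: $!";
binmode($uni, ':utf8');

emit($_, "\x{3c0}") for $bin, $uni;
close($_) for $bin, $uni;

print "both buffers hold the same bytes\n" if $bin_buf eq $uni_buf;
```

      Keying on the reference address keeps the lookup cheap, at the cost of the staleness noted in the comments.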

      It's a one way trip -- i.e. if the routine detects > 255-valued chars in the stream, it knows the stream "needs" to be in utf8 mode, but there aren't any single-byte values that would force a reverse (since all bytes can be part of a UTF-8 encoded data stream).

      Thanks for the pointer to get_layers...it's not documented on its own manpage...

        The lower level layers 'know', and will emit a warning if they detect chars > 255 on a stream NOT marked as UTF8, AND will not encode chars between 128 - 255, as UTF8 unless the stream was previously marked as UTF8.

        The lower level always expects bytes. (Files are blocks/streams of bytes.) It will ALWAYS emit a warning if it detects chars >255.
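
        One way to observe this, as a sketch with in-memory handles and a warning hook (the variable names are illustrative): the same character is printed to a handle with no :utf8 layer and to one with it, and the warnings raised are collected.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

open(my $bytes, '>', \my $bytes_buf) or die "open: $!";
binmode($bytes);                 # no :utf8 layer: print must hand over bytes
open(my $chars, '>', \my $chars_buf) or die "open: $!";
binmode($chars, ':utf8');        # :utf8 layer: characters are encoded on the way down

print {$bytes} "\x{3c0}";        # should trigger "Wide character in print"
print {$chars} "\x{3c0}";        # should not warn
close($_) for $bytes, $chars;

printf "warnings seen: %d\n", scalar @warnings;
print "first warning: $warnings[0]" if @warnings;
```

        With a :utf8 layer in place the character is already encoded by the time it reaches the byte level, so only the handle without the layer is expected to warn.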

Re: How to tell if a stream is already in UTF8 mode?
by ikegami (Pope) on Jan 05, 2014 at 18:43 UTC
    Your design is awful, but you can use PerlIO::get_layers($fh) to look for the utf8 layer.
