PerlMonks  

Re^2: Cleaning up non 7-bit Ascii Chars for XML-processing

by liverpole (Monsignor)
on Nov 11, 2010 at 19:20 UTC (#870919)


in reply to Re: Cleaning up non 7-bit Ascii Chars for XML-processing
in thread Cleaning up non 7-bit Ascii Chars for XML-processing

Okay, thanks for the information.

But this line:

$field =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;

Won't that convert anything greater than or equal to a "space" (ascii 0x20) up to 0xfe?  And why are you skipping 0xff?

And other than those things, isn't that really equivalent to what I was doing?  Except that, it appears you are outputting &#x92; where I was outputting &#92;, and are those in fact equivalent?

Update:  I just realized I missed the fact that your regex has '^', so it negates those characters.  That makes a lot more sense.
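To make the negation concrete, here is a minimal sketch of that substitution wrapped in a helper. The name escape_non_printable is hypothetical and used only for illustration; the regex is the line quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one quoted in the post.
sub escape_non_printable {
    my ($field) = @_;
    # [^\x20-\x7E] matches every character that is NOT printable ASCII
    # (space through tilde), so only those characters are replaced with
    # hexadecimal character references.
    $field =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
    return $field;
}

# Printable characters pass through; 0x92, tab and DEL are escaped.
print escape_non_printable("it\x92s a tab:\t and DEL:\x7F"), "\n";
# prints: it&#x92;s a tab:&#x9; and DEL:&#x7F;
```

Everything from 0x20 to 0x7E is left alone, which is why the space and the letters survive unchanged.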

But I would still ask, isn't the functionality of this subroutine the same as my original (other than the extra 'x' in your output "#xNN;", which I still think is wrong), or am I missing something else?


s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/


Re^3: Cleaning up non 7-bit Ascii Chars for XML-processing
by ikegami (Pope) on Nov 11, 2010 at 20:55 UTC
    00-1F and 7F aren't printable, so I escape those too.

    And other than those things, isn't that really equivalent to what I was doing?

    Yes, it's the same except where it's not.

    Except that, it appears you are outputting &#x92; where I was outputting &#92;, and are those in fact equivalent?

    So you still haven't fixed your bug. Start with that.

    and are those in fact equivalent?

    No. I guess that's yet another bug.

    • You're not decoding your inputs. (Not yet fixed.)
    • You're not encoding your outputs. (Not yet fixed.)
    • You confuse 92 hex for 92 decimal. (Fixed by using the function I posted.)
    • You're not outputting 7-bit clean as desired. (Fixed by using the function I posted.)
      ...but it seems that I misread. I thought you were generating the XML.
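The hex/decimal mix-up in the list above can be illustrated with a short sketch. The helper names hex_ref and dec_ref are hypothetical; the point is only that the same byte gets different digits depending on the radix, and that a reference written as decimal 92 names a different character altogether.

```perl
use strict;
use warnings;

# Hypothetical helpers: build a hex or a decimal character reference.
sub hex_ref { my ($c) = @_; return sprintf("&#x%X;", ord($c)); }
sub dec_ref { my ($c) = @_; return sprintf("&#%d;",  ord($c)); }

my $char = "\x92";               # the byte in question, 0x92 == 146 decimal

print hex_ref($char), "\n";      # prints: &#x92;
print dec_ref($char), "\n";      # prints: &#146;

# A reference written as &#92; is decimal 92, i.e. 0x5C, the backslash.
print chr(92), "\n";             # prints: \
```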

      The XML is always output as "UTF-8"

      No it isn't.

      "’" is "E2 80 99" in UTF-8.
      "’" is "92" in cp1252.
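These two byte sequences can be checked with the Encode module. This is a small sketch; hex_bytes is a hypothetical helper added here only to print the bytes.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $quote = "\x{2019}";    # RIGHT SINGLE QUOTATION MARK as a decoded character

print hex_bytes(encode('UTF-8',  $quote)), "\n";   # prints: E2 80 99
print hex_bytes(encode('cp1252', $quote)), "\n";   # prints: 92

# The other direction: decoding the cp1252 byte gives back U+2019.
my $byte = "\x92";
printf "U+%04X\n", ord(decode('cp1252', $byte));   # prints: U+2019
```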

      You've indicated you have the latter.
      You've indicated the document claims to be the former (implicitly).

      You can either fix the encoding, or fix what the XML says the encoding is. The former is easier.

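      use strict;
      use warnings;
      use Encode qw( encode decode );

      # Escape the characters that are special in XML.
      sub fix_broken_text {
          my ($self, $field) = @_;
          $field =~ s/&/&amp;/g;    # must come first so later entities aren't re-escaped
          $field =~ s/</&lt;/g;
          $field =~ s/>/&gt;/g;
          $field =~ s/"/&quot;/g;
          $field =~ s/'/&#39;/g;
          return $field;
      }

      # Read the whole file as raw bytes, then decode it from cp1252.
      # $xml_qfn is assumed to hold the path to the XML file.
      my $decoded_xml;
      {
          open(my $fh, '<', $xml_qfn) or die "Can't open $xml_qfn: $!";
          binmode($fh);
          local $/;    # slurp mode
          $decoded_xml = decode('cp1252', scalar(<$fh>));
      }

      ...Try to fix problems with unescaped characters...

      # Encode to UTF-8 so the bytes match what the document claims.
      my $encoded_xml = encode('UTF-8', $decoded_xml);

      ...Pass $encoded_xml to parser...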

      If only parts are cp1252,

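      use strict;
      use warnings;
      use Encode qw( encode decode );

      # Decode one cp1252 field, escape XML specials, and return UTF-8 bytes.
      sub fix_broken_text {
          my ($self, $field) = @_;
          $field = decode('cp1252', $field);
          $field =~ s/&/&amp;/g;    # must come first so later entities aren't re-escaped
          $field =~ s/</&lt;/g;
          $field =~ s/>/&gt;/g;
          $field =~ s/"/&quot;/g;
          $field =~ s/'/&#39;/g;
          $field = encode('UTF-8', $field);
          return $field;
      }

      # Read the whole file as raw bytes; only the broken fields get re-encoded.
      # $xml_qfn is assumed to hold the path to the XML file.
      my $encoded_xml;
      {
          open(my $fh, '<', $xml_qfn) or die "Can't open $xml_qfn: $!";
          binmode($fh);
          local $/;    # slurp mode
          $encoded_xml = <$fh>;
      }

      ...Try to fix problems with unescaped characters...

      ...Pass $encoded_xml to parser...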
