
Re^2: replace string

by sandy1028 (Sexton)
on May 18, 2009 at 09:21 UTC (#764607)

in reply to Re: replace string
in thread replace string

How can I convert XHTML entity characters to XML form?

Replies are listed 'Best First'.
Re^3: replace string
by Anonymous Monk on May 18, 2009 at 09:24 UTC
    No need, XHTML is XML
      The input is
      <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
      and the string should be converted to
      <b></b>Officially called <>&#8220;events,&#8221;</a> as "never events"
      How can I convert only such characters?
        use strict;
        use warnings;

        my $input = q{<b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"};
        print "input: $input\n";

        # change the lines
        $input =~ s/&ldquo;/&#8220;/g;
        $input =~ s/&rdquo;/&#8221;/g;
        $input =~ s/&rsquo;/&#8217;/g;
        print "processed input: $input\n";


        input: <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
        processed input: <b></b>Officially called <>&#8220;events,&#8221;</a> as "never events"

        This was done using your REs and appears to provide the output you are looking for.

        A better idea would be to provide a hex dump of the data, so we know what the actual bytes are:
        echo |hexdump
        00000000: 45 43 48 4F 20 69 73 20 - 6F 6E 2E 0D 0A    |ECHO is on.  |
        0000000d;

        echo |od -tx1
        0000000 45 43 48 4f 20 69 73 20 6f 6e 2e 0d 0a
        0000015
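
If no hexdump or od is at hand, the same kind of listing can be produced from inside Perl. This is only a minimal sketch: the sub name `hex_bytes` is made up for illustration, and it assumes the string holds raw bytes (for example, read from a filehandle in binmode) rather than decoded characters.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated two-digit hex values.
# Assumes $data contains raw bytes (no characters above 0xFF).
sub hex_bytes {
    my ($data) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $data;
}

print hex_bytes("ECHO is on.\r\n"), "\n";
# prints: 45 43 48 4f 20 69 73 20 6f 6e 2e 0d 0a
```

Posting the output of something like this for the actual input makes it clear whether the data contains named entities, numeric entities, or the encoded characters themselves.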

      In case anybody visits this node again, I have a guess what sandy1028 is talking about. HTML (and XHTML) define a set of named character entities such as &nbsp;. Generic XML parsers will not recognize these entities because they are application specific: they come from the (X)HTML DTDs, while XML itself predefines only &lt;, &gt;, &amp;, &quot; and &apos;. So he or she needs, for whatever reason, to translate XHTML-specific named character entities to their corresponding numeric character entities for use in some non-HTML XML application.
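
If that guess is right, the translation can be sketched with a small lookup table. This is only a rough sketch, not a complete solution: the `%entity_to_codepoint` table and the `named_to_numeric` sub are names invented for this example, the table lists just a handful of entities, and anything not in the table (including the five predefined XML entities) is passed through untouched.

```perl
use strict;
use warnings;

# Hypothetical, deliberately incomplete table: entity name => Unicode code point.
my %entity_to_codepoint = (
    nbsp  => 160,
    ldquo => 8220,
    rdquo => 8221,
    lsquo => 8216,
    rsquo => 8217,
);

# Replace known named entities with decimal numeric character references.
# Names missing from the table (amp, lt, gt, quot, apos, ...) are left as they are.
sub named_to_numeric {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}
              {exists $entity_to_codepoint{$1} ? "&#$entity_to_codepoint{$1};" : "&$1;"}ge;
    return $text;
}

print named_to_numeric('&ldquo;events,&rdquo; &amp; more'), "\n";
# prints: &#8220;events,&#8221; &amp; more
```

For real data a complete entity table is needed. The CPAN module HTML::Entities provides `decode_entities`, which may be a better starting point than a hand-written table.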

      A few years ago I had to do this myself when reformatting some good old-fashioned HTML into something that could be used in an XSLT stylesheet. Man, was that a pain.

      How to encode or decode ’ to ’

        If you provide sample data you may get more specific guidance. Your description is a bit too vague for anyone to be certain what you have as input and what you want as output. There are many possibilities.

        In addition to the suggestions already given, you may find perlunifaq and Encode helpful. I suspect you don't need Encode for what you are trying to do, but these will give you terminology and context to help you understand about encodings in general and about perl's internal representation of strings, which may be what you are trying to manipulate.
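
To make the bytes-versus-characters distinction concrete, here is a small sketch using Encode. It assumes the input is UTF-8, which the thread does not actually establish; the variable names are for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input arrives as UTF-8 encoded bytes.
my $bytes = "\xE2\x80\x9Cevents,\xE2\x80\x9D";   # U+201C events, U+201D as UTF-8
my $chars = decode('UTF-8', $bytes);             # now a string of characters

printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints: 13 bytes, 9 characters

# Once decoded, every non-ASCII character can be turned into a numeric reference.
(my $ascii = $chars) =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
print "$ascii\n";
# prints: &#8220;events,&#8221;
```

If the data is in some other encoding, the first argument to `decode` has to change accordingly.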

        Perl regular expressions support escape sequences that allow you to specify fairly arbitrary values in your string, including Unicode code points.

        \033        octal char (example: ESC)
        \x1B        hex char (example: ESC)
        \x{263a}    long hex char (example: Unicode SMILEY)
        \cK         control char (example: VT)
        \N{name}    named Unicode character

        It may be that all you need to do is specify the correct characters in your RE, using one of the escapes (probably long hex char or named Unicode character, depending on your preference). But it is possible you will have to decode your input first.
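
As a rough illustration of those escapes, the sketch below matches curly quotes by code point and by name and rewrites them as numeric character references. It assumes the string already holds decoded characters rather than raw bytes, and the sub name `quotes_to_numeric` is invented for this example.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on perls older than 5.16

# Assumes $text holds decoded characters, not raw UTF-8 bytes.
sub quotes_to_numeric {
    my ($text) = @_;
    $text =~ s/\x{201C}/&#8220;/g;                         # long hex char
    $text =~ s/\x{201D}/&#8221;/g;                         # long hex char
    $text =~ s/\N{RIGHT SINGLE QUOTATION MARK}/&#8217;/g;  # named Unicode character
    return $text;
}

my $sample = "\x{201C}events,\x{201D} it\x{2019}s";
print quotes_to_numeric($sample), "\n";
# prints: &#8220;events,&#8221; it&#8217;s
```

Whether hex or named escapes read better is a matter of taste; both match the same characters.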

        If you use Devel::Peek's Dump to dump your input data and post that, then you might get more specific advice.
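
A minimal sketch of what that might look like, with two made-up sample strings standing in for the real input:

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $bytes = "\xE2\x80\x9C";   # three undecoded bytes (U+201C encoded as UTF-8)
my $chars = "\x{201C}";       # one character, U+201C

# Dump writes to STDERR. Compare the FLAGS and PV lines of the two dumps:
# the character string should carry the UTF8 flag, the byte string should not.
Dump($bytes);
Dump($chars);
```

Posting the dump of the actual input shows whether it is a byte string or a character string, which is usually the first thing to settle.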

        Buy an encoder?
