PerlMonks
stripping characters from html

by jonnyfolk (Vicar)
on Aug 03, 2010 at 13:05 UTC ( #852653=perlquestion )
jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am transforming html files into xml data sheets and I am finding certain characters are breaking the script. I am trying to filter these out, but I fear I am being reactive rather than proactive; it's really a case of waiting for a character to break the script. Is there a generic approach or module to trap these things before they crop up?

$content =~ s/s//gi;
$content =~ s/H//gi;
$content =~ s/t//gi;
$content =~ s/a//gi;
$content =~ s/∫s//gi;
$content =~ s/X//gi;

Lindsay Ct
United States
Lindsays love of travel

Re: stripping characters from html
by almut (Canon) on Aug 03, 2010 at 13:29 UTC
    I am finding certain characters are breaking the script.

    In what way are they breaking the script?  Maybe you just need to entity-encode those characters (preferably as numeric entities, e.g. with encode_entities_numeric(), because in contrast to HTML, XML predefines only a handful of named entities that work without explicit entity declarations).  Does ∫ really cause an error?

    Alternatively, try specifying an appropriate encoding (in the first line of the XML file: <?xml version="1.0" encoding="..."?>).

    Or, as a last resort, simply strip everything outside of the ASCII range.

Re: stripping characters from html
by graff (Chancellor) on Aug 03, 2010 at 21:01 UTC
    If your goal is to create an XML output whose content is an imperfect and incomplete copy of the original HTML text data (i.e. with an indeterminate amount of corruption due to loss of content), then a "generic approach" for implementing what almut aptly calls the "last resort" solution is a simple regex, applied to the HTML text data:
    s/[^\x00-\x7f]+//g;
    That is, every byte/character outside the ASCII range will be deleted, regardless of whether your perl script happens to be handling the data as bytes or as characters.

    A better approach would be to understand what the character encoding of the incoming HTML data really is (and watch out for those HTML character entities that turn into non-ascii characters, like &trade; &eacute; &nbsp; and so on). Make sure you do everything necessary to turn the text into "pure" utf8 strings (using HTML::Entities::decode_entities), and then output the XML with proper utf8 encoding, or else convert all non-ascii characters to their numeric character entities, as almut suggested above.

    There's probably a module for converting characters to numeric entities, but the basic process is:

    s/([^\x00-\x7f])/sprintf("&#%d;",ord($1))/eg;
    (update: added a missing "#" in the sprintf format string)

    But personally, I prefer having XML files with utf8 text in them.

    In either case, perl has to know that the string contains utf8 characters, so it can treat the data as (multi-byte) characters rather than as bytes. That means you've read the data from a file handle using a ":utf8" IO layer, or you've used Encode::decode to convert the byte string to a character string.

      I agree with keeping the stuff utf8, etc.

      s/[^\x00-\x7f]+//g;

      may be more readable as (what I believe is the equivalent POSIX class)-

      s/[^[:ascii:]]+//g;
        ... may be more readable as (what I believe is the equivalent POSIX class) ...

        Right -- and I totally agree (and yes I'm pretty sure the POSIX expression is equivalent). But "more readable" can be different things to different people; e.g. a specific numeric range can lead to less uncertainty or doubt, compared to having to recall the exact syntax and meaning of an expression consisting of extra punctuation around a term that tends to be misused or misunderstood by less experienced programmers...

Re: stripping characters from html
by pemungkah (Priest) on Aug 04, 2010 at 00:53 UTC
    The key bit for me here is "transforming HTML files into XML data sheets". Unless an encoding is specifically set in the header, the XML parser treats anything outside the ASCII range as bad. And if you putatively have ASCII but characters outside that range sneak in (say you're testing a web crawler that hits a page with a Unicode snippet on it), then for an XML parser "bad character" means "fall over dead".

    I had this problem turning TAP output containing Unicode into JUnit XML; the solution was to translate any character outside of the printable ASCII range (except for newline and carriage return) to an &#xx; sequence, and the same for any embedded ", &, <, and > characters. The XML parser was then happy with that character set.

    This is useful if you can't trust the encoding of the input to be right, since ASCII is kind of the "least common denominator" when it comes to character sets.
