stripping characters from html

by jonnyfolk (Vicar)
on Aug 03, 2010 at 13:05 UTC ( [id://852653] )

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am transforming html files into xml data sheets and I am finding certain characters are breaking the script. I am trying to filter these out, but I fear I am being reactive rather than proactive, and it's really a case of waiting for a character to break the script. Is there a generic approach or module to try to trap these things before they crop up?

$content =~ s/ís//gi;
$content =~ s/ìH//gi;
$content =~ s/Ùt//gi;
$content =~ s/ía//gi;
$content =~ s/∫s//gi;
$content =~ s/íX//gi;

Lindsay Côté
United States
Lindsay’s love of travel

Replies are listed 'Best First'.
Re: stripping characters from html
by almut (Canon) on Aug 03, 2010 at 13:29 UTC
    I am finding certain characters are breaking the script.

    In what way are they breaking the script?  Maybe you just need to entity-encode those characters, preferably as numeric entities (encode_entities_numeric()), because in contrast to HTML, XML predefines only very few named entities (i.e. ones that work without explicit entity declarations).  Does ∫ really cause an error?
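    A minimal sketch of the numeric-entity approach, assuming $content already holds the text as Perl characters (I believe encode_entities_numeric() is not exported by default, so it has to be requested explicitly):

        use HTML::Entities qw(decode_entities encode_entities_numeric);

        # resolve any named entities first, then re-encode everything unsafe
        # (controls, high-bit characters, <, >, &, ") as &#x...; references
        # that an XML parser will accept without entity declarations
        my $text     = decode_entities($content);
        my $xml_safe = encode_entities_numeric($text);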

    Alternatively, try specifying an appropriate encoding (in the first line of the XML file: <?xml version="1.0" encoding="..."?>).

    Or, as a last resort, simply strip everything outside of the ASCII range.

Re: stripping characters from html
by graff (Chancellor) on Aug 03, 2010 at 21:01 UTC
    If your goal is to create an XML output whose content is an imperfect and incomplete copy of the original HTML text data (i.e. with an indeterminate amount of corruption due to loss of content), then a "generic approach" for implementing what almut aptly calls the "last resort" solution is a simple regex, applied to the HTML text data:
    s/[^\x00-\x7f]+//g;
    That is, every byte/character outside the ASCII range will be deleted, regardless of whether your perl script happens to be handling the data as bytes or as characters.

    A better approach would be to understand what the character encoding of the incoming HTML data really is (and watch out for those HTML character entities that turn into non-ascii characters, like &trade; &eacute; &nbsp; and so on). Make sure you do everything necessary to turn the text into "pure" utf8 strings (using HTML::Entities::decode_entities), and then output the XML with proper utf8 encoding, or else convert all non-ascii characters to their numeric character entities, as almut suggested above.
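    A rough sketch of that pipeline, assuming the incoming HTML is Latin-1 bytes held in $raw_html and output.xml is a placeholder name (substitute whatever encoding the pages really use):

        use Encode qw(decode);
        use HTML::Entities qw(decode_entities);

        # turn the raw bytes into Perl characters, then resolve &eacute; etc.
        my $chars = decode('ISO-8859-1', $raw_html);
        $chars = decode_entities($chars);

        # write the XML out through a UTF-8 layer so non-ascii survives intact
        open my $out, '>:encoding(UTF-8)', 'output.xml' or die $!;
        print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
        print {$out} $chars;
        close $out;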

    There's probably a module for converting characters to numeric entities, but the basic process is:

    s/([^\x00-\x7f])/sprintf("&#%d;",ord($1))/eg;
    (update: added a missing "#" in the sprintf format string)

    But personally, I prefer having XML files with utf8 text in them.

    In either case, perl has to know that the string contains utf8 characters, so it can treat it as (multi-byte) characters rather than as bytes. That means you've read the data from a file handle using a ":utf8" IO layer, or that you've used Encode::decode to convert the text to utf8.
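    A small sketch of reading through an encoding layer and then applying the substitution, assuming the source file is UTF-8 (input.html is a placeholder name):

        # read through an encoding layer so Perl sees characters, not bytes
        open my $in, '<:encoding(UTF-8)', 'input.html' or die $!;
        my $content = do { local $/; <$in> };
        close $in;

        # replace every non-ascii character with its numeric reference
        $content =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/eg;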

      I agree with keeping the stuff utf8, etc.

      s/[^\x00-\x7f]+//g;

      may be more readable as (what I believe is the equivalent POSIX class)-

      s/[^[:ascii:]]+//g;
        ... may be more readable as (what I believe is the equivalent POSIX class) ...

        Right -- and I totally agree (and yes I'm pretty sure the POSIX expression is equivalent). But "more readable" can be different things to different people; e.g. a specific numeric range can lead to less uncertainty or doubt, compared to having to recall the exact syntax and meaning of an expression consisting of extra punctuation around a term that tends to be misused or misunderstood by less experienced programmers...

Re: stripping characters from html
by pemungkah (Priest) on Aug 04, 2010 at 00:53 UTC
    The key bit for me here is "transforming HTML files into XML data sheets". Unless a different encoding is declared in the header, an XML parser treats anything outside the ASCII range as bad, and for an XML parser "bad character" means "fall over dead". That bites when you putatively have ASCII but characters outside the normal range creep in anyway - say you're testing a web crawler that hits a page with a Unicode snippet on it.

    I had this problem turning TAP output containing Unicode into JUnit XML; the solution was to translate any character outside of the printable ASCII range (except for newline and carriage return) to an &#xx; sequence, and to do the same for any embedded ", &, <, and > characters. The XML parser was then happy with that character set.
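    A minimal sketch of that kind of translation, assuming $text already holds decoded (character) data; the & has to be handled first so the references generated afterwards are not escaped again:

        # escape the XML metacharacters, & before the others
        $text =~ s/&/&#38;/g;
        $text =~ s/</&#60;/g;
        $text =~ s/>/&#62;/g;
        $text =~ s/"/&#34;/g;

        # turn anything outside printable ASCII (keeping newline and
        # carriage return) into a numeric character reference as well
        $text =~ s/([^\x20-\x7e\n\r])/sprintf("&#%d;", ord($1))/eg;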

    This is useful if you can't trust the encoding of the input to be right, since ASCII is kind of the "least common denominator" when it comes to character sets.
