PerlMonks  

How to encode after using HTML::Strip

by myfrndjk (Sexton)
on Aug 28, 2014 at 18:36 UTC ( #1098904=perlquestion )
myfrndjk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to encode an HTML page as cp1252, since it contains a lot of special characters, pound signs among them. But when I save the content after running it through HTML::Strip, those characters are displayed as junk. I tried encoding with cp1252, but it is not working. Please help me fix the issue.

use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Entities;
use HTML::Strip;
use encoding "cp1252";

open (OUT, '>:encoding(cp1252)', "/home/local/ANT/jeyakuma/Desktop/test.html");

my $URL = 'http://www.footlocker.eu/it/it/k/Customer-Service/Shipping.aspx';
my $agent = LWP::UserAgent->new(agent => "Mozilla/5.0");
my $request = HTTP::Request->new(GET => $URL);
my $response = $agent->request($request);

# Check the outcome of the response
if ($response->is_success) {
    my $xp = HTML::TreeBuilder::XPath->new_from_content($response->decoded_content);
    my $raw_html = $xp->findnodes_as_string('//div[@class="faq_text"]/p/strong/u[contains(.,\'spedizione Standard \')]');
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;
    print OUT $clean_text;
}
elsif ($response->is_error) {
    print "Error:$URL\n";
    print $response->error_as_HTML;
}

Expected output: £ 60/

Current output: Â£ 60/
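
Junk characters appearing next to a pound sign are typical of UTF-8 bytes being read as single-byte characters. The following is a minimal sketch of that effect using only the core Encode module; it assumes the pound sign is the character involved and does not touch HTML::Strip or the page in question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The pound sign as a Perl character string (U+00A3).
my $pound = "\x{A3}";

# Its UTF-8 encoding is two bytes: 0xC2 0xA3.
my $utf8_bytes = encode('UTF-8', $pound);

# Misreading those two bytes as cp1252 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A3 (pound sign).
my $mojibake = decode('cp1252', $utf8_bytes);

# Prints "UTF-8 bytes: C2 A3"
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;

# Prints "Misdecoded:  U+00C2 U+00A3"
printf "Misdecoded:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $mojibake;
```

If the saved file shows an extra character in front of each pound sign, the text was probably encoded as UTF-8 somewhere along the way and then treated as if it were already characters.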

Re: How to encode after using HTML::Strip
by Loops (Curate) on Aug 28, 2014 at 20:40 UTC
    Hi, HTML::Strip has this bug: Bug #42834 for HTML-Strip, "HTML::Strip breaks UTF-8". The workaround quoted there makes your code work as you desire:
    use Encode;
    use utf8;

    sub parse_workaround {
        my $html = shift;
        my $hs = HTML::Strip->new();
        my $octets = encode_utf8($html);
        utf8::downgrade($octets);
        my $stripped = $hs->parse($octets);
        $hs->eof;
        return decode_utf8($stripped);
    }
    And subbing it into your original code:

    my $clean_text = parse_workaround( $raw_html );
    # my $hs = HTML::Strip->new();
    # my $clean_text = $hs->parse( $raw_html );
    # $hs->eof;
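
    The workaround follows a general pattern: hand a byte-oriented routine UTF-8 octets, then decode its result back into characters before printing through an encoding layer. Here is a minimal sketch of that pattern using only core modules. Note that strip_tags_bytes is a hypothetical regex stand-in for HTML::Strip (not its real behaviour), strip_via_octets is an illustrative name, and the sample markup is invented.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical stand-in for a byte-oriented tag stripper.
# It only removes anything that looks like <...>.
sub strip_tags_bytes {
    my ($octets) = @_;
    (my $text = $octets) =~ s/<[^>]*>//g;
    return $text;
}

# Same shape as the workaround: characters -> octets -> strip -> characters.
sub strip_via_octets {
    my ($html) = @_;
    my $octets   = encode_utf8($html);
    my $stripped = strip_tags_bytes($octets);
    return decode_utf8($stripped);
}

# Invented sample input containing a pound sign (U+00A3).
my $clean = strip_via_octets("<u>\x{A3} 60</u>");

# Write through a cp1252 layer into an in-memory buffer instead of a file.
my $buffer = '';
open my $out, '>:encoding(cp1252)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $clean;
close $out;

# Show the bytes that would land on disk; the pound sign is the single byte A3.
print join(' ', map { sprintf '%02X', ord($_) } split //, $buffer), "\n";
```

    Because the text is a character string again by the time it reaches the output handle, the :encoding(cp1252) layer encodes each character exactly once.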

Node Type: perlquestion [id://1098904]
Front-paged by toolic