PerlMonks  

How to Remove Junk Characters

by Rajeshk (Scribe)
on Jan 05, 2006 at 09:46 UTC (#521155=perlquestion)
Rajeshk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have a problem downloading HTML files with LWP::UserAgent.
The downloaded files contain some junk characters.
Is there any way to download the files without the junk?
Note: I am on Windows. Download this web page to see the junk characters: http://www.whitecase.com/attorneys/detail.aspx?attorney=1148

Here are some sample junk characters:

Downloaded file input -- Original output
===========================================
1. jury trial. For his -- jury trial. For his
2. Börries Ahrens -- Börries Ahrens
3. Aldejohann’s main -- Mr. Aldejohann’s
4. University of MĂĽnster -- University of Münster
5. the €625 million senior and €130 -- €625 million senior and €130
6. acquisition of a properties’ -- acquisition of a properties’
7. Westfield College – University -- Westfield College – University
8. TelĂ©fonos -- Teléfonos
9. (CelumĂłvil S -- (Celumóvil S
10. Dr. jur., 1990, with a dissertation on “Die Unabhängigkeit des genossenschaftlichen PrĂĽfungsverbandes” (“The Independence of the Cooperative Inspection Association”) -- Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")

Here is my try:

    use LWP::UserAgent;

    my $ua = new LWP::UserAgent;
    $ua->proxy(['http'] => 'http://00.00.0.00:0000');
    my $url = 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148';

    # Create a request
    my $req = HTTP::Request->new('GET' => $url);
    $req->proxy_authorization_basic("xxxxx", "xxxxx");
    my $res = $ua->request($req);

    if ($res->is_success) {
        my $file_cnt = $res->content;
        print "$file_cnt";
        open WOUT, ">out.html" or die "Can't open File: out.html";
        print WOUT $file_cnt;
        close WOUT;
    }
    else {
        print "Download Error\n";
    }


Thanks & Regards,
Rajesh.K

Re: How to Remove Junk Characters
by wfsp (Abbot) on Jan 05, 2006 at 10:18 UTC

    Could you show us what it looks like?

    Post a sample to give us some idea.

    Update:
    Try this

    my $file_cnt = $res->content;
    $file_cnt =~ s/\r//g;

    update:
    see below

      Hi wfsp,

      I tried your code. $file_cnt =~ s/\r//g;
      It's not working.


      Thanks,
      Rajesh.K

Re: How to Remove Junk Characters
by zentara (Archbishop) on Jan 05, 2006 at 12:31 UTC
    I took out your $ua->proxy line and your code runs fine. The out.html has no corruption. I'm on linux using Mozilla.

    I'm not really a human, but I play one on earth. flash japh
      He may mean this
      <title>....Dr. Börries.....</title>
      It should, of course, be "Börries"
      I tried
      binmode(STDOUT, ':utf8');
      with no success. Any idea what's happening?

      wfsp

      update:
      see below

        This works:
        use Unicode::String qw(utf8);
        #....
        print utf8($file_cnt);


        holli, /regexed monk/
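One plausible reason the binmode attempt above did not help, offered as a sketch and not a confirmed diagnosis: `$res->content` returns raw bytes, and a `:utf8` output layer treats each of those bytes as a separate character and encodes it again. The byte string and variable names below are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample: the UTF-8 byte sequence for "Börries" (ö is 0xC3 0xB6).
my $bytes = "B\xC3\xB6rries";

# Encoding the raw bytes again treats 0xC3 and 0xB6 as two separate
# characters, so each one grows into two bytes (double encoding).
my $double = encode('UTF-8', $bytes);

# Decoding first gives a character string that can be encoded once, safely.
my $chars = decode('UTF-8', $bytes);
my $once  = encode('UTF-8', $chars);

printf "bytes: %d, double-encoded: %d, decoded then encoded: %d\n",
    length($bytes), length($double), length($once);
```

The double-encoded string is two bytes longer than the input, which is the kind of expansion that shows up as extra accented characters in a viewer.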
Re: How to Remove Junk Characters
by abcde (Scribe) on Jan 05, 2006 at 13:33 UTC

    I am not sure what you mean by "junk" characters. Could you post an example of what you mean?

    I took out the proxy lines and ran the code; the file downloaded without any errors. However, I think you are referring to accented characters such as ö in the source. Use HTML::Entities if you want to encode them into the proper &ouml; format.

    But please post an example so we can be sure of what you want.

    ~abseed
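A minimal sketch of the HTML::Entities suggestion above, assuming the text has already been decoded into a Perl character string. The sample text is taken from the names in this thread; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Assumed input: a decoded character string (ö is U+00F6, ü is U+00FC).
my $text = "B\x{F6}rries Ahrens, University of M\x{FC}nster";

# By default encode_entities replaces control characters, high-bit
# characters and the characters <, &, >, ' and " with entities.
my $html = encode_entities($text);

print "$html\n";
```

The result is plain ASCII, so it survives whatever encoding the output file or terminal uses.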

      Hi Monks,

      Here are some sample junk characters:

      Downloaded file input -- Original output
      ===========================================
      1. jury trial. For his -- jury trial. For his
      2. Börries Ahrens -- Börries Ahrens
      3. Aldejohann’s main -- Mr. Aldejohann’s
      4. University of MĂĽnster -- University of Münster
      5. the €625 million senior and €130 -- €625 million senior and €130
      6. acquisition of a properties’ -- acquisition of a properties’
      7. Westfield College – University -- Westfield College – University
      8. TelĂ©fonos -- Teléfonos
      9. (CelumĂłvil S -- (Celumóvil S
      10. Dr. jur., 1990, with a dissertation on “Die Unabhängigkeit des genossenschaftlichen PrĂĽfungsverbandes” (“The Independence of the Cooperative Inspection Association”) -- Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")

      Thanks,
      Rajesh.K

        Change
        my $file_cnt = $res->content;
        to
        my $file_cnt = $res->decoded_content;

        See HTTP::Message for an explanation of the difference.

        Many thanks to the search artist kwapping for finding it and to tye for explaining it :-)
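The difference between the two calls can be sketched without any network access by building an HTTP::Response by hand. The body bytes, the Content-Type header and the variable names below are assumptions for illustration; the real page may declare a different charset.

```perl
use strict;
use warnings;
use HTTP::Response;

# Assumed sample body: the UTF-8 bytes for "Börries" (ö is 0xC3 0xB6).
my $bytes = "B\xC3\xB6rries";
my $res   = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $res->content;          # undecoded bytes
my $decoded = $res->decoded_content;  # Perl character string

printf "raw length: %d, decoded length: %d\n",
    length($raw), length($decoded);

# Write the decoded text with an explicit encoding layer so that
# each character is encoded exactly once on output.
open my $out, '>:encoding(UTF-8)', 'out.html'
    or die "Can't open out.html: $!";
print {$out} $decoded;
close $out or die "Can't close out.html: $!";
```

With `decoded_content`, the string holds characters, so pairing it with an encoding layer on the output handle avoids both "Wide character" warnings and double encoding.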
