PerlMonks  

How to Remove Junk Characters

by Rajeshk (Scribe)
on Jan 05, 2006 at 09:46 UTC (node #521155, perlquestion)
Rajeshk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have a problem downloading HTML files with LWP::UserAgent.
The downloaded files contain junk characters.
Is there any way to download a file without the junk?
Note: I am on Windows. To see the junk characters, download this page: 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148'

Here are some sample junk characters:

Downloaded file (input) -- Original (output)
===========================================
 1. jury trial. For his -- jury trial. For his
 2. Börries Ahrens -- Börries Ahrens
 3. Aldejohann’s main -- Mr. Aldejohann’s
 4. University of MĂĽnster -- University of Münster
 5. the €625 million senior and €130 -- €625 million senior and €130
 6. acquisition of a properties’ -- acquisition of a properties’
 7. Westfield College – University -- Westfield College – University
 8. TelĂ©fonos -- Teléfonos
 9. (CelumĂłvil S -- (Celumóvil S
10. Dr. jur., 1990, with a dissertation on “Die Unabhängigkeit des genossenschaftlichen PrĂĽfungsverbandes” (“The Independence of the Cooperative Inspection Association”)
    --- Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")

Here is my try:

use LWP::UserAgent;

my $ua = new LWP::UserAgent;
$ua->proxy(['http'] => 'http://00.00.0.00:0000');
my $url = 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148';

# Create a request
my $req = HTTP::Request->new('GET' => $url);
$req->proxy_authorization_basic("xxxxx", "xxxxx");
my $res = $ua->request($req);

if ($res->is_success) {
    my $file_cnt = $res->content;
    print "$file_cnt";
    open WOUT, ">out.html" or die "Can't open File: out.html";
    print WOUT $file_cnt;
    close WOUT;
}
else {
    print "Download Error\n";
}


Thanks & Regards,
Rajesh.K

Re: How to Remove Junk Characters
by wfsp (Abbot) on Jan 05, 2006 at 10:18 UTC

    Could you show us what it looks like?

    Post a sample to give us some idea.

    Update:
    Try this

    my $file_cnt = $res->content;
    $file_cnt =~ s/\r//g;

    update:
    see below

      Hi wfsp,

      I tried your code. $file_cnt =~ s/\r//g;
      It's not working.


      Thanks,
      Rajesh.K

Re: How to Remove Junk Characters
by zentara (Archbishop) on Jan 05, 2006 at 12:31 UTC
    I took out your $ua->proxy line and your code runs fine. The out.html has no corruption. I'm on linux using Mozilla.

    I'm not really a human, but I play one on earth. flash japh
      He may mean this
      <title>....Dr. Börries.....</title>
      It should, of course, be "Börries"
      I tried
      binmode(STDOUT, ':utf8');
      with no success. Any idea what's happening?

      wfsp

      update:
      see below

        This works:
        use Unicode::String qw(utf8);
        #....
        print utf8($file_cnt);


        holli, /regexed monk/
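One possible reading of why binmode(STDOUT, ':utf8') alone did not help: $res->content returns undecoded bytes, and a :utf8 output layer encodes each of those bytes a second time. The sketch below only illustrates that mechanism with the core Encode module; the sample string is taken from the thread, the variable names are made up for illustration, and it is an assumption that this is what happened in wfsp's test.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Börries", as an HTTP body would deliver them.
my $bytes = "B\xC3\xB6rries";

# Encoding undecoded bytes treats each byte as its own character,
# which is what printing them through a :utf8 layer amounts to.
my $double_encoded = encode('UTF-8', $bytes);

# Decoding first turns the bytes into characters; encoding on output
# then reproduces the original byte sequence.
my $chars      = decode('UTF-8', $bytes);
my $round_trip = encode('UTF-8', $chars);

printf "bytes in: %d, double-encoded: %d, characters: %d, round trip: %d\n",
    length $bytes, length $double_encoded, length $chars, length $round_trip;
```

The double-encoded string is longer than the input, while the decode-then-encode round trip gives back the same bytes.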
Re: How to Remove Junk Characters
by abcde (Scribe) on Jan 05, 2006 at 13:33 UTC

    I am not sure what you mean by "junk" characters. Could you post an example?

    I took out the proxy lines and ran the code; the file downloaded without any errors. However, I think you are referring to accented characters such as ö in the source. Use HTML::Entities if you want to encode them in the proper &ouml; form.

    But please post an example so we can be sure of what you want.

    ~abseed
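A minimal sketch of the HTML::Entities suggestion, assuming the page body is UTF-8 (the thread does not state the charset explicitly). The bytes are decoded into characters first, because encode_entities works on characters; fed the undecoded bytes it would encode each byte separately. The sample string comes from the list in this thread and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes for "University of Münster".
my $bytes = "University of M\xC3\xBCnster";

# Decode to characters, then replace non-ASCII characters with entities.
my $chars      = decode('UTF-8', $bytes);
my $ascii_html = encode_entities($chars);

print "$ascii_html\n";    # prints "University of M&uuml;nster"
```

The result is plain ASCII, so it displays the same regardless of the code page the file is later opened with.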

      Hi Monks,

      Here are some sample junk characters:

      Downloaded file (input) -- Original (output)
      ===========================================
       1. jury trial. For his -- jury trial. For his
       2. Börries Ahrens -- Börries Ahrens
       3. Aldejohann’s main -- Mr. Aldejohann’s
       4. University of MĂĽnster -- University of Münster
       5. the €625 million senior and €130 -- €625 million senior and €130
       6. acquisition of a properties’ -- acquisition of a properties’
       7. Westfield College – University -- Westfield College – University
       8. TelĂ©fonos -- Teléfonos
       9. (CelumĂłvil S -- (Celumóvil S
      10. Dr. jur., 1990, with a dissertation on “Die Unabhängigkeit des genossenschaftlichen PrĂĽfungsverbandes” (“The Independence of the Cooperative Inspection Association”)
          --- Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")

      Thanks,
      Rajesh.K
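The samples in the list above look like UTF-8 bytes being read as a single-byte Windows code page. That is an inference from the character patterns, not something the poster confirmed, and the choice of cp1250 here is an assumption. The sketch below reproduces the effect with the core Encode module; for these three inputs the output should have the same shape as items 2, 8 and 5 in the list.

```perl
use strict;
use warnings;
use utf8;                          # this source contains literal non-ASCII text
use Encode qw(decode encode);

binmode STDOUT, ':encoding(UTF-8)';

for my $word ('Börries', 'Teléfonos', '€625') {
    # What the server sends: the UTF-8 encoding of the text.
    my $utf8_bytes = encode('UTF-8', $word);

    # What a viewer shows when it assumes cp1250 instead of UTF-8:
    # every byte becomes one unrelated character.
    my $misread = decode('cp1250', $utf8_bytes);

    print "$word -> $misread\n";
}
```

If this reading is right, the downloaded bytes are intact and only their interpretation is wrong, so stripping characters with a regex would damage the text rather than repair it.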

        Change
        my $file_cnt = $res->content;
        to
        my $file_cnt = $res->decoded_content;

        See HTTP::Message for an explanation of the difference.

        Many thanks to the search artist kwapping for finding it and to tye for explaining it :-)
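A small offline sketch of the difference between the two methods. It builds an HTTP::Response by hand instead of fetching the page, so the header, body and output file name are stand-ins chosen for illustration, and the charset is assumed to be UTF-8. It also writes the decoded text through an explicit encoding layer, since printing decoded characters to a plain filehandle can trigger "Wide character" warnings.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for the response LWP would return: a UTF-8 body plus a
# Content-Type header that declares the charset.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<title>Dr. B\xC3\xB6rries Ahrens</title>",
);

my $raw     = $res->content;            # undecoded bytes
my $decoded = $res->decoded_content;    # Perl characters

# Write the characters back out with an explicit encoding layer.
open my $out, '>:encoding(UTF-8)', 'out.html'
    or die "Can't open out.html: $!";
print {$out} $decoded;
close $out or die "Can't close out.html: $!";

printf "raw length %d, decoded length %d\n", length $raw, length $decoded;
```

The raw string is one unit longer than the decoded one because "ö" occupies two bytes in UTF-8 but is a single character once decoded.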

Node Type: perlquestion [id://521155]
Approved by Corion