LWP problem

by coltman (Acolyte)
on Sep 08, 2008 at 15:30 UTC ( [id://709792]=perlquestion )

coltman has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I ran into a weird case when I tried to download the web page "http://securities.stanford.edu/1008/UTIIQ96/" using LWP::Simple::get() and save $content to a text file.

The weird thing is that if I open the text file in an editor (e.g., UltraEdit), it looks perfectly normal:

<HTML><HEAD><TITLE>Unitech Industries, Inc. - Securities Class Action</TITLE>

However, if I "print $content" while downloading, the log shows something different:

< H T M L > < H E A D > < T I T L E > U n i t e c h I n d u s t r i e s , I n c . - S e c u r i t i e s C l a s s A c t i o n < / T I T L E >

It just adds a space after every character.

When I try to use a regex to extract information, the space issue haunts me all the time, as Perl always reads the text file as if it had the extra spaces!

I would appreciate it if someone could give me a hint about the cause of the problem and a solution.

Thank you!

Replies are listed 'Best First'.
Re: LWP problem
by kyle (Abbot) on Sep 08, 2008 at 15:50 UTC

    My browser thinks that page is "UTF-16 (Little endian)" encoded. I did this:

    use LWP::Simple;
    use Encode;

    my $p = get( 'http://securities.stanford.edu/1008/UTIIQ96/' );
    my $d = decode( 'UTF-16LE', substr( $p, 2 ), 1 );

    After that, $d comes out without all the null characters. I'm using a substr of $p so as to skip over the Byte-order mark.
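For anyone who wants to try this without hitting the network, here is a minimal sketch of the same idea. The sample string and the variable names are assumptions made up for illustration; only the BOM-skipping and the decode call mirror the snippet above. It also counts the NUL bytes, which are what many terminals display as the "spaces".

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for what get() returned (an assumption, not the real page):
# a little-endian BOM (FF FE) followed by UTF-16LE text.
my $raw = "\xFF\xFE" . encode('UTF-16LE', '<TITLE>Unitech</TITLE>');

# In UTF-16LE every ASCII character is followed by a NUL byte.
# Those NULs are what show up as blanks when the raw bytes are printed.
my $nul_count = () = $raw =~ /\x00/g;

# Copy everything after the 2-byte BOM, then decode it as little-endian UTF-16.
my $body = substr($raw, 2);
my $text = decode('UTF-16LE', $body);

print "NUL bytes in raw data: $nul_count\n";
print "$text\n";
```

Copying the substr result into its own variable before decoding is a deliberate choice here: it keeps $raw untouched no matter which CHECK value is later passed to decode.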

Re: LWP problem
by betterworld (Curate) on Sep 08, 2008 at 15:40 UTC

    It seems like that document is in UTF-16 encoding. Try this:

    use LWP::Simple;
    use Encode;

    my $x = get("http://securities.stanford.edu/1008/UTIIQ96/");
    print length $x, "\n";   # prints 59810
    $x = decode("utf-16", $x);
    print length $x, "\n";   # prints 29904
    print $x;                # prints the document
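The two lengths reported above fit together: 59810 bytes minus a 2-byte byte-order mark leaves 59808 bytes, and at 2 bytes per character that gives 29904 characters. The sketch below shows the same relation on a small made-up sample (the sample text is an assumption, and the 2-bytes-per-character figure only holds while every character is in the Basic Multilingual Plane).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample text standing in for the downloaded page.
my $sample = '<HTML><HEAD><TITLE>Unitech Industries, Inc.</TITLE>';

# encode('UTF-16', ...) writes a 2-byte BOM followed by the 16-bit code units.
my $bytes = encode('UTF-16', $sample);

# decode('UTF-16', ...) reads the BOM to pick the byte order, then drops it.
my $chars = decode('UTF-16', $bytes);

printf "%d bytes -> %d characters\n", length $bytes, length $chars;

# The same relation applied to the byte count reported in the reply above.
printf "(59810 - 2) / 2 = %d\n", (59810 - 2) / 2;
```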
Re: LWP problem
by deus.lemmus (Initiate) on Sep 10, 2008 at 13:31 UTC
    That looks like you're seeing the text as UTF-16 in the second case. You could try running it through something from the Encode family (e.g. Encode::decode) to convert it to UTF-8 (or some other encoding) if you need to.
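Since the original goal was to run regexes over the page, here is a rough sketch of how the pieces might fit together: decode once right after the download, match against the decoded string, and encode to UTF-8 only on output. The sample bytes and the variable names are assumptions for illustration, not the real page content.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for the downloaded bytes (an assumption): BOM plus UTF-16LE text.
my $raw = "\xFF\xFE" . encode('UTF-16LE',
    '<HTML><HEAD><TITLE>Unitech Industries, Inc. - Securities Class Action</TITLE>');

# Decode once; 'UTF-16' uses the BOM to pick the byte order.
my $html = decode('UTF-16', $raw);

# The regex now sees ordinary characters with no NUL bytes between them.
my ($title) = $html =~ m{<TITLE>(.*?)</TITLE>}is;

# Convert to UTF-8 only when writing the result out.
binmode STDOUT, ':encoding(UTF-8)';
print "$title\n";
```

If the page has already been saved to disk as raw bytes, opening the file with an encoding layer such as '<:encoding(UTF-16)' should give the same decoded text when it is read back, assuming the saved file still starts with the byte-order mark.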
