PerlMonks  

Perl not printing any special characters in array

by myfrndjk (Sexton)
on Jun 21, 2014 at 23:47 UTC ( [id://1090802]=perlquestion )

myfrndjk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to scrape content and store it under its respective name. When I print the crawled content, the special characters are not printed: they are all replaced by junk values. For example, the euro sign (€) is printed as (-aA). The site I am scraping is in German and full of special characters, so most of the crawled content differs from the original. Thanks in advance.

    use LWP::Simple;
    use File::Compare;
    use HTML::TreeBuilder::XPath;
    use LWP::UserAgent;

    open(FILE, "C:/Users/jk/Desktop/input/input.txt");
    {
        while(<FILE>)
        {
            chomp;
            $url=$_;
            foreach ($url)
            {
                ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
            }
            do 'C:/Users/jk/Desktop/perl/mainsub.pl';
            &domain_check();
            my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
            my $req = HTTP::Request->new(GET => "$url");
            my $res = $ua->request($req);
            die("error") unless $res->is_success;
            my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);
            my @node = $xp->findnodes_as_strings("$xpath");
            die("node doesn't exist") if $#node == -1;
            foreach(<@node>)
            {
                $death=$_;
                open HTML ">C:/Users/jk/Desktop/fun/perl/$site.html";
                print HTML "$death\n";
            }
        }
    }

subroutine

    use LWP::Simple;
    use File::Compare;
    use HTML::TreeBuilder::XPath;
    use LWP::UserAgent;

    sub domain_check
    {
        if($domain eq 'goo.eu')
        {
            $competitor = 'goo.eu';
            $xpath ='//p/strong';
        }
        if ($domain eq 'mov.it')
        {
            $competitor = 'mov.it';
            $xpath = '//div//table//td';
        }
        elsif ($domain eq 'lot.it')
        {
            $competitor = 'lot.it';
            $xpath = '//div//table';
        }
    }

Replies are listed 'Best First'.
Re: Perl prints only last line of array
by RMGir (Prior) on Jun 22, 2014 at 00:09 UTC
    I think that your problem is that you're re-creating the file for every line.

    Try moving the "open HTML" line out above the loops...

        # You don't want to do this for each line!
        open HTML ">C:/Users/jk/Desktop/fun/perl/$site.html";
        foreach(<@node>)
        {
            $death=$_;
            print HTML "$death\n";
        }

    Mike
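The suggestion above, opening the output file once before the loop, can be sketched as follows. This is a minimal illustration rather than the poster's actual script: the contents of `@node` and the file name `nodes_demo.html` are made up, and it uses a lexical filehandle with three-argument `open` in place of the bareword `HTML` handle.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the strings returned by findnodes_as_strings().
my @node = ('first line', 'second line', 'third line');

# Hypothetical output file name, for illustration only.
my $out = 'nodes_demo.html';

# Open once, in write mode, before the loop. Opening with '>' inside
# the loop would truncate the file on every iteration, so only the
# last element would survive.
open my $fh, '>', $out or die "Cannot open $out: $!";

foreach my $line (@node) {
    print {$fh} "$line\n";
}

close $fh or die "Cannot close $out: $!";
```

If the file has to be reopened inside a loop for some reason, append mode (`'>>'`) would avoid the truncation, at the cost of keeping stale content from earlier runs.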

      Hi Mike, thanks for your help! I modified the code as you suggested and it works fine. But now I have another issue: the crawled content doesn't contain any special characters. They are all replaced by junk values; for example, the euro sign (€) is printed as (-aA). The site I am scraping is in German and full of special characters, so most of the crawled content differs from the original. Thanks again, jk

Re: Perl not printing any special characters in array
by hippo (Bishop) on Jun 22, 2014 at 15:03 UTC
    All special characters are replaced by some junk values. for example (€)euro is printed as (-aA).

    This suggests that you have forgotten to decode your input or to encode your output or both. Have you read perlunitut and perlunifaq?
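A minimal sketch of the decode-on-input, encode-on-output pattern, using the core `Encode` module. The sample bytes and the file name `euro_demo.txt` are assumptions for illustration. In an LWP script such as the one in the question, `$res->decoded_content` would typically take care of the decoding step, whereas `$res->content` returns the raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive over HTTP: the euro sign is the
# three bytes E2 82 AC in UTF-8. The sample text is made up.
my $bytes = "\xE2\x82\xAC 5,00";

# Decode: bytes become a Perl character string starting with U+20AC.
my $text = decode('UTF-8', $bytes);

# Encode on the way out by putting an encoding layer on the filehandle.
my $out = 'euro_demo.txt';    # hypothetical output file name
open my $fh, '>:encoding(UTF-8)', $out or die "Cannot open $out: $!";
print {$fh} "$text\n";
close $fh or die "Cannot close $out: $!";

printf "%d bytes in, %d characters after decoding\n",
    length($bytes), length($text);
```

Skipping either half tends to produce the kind of junk described in the question: undecoded bytes get treated as separate characters, and unencoded output is written in whatever form Perl happens to hold internally.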

      Hi, thanks for your suggestion, but I don't know where to add those in my code. I tried, but the result did not change. Can you tell me where to add the encode/decode calls in my code? Thanks

        Useless reply, myfrndjk; show us the code you "tried" (presumably, 'added') and tell us, in detail, how it failed.


        Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
        1. code
        2. verbatim error and/or warning messages
        3. a coherent explanation of what "doesn't work" actually means.

        check Ln42!

Re: Perl prints only last line of array
by AnomalousMonk (Archbishop) on Jun 22, 2014 at 10:02 UTC
    ($domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;

    Just a side note: The regex quoted above is unlikely to be doing what you expect.

    Update: I'm not familiar with URL matching in general, but I cannot imagine this problem has not already been addressed in a Perl module (or modules!). Maybe search CPAN or MetaCPAN with terms like "URL regex".
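One module that may fit here is `URI`, which is not in core Perl but is installed as a dependency of LWP. The sketch below extracts the host and strips a leading `www.`; the helper name `domain_of` and the sample URL are hypothetical, and the result is only a host name, not a validated registered domain.

```perl
use strict;
use warnings;
use URI;    # not a core module; installed alongside LWP

# Return the host of a URL with a leading "www." removed.
# domain_of() is a hypothetical helper name, not part of URI.
sub domain_of {
    my ($url) = @_;
    my $uri = URI->new($url);

    # Only server-style schemes such as http and https have host().
    return unless $uri->can('host');

    my $host = $uri->host;
    return unless defined $host;

    $host =~ s/^www\.//i;
    return $host;
}

# Sample URL made up for illustration.
my $domain = domain_of('http://www.goo.eu/some/page');
print "$domain\n";
```

Unlike the regex in the question, this does not depend on the domain being a fixed number of characters long or on the literal text `www` appearing in the URL.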

      Hi, thanks for your explanation. But my regex is working fine; after the code change suggested by Mike, everything works. Thanks, JK

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $url = q{is 'www' really a domain?!?};
         print qq{($1)} if $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
        "
        ( really a domain?!)

Re: Perl prints only last line of array
by Anonymous Monk on Jun 22, 2014 at 00:30 UTC
    What do you think foreach(<@node>) does?

      Hi, I am new to Perl. I thought that I had to open a new HTML file for every URL. Now I understand that the open runs for every line, so it overwrote all the previous words and left only the last line. Correct me if I am wrong. Thanks, jk

        See readline and glob: <> is used for both readline and glob; it's not used for iterating over an array.
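To illustrate the point about glob: as perlop describes it, angle brackets around something that is not a filehandle are treated as a glob pattern, so `<@node>` behaves like `glob("@node")`. The sketch below uses the explicit `glob` call and made-up `@node` contents to show how that differs from iterating over the array directly.

```perl
use strict;
use warnings;

# Hypothetical node strings; note the space inside the first element.
my @node = ('one two', 'three');

# Direct iteration visits each array element exactly once.
my @direct;
foreach my $item (@node) {
    push @direct, $item;
}

# glob() on the interpolated array: "@node" becomes "one two three",
# which glob splits on whitespace into three separate patterns.
# Patterns without wildcards are returned as they are.
my @globbed = glob("@node");

print scalar(@direct),  " items from the array\n";
print scalar(@globbed), " items from glob\n";
```

With scraped text the difference matters more than in this toy case, since any element containing spaces is broken into words and characters such as `*`, `?`, or braces are interpreted as wildcards.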
