PerlMonks  

encoding ISO 8859 issue within a data dump

by Perlbeginner1 (Scribe)
on Oct 06, 2012 at 11:09 UTC (#997615=perlquestion)
Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

With Mechanize I get a dataset like the following.

Here is one chunk of the data:

Loosdorftown Ledochowskastraße 4 3382 Loosdorftown Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4


linux-wyee:/home/martin/perl #
The script below gives back a result like this one:
Loosdorf
Ledochowskastra▀e
3382 Loostown
Telefonnummer: 0002754 6257
FAX-Nummer: 0002754 6257-4

Well, we have the following options here.

To print to a file instead of to the screen, we just have to change:

say $text;

to:

print $OUT_FILE $text;

Some explanation: $OUT_FILE is a filehandle for the output file, which we have to open before entering the for loop.

This would work for the code as it is, but it might be different if we use the Text::CSV module, which probably has dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it, although I should probably change that, because I work with CSV files from time to time.) Let me describe in more detail what the output file should look like. Should the comma separate the fields of each address, or the records?


Take this site for example: katholisch.at

We have the following dataset:


I want each record separated into its parts. In other words, I would like a dataset with delimiters that separate the fields of lines like this one:

Loosdorf Ledochowskastraße 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

I would be very, very happy about that. Note: there is also an encoding issue. See "Ledochowskastraße": the output shows the sign "▀" in place of the "ß", so we have to take care of the ISO 8859 encoding, don't we?
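
A minimal sketch of why a "ß" can turn into "▀". It rests on an assumption the post does not confirm: that the scraped data is ISO-8859-1 and that the terminal interprets the bytes as a DOS code page such as cp850. Under that assumption the single byte 0xDF is the whole story, and the core Encode module makes it visible:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the street name arrives with "ß" stored as the single byte 0xDF.
my $byte = "\xDF";

# The same byte decoded under two different encodings.
my $as_latin1 = decode('ISO-8859-1', $byte);   # LATIN SMALL LETTER SHARP S
my $as_cp850  = decode('cp850',      $byte);   # UPPER HALF BLOCK

printf "ISO-8859-1: U+%04X\n", ord $as_latin1;
printf "cp850:      U+%04X\n", ord $as_cp850;
```

If the two code points differ, the data itself is probably intact and only the interpretation of the bytes is wrong.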


I would love it if you could give me some hints and a helping hand. That would be very supportive. Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.


See more results:
Marias Neustift Neustifttown 28 4443 Marias Neussstift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.inmarianeustift@dioezese-linz.at
Marias Puchheim Gmundnertown Straße 1b 4800 Attnanger-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.inmariapuchheim@dioezese-linz.at
Marias Scharten Schartenstown 1 4612 Schartensbook Telefonnummer: 007272/5210
Marias Schmolln Maria Schmollntown 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.inmariaschmolln@dioezese-linz.at
Mattighofen Römerstraße 12 5230 Mattighofentown Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.inmattighofen@dioezese-linz.at
Mauerkirchens Pfarrhofstraße 4 5270 Mauerkirchentown Telefonnummer: 007724/2262



Well, you see, we have an ISO 8859 encoding issue here.

What can we do? At the end of the day, I have to get everything into CSV format.

Re: encoding ISO 8859 issue within a data dump
by Anonymous Monk on Oct 06, 2012 at 11:11 UTC

    Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.

    Obvious lie is obvious



      I can try the Text::CSV module too...

      The Text::CSV module provides functions for both parsing and producing CSV data. However, we'll focus on the parsing functionality here. The following code sample opens the prospects.csv file and parses each line in turn, printing out all the fields it finds.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Text::CSV;

      my $file = 'prospects.csv';
      my $csv  = Text::CSV->new();

      # Read the file line by line and parse each line as one CSV record.
      open(my $fh, '<', $file) or die "Cannot open $file: $!";
      while (my $line = <$fh>) {
          if ($csv->parse($line)) {
              my @columns = $csv->fields();
              print "@columns\n";
          }
          else {
              my $err = $csv->error_input;
              print "Failed to parse line: $err";
          }
      }
      close $fh;


      Running the code produces the following output:

      Name Address Floors Donated last year Contact
      Charlotte French Cakes 1179 Glenhuntly Rd 1 Y John
      Glenhuntly Pharmacy 1181 Glenhuntly Rd 1 Y Paul
      Dick Wicks Magnetic Pain Relief 1183-1185 Glenhuntly Rd 1 Y George
      Gilmour's Shoes 1187 Glenhuntly Rd 1 Y Ringo

      And by replacing the line:
      print "@columns\n";


      with:

      print "Name: $columns[0]\n\tContact: $columns[4]\n";


      we can get more particular about which fields we want to output. And while we're at it, let's skip past the first line of our csv file, since it's only a list of column names.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Text::CSV;

      my $file = 'prospects.csv';
      my $csv  = Text::CSV->new();

      open(my $fh, '<', $file) or die "Cannot open $file: $!";
      while (my $line = <$fh>) {
          next if $. == 1;    # skip the header line with the column names
          if ($csv->parse($line)) {
              my @columns = $csv->fields();
              print "Name: $columns[0]\n\tContact: $columns[4]\n";
          }
          else {
              my $err = $csv->error_input;
              print "Failed to parse line: $err";
          }
      }
      close $fh;


      Running this code will give us the following output:
      Name: Charlotte French Cakes
              Contact: John
      Name: Glenhuntly Pharmacy
              Contact: Paul
      Name: Dick Wicks Magnetic Pain Relief
              Contact: George
      Name: Gilmour's Shoes
              Contact: Ringo




      Well, I can adapt this to my data by analogy. What do you think?
Re: encoding ISO 8859 issue within a data dump
by moritz (Cardinal) on Oct 06, 2012 at 12:17 UTC
Re: encoding ISO 8859 issue within a data dump
by bart (Canon) on Oct 06, 2012 at 12:18 UTC
    to print to a file instead of printing at the screen, we just have to change:
    say $text;
    to:
    print $OUT_FILE $text;
    Well, that's ignoring the most important distinction between say and print: that say adds a newline at the end. And you can add a filehandle argument to say. So you'd better do:
    say $OUT_FILE $text;
    If you set
    $OUT_FILE = \*STDOUT;
    or even
    $OUT_FILE = select;
    then you don't even have to swap out code.

    As far as your problem is concerned: look at Perl I/O layers, in particular the :utf8 and :encoding layers.
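
A minimal sketch of the I/O-layer approach, assuming the scraped text is ISO-8859-1 and the output should be UTF-8. The temporary files and the sample record are hypothetical stand-ins so that the example runs on its own; only the handle name $OUT_FILE comes from the thread.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempfile);

# Hypothetical input: one record written as raw ISO-8859-1 bytes (0xDF is "ß").
my ($tmp_in, $in_name) = tempfile(UNLINK => 1);
binmode $tmp_in;
print {$tmp_in} "Ledochowskastra\xDFe 4\n";
close $tmp_in or die "Cannot close $in_name: $!";

# Hypothetical output file.
my ($tmp_out, $out_name) = tempfile(UNLINK => 1);
close $tmp_out or die "Cannot close $out_name: $!";

# The layers decode ISO-8859-1 on the way in and encode UTF-8 on the way out.
open my $IN, '<:encoding(ISO-8859-1)', $in_name
    or die "Cannot read $in_name: $!";
open my $OUT_FILE, '>:encoding(UTF-8)', $out_name
    or die "Cannot write $out_name: $!";

while (my $text = <$IN>) {
    chomp $text;
    say $OUT_FILE $text;    # say appends the newline again
}

close $IN;
close $OUT_FILE or die "Cannot close $out_name: $!";
```

When printing to the terminal instead of a file, the same idea applies to STDOUT via binmode with an :encoding layer that matches what the terminal expects.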

      Hello dear bart,

      Many, many thanks to you! Great! You helped me with some very important steps and insights into Perl.

      By the way, I have a major problem here: I have to get from text to CSV. See:

      Loosdorf Ledochowskastraße 4 3382 Loosdorf Telefonnummer: 0123 4567 FAX-Nummer: 00123 4567-4

      I would be very, very happy about that. Note: there is also an encoding issue. See "Ledochowskastraße": the output shows the sign "▀" in place of the "ß", so we have to take care of the ISO 8859 encoding, don't we?


      I would love it if you could give me some hints and a helping hand. That would be very supportive. Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.


      See more results:
      Marias Neustift Neustifttown 28 4443 Marias Neussstift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.inmarianeustift@dioezese-linz.at
      Marias Puchheim Gmundnertown Straße 1b 4800 Attnanger-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.inmariapuchheim@dioezese-linz.at
      Marias Scharten Schartenstown 1 4612 Schartensbook Telefonnummer: 007272/5210
      Marias Schmolln Maria Schmollntown 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.inmariaschmolln@dioezese-linz.at
      Mattighofen Römerstraße 12 5230 Mattighofentown Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.inmattighofen@dioezese-linz.at
      Mauerkirchens Pfarrhofstraße 4 5270 Mauerkirchentown Telefonnummer: 007724/2262




      I want to delimit the different parts of the address.

      Well, you see, we have an ISO 8859 encoding issue here.

      What can we do? And the major question is: how do I delimit the parts in this big chunk of data?
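
One possible way to delimit such a record, sketched under stated assumptions rather than as a definitive solution. It assumes the labels "Telefonnummer:", "FAX-Nummer:" and "E-Mail:" always introduce their values, and that the first standalone four-digit number is the postal code. The helpers parse_record and csv_line are hypothetical names. The parish name and the street cannot be told apart reliably in the flattened line, so they stay in one field. The quoting is hand-rolled to keep the sketch core-only; the combine method of Text::CSV would do the same job more robustly.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split one flattened record into six fields.
sub parse_record {
    my ($line) = @_;

    # split with a capturing group returns: address, label, value, label, value, ...
    my ($address, %labelled) =
        split /\s*(Telefonnummer|FAX-Nummer|E-Mail):\s*/, $line;

    # Assumption: the first standalone four-digit number is the postal code.
    my ($name_street, $zip, $city) = $address =~ /^(.*?)\s+(\d{4})\s+(.*)$/;

    return [
        $name_street // $address,
        $zip  // '',
        $city // '',
        map { $labelled{$_} // '' } qw(Telefonnummer FAX-Nummer E-Mail),
    ];
}

# Hypothetical helper: quote every field and double any embedded quotes.
sub csv_line {
    my @quoted;
    for my $field (@_) {
        (my $copy = $field) =~ s/"/""/g;
        push @quoted, qq("$copy");
    }
    return join ',', @quoted;
}

# Sample record from the thread; \x{DF} is "ß".
my $sample = "Loosdorf Ledochowskastra\x{DF}e 4 3382 Loosdorf "
           . "Telefonnummer: 0123 4567 FAX-Nummer: 00123 4567-4";

print csv_line(@{ parse_record($sample) }), "\n";
```

Records without a fax number or e-mail address simply get empty fields, so every CSV line keeps the same six columns.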
Re: encoding ISO 8859 issue within a data dump
by aitap (Deacon) on Oct 06, 2012 at 15:38 UTC
      Hello dear buddy,


      Thanks, I will do so. By the way, I can do the text-to-CSV job with split too...
Re: encoding ISO 8859 issue within a data dump
by ww (Bishop) on Oct 07, 2012 at 02:50 UTC

    Please, please: read the instructions around each text-entry box for SOPW:

    Use: <p> text here (a paragraph) </p> and: <code> code here </code>
    Simply pasting data makes it difficult; sometimes impossible; to understand just what you're trying to show us.

    And I can't recall a single Perlbeginner1 SOPW that was correctly formatted... which makes me, at least, indisposed to even try to decipher your messes.
