PerlMonks  

Re^9: Text::CSV encoding parse()

by haukex (Chancellor)
on Aug 17, 2019 at 06:47 UTC (#11104604)


in reply to Re^8: Text::CSV encoding parse()
in thread Text::CSV encoding parse()

I've substituted my own string for yours and here's what I see now in the browser:
$VAR1 = "/search/\x{bf}Cuales son las partes de una cadena de conexi\x{f3}n??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0\n";
UTF-8 flag is on

/search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0|/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0

After Text::CSV: /search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0 | /com.ibm.jdbc_pg.doc/ids_jdbc_011.htm

\x{bf} or U+00BF is the inverted question mark and U+00F3 is “ó”, so AFAICT that looks correct, doesn't it?

BTW I'm assigning $data by reading a stdout output file that is returned from the db via the tool I'm using for the SQL query.

Ok, so do I understand correctly that you've tested the code you showed within the script I posted?

(BTW, why not use DBI to connect directly to the database instead of going through a file?)

I've also tried opening the file without the :encoding and get the same result.

Well, if you've got the use open qw/:std :utf8/; that I suggested at the top of the script, that also sets utf8 as the default encoding layer. As long as you're not getting warnings such as "Malformed UTF-8 character: \xbf" or "utf8 "\xBF" does not map to Unicode" (check the web server's error logs and/or run the script from the command line), then it would seem your file is probably encoded as UTF-8, and it would seem that everything is correct...
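As a rough illustration of the check described above, here is a minimal sketch using only core modules. The sample string and the helper name is_valid_utf8 are made up for this example and are not from the thread. It reads UTF-8 bytes back through a decoding layer and shows that a lone Latin-1 byte \xBF fails a strict UTF-8 decode, which is the kind of input that would trigger the warnings mentioned.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample line, written with escapes so the encoding of
# this source file does not matter.
my $chars = "/search/\x{bf}Cuales son las partes de una cadena de conexi\x{f3}n?";

# The bytes as they would be stored in a UTF-8 encoded file.
my $utf8_bytes = encode('UTF-8', $chars);

# Read the bytes back through an in-memory filehandle with a decoding layer,
# similar to opening a real file with "<:encoding(UTF-8)".
open my $fh, '<:encoding(UTF-8)', \$utf8_bytes or die "open: $!";
my $line = <$fh>;
close $fh;

# Hypothetical helper: returns 1 if the byte string is valid UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# The character right after "/search/" should be the inverted question mark.
printf "U+%04X\n", ord(substr($line, 8, 1));
print "valid UTF-8 file content: ", is_valid_utf8($utf8_bytes), "\n";
print "lone Latin-1 byte \\xBF:   ", is_valid_utf8("\xbfCuales"), "\n";
```

If the last check reported 1 for data taken from the real results file, the file would more likely be UTF-8; a 0 would suggest a different encoding such as Latin-1.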

So are you still seeing the same problem as before, or is it working now?

Replies are listed 'Best First'.
Re^10: Text::CSV encoding parse()
by slugger415 (Monk) on Aug 20, 2019 at 17:25 UTC

    \x{bf} or U+00BF is the inverted question mark and U+00F3 is “ó”, so AFAICT that looks correct, doesn't it?

    Yes it looks correct in my browser.

    Ok, so do I understand correctly that you've tested the code you showed within the script I posted?

    Yes.

    (BTW, why not use DBI to connect directly to the database instead of going through a file?)

    Working on that! The tool I'm currently using doesn't allow it, so I'm running a system command to execute the query. Researching direct DBI access but there are access issues within my org...

    So are you still seeing the same problem as before, or is it working now?

    Sadly problem still exists, yes. To summarize:

    foreach my $row (@sorted_urls){
        $csv->parse($row);
        my @els = $csv->fields;           #using Text::CSV
        my(@splits) = split('\|',$row);   #using split on same $row
        print $q->p($els[0]);             #prints wrong
        print %q->p(splits[0]);           #prints right
    }

    Honestly this is not worth wasting a lot of time on, though I really appreciate your help... thank you.

      So here it gets interesting. Is it possible to put that list of URLs online somewhere so I/we could test on them?

      If not, would it be possible to install Data::Peek and show me/us the output of

      foreach my $row (@sorted_urls) {
          DPeek ($row);
          $csv->parse ($row);
          my @csv = $csv->fields;          #using Text::CSV
          my @row = split m/\|/ => $row;   #using split on same $row
          DPeek "CSV: $csv[0]";
          DPeek "SPLIT: $row[0]";
          }

      I also have no idea what influence $q->p () has on the output, and I guess that %q->p is a typo.
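      If installing Data::Peek is not an option, a rough core-only stand-in for this kind of comparison might look like the sketch below. The helper name describe and the sample row are hypothetical, and its output is much less detailed than what DPeek reports; it only lists the UTF-8 flag, the length, and the code points of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a string as UTF-8 flag, length, and code points.
sub describe {
    my ($label, $str) = @_;
    my $cps = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    return sprintf '%s: utf8=%d len=%d [%s]',
        $label, (utf8::is_utf8($str) ? 1 : 0), length($str), $cps;
}

# Made-up sample row using the same pipe-separated layout as in the thread.
my $row = "a\x{bf}b|c\x{f3}d|0|1\n";
my @row = split m/\|/ => $row;

print describe('ROW',   $row),    "\n";
print describe('SPLIT', $row[0]), "\n";
```

      Calling describe on both the Text::CSV field and the split field for the same row would show whether the two strings differ in their code points or only in the internal UTF-8 flag.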


      Enjoy, Have FUN! H.Merijn

        I can't really give you the whole shebang but here are a couple of URLs, including the first one which has the Spanish characters.

        https://www.ibm.com/support/knowledgecenter/es/search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0|https://www.ibm.com/support/knowledgecenter/es/SSGU8G_12.1.0/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0
        https://www.ibm.com/support/knowledgecenter/search/onsmsync?scope=SSGU8G_12.1.0|https://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.sec.doc/ids_lb_002.htm|1|1|1|1

        Thanks!

      Sadly problem still exists, yes. To summarize:

      That's strange. Could you please show a complete example of the code that doesn't work, i.e. the full script along with its output? Also, to play it safe, try upgrading your installations of Text::CSV and Text::CSV_XS.

        Hi @haukex, sorry, I can't give you the whole script due to security and privacy concerns, but I can give you the salient parts of it.

        #execute the query using Aginity Workbench; output saved to flat file
        my($res) = system($cmd);

        ### read the output
        my(@urls);
        my($header);
        # "or" instead of "||" so that die() actually fires when open fails
        open my $fh, "<:encoding(utf8)", "$resultsFile"
            or die("cannot open results file $resultsFile for reading.\n cmd: $cmd");
        my($c)=0; # just here for counting
        my($d)=0; # just here for counting
        while(<$fh>){
            $c++;
            if($c == 1) { # get header row
                $header = $_;
            }
            if ($_ =~ /\/search\//){
                push(@urls, $_);
            }
            else{
                $d++;
            }
        }
        close($fh);

        # sort @urls based on the search string
        # e.g. https://www.ibm.com/support/knowledgecenter/es/search/¿Cuales son las partes de una cadena de conexión??scope=SSGU8G_12.1.0|https://www.ibm.com/support/knowledgecenter/es/SSGU8G_12.1.0/com.ibm.jdbc_pg.doc/ids_jdbc_011.htm|0|1|1|0
        my @sorted_urls = map { $_->[0] }
            sort { $a->[1] cmp $b->[1] }
            map { m|/search/\s*([^\?]+)\?|; [$_, $1] }
            @urls;

        # parse and print
        print $q->header(-charset => 'utf-8');
        print $q->start_html( -title => 'SearchME', -style=>{'src'=>$stylesheet});
        print $q->start_table();

        foreach my $row (@sorted_urls){
            # print TEMP $row;
            $csv->parse($row);
            print "<tr>";
            $count++;
            my @els = $csv->fields;
            my(@splits) = split('\|',$row);
            $els[0] =~ /\/search\/(.+)\?scope=/i;
            my($term) = $1;
            my($link) = $els[0];
            print "<td>";
            # print $link;
            print $q->a({-href=>$link,-target=>'_blank'},$term);
            print "</td>";
            # print other @fields here inside <td></td>
        }
        print $q->end_table, $q->end_html;

        Oh, and I reinstalled both modules.
