PerlMonks
Re^2: Matching alphabetic diacritics with Perl and Postgresql

by anonymized user 468275 (Curate)
on Jun 04, 2017 at 11:29 UTC [id://1192119]


in reply to Re: Matching alphabetic diacritics with Perl and Postgresql
in thread Matching alphabetic diacritics with Perl and Postgresql

And you are the winner! open my $fh, "<encode(UTF8)", $csvFile fixed it so that the queries now work. The owners of the original data were using UTF8 to put apostrophes in their database, or perhaps to write them in the CSV file. Writing them to my own database as ASCII was OK, but subsequently RSE's would only work if they were also constructed using UTF8. So provided Perl knows it's UTF8 from the outset, DBI constructs the queries correctly.
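For the record, here is a minimal self-contained sketch of the approach: reading the CSV through an explicit UTF-8 layer so DBI gets already-decoded strings. The file name and contents are illustrative, and the block writes its own sample file so it can run as-is:

```perl
use strict;
use warnings;

# Illustrative: create a small UTF-8 CSV so the example is self-contained.
my $csvFile = "sample.csv";
open my $out, ">:encoding(UTF-8)", $csvFile or die "write $csvFile: $!";
print $out qq{"caf\x{e9}","cr\x{e8}me"\n};
close $out;

# The fix: an explicit :encoding(UTF-8) layer makes Perl decode the
# bytes on read, so the strings handed to DBI are proper characters,
# not raw UTF-8 octets.
open my $fh, "<:encoding(UTF-8)", $csvFile or die "read $csvFile: $!";
while (my $line = <$fh>) {
    chomp $line;
    # ... build and run the query from the decoded fields ...
}
close $fh;
```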

One world, one people


Replies are listed 'Best First'.
Re^3: Matching alphabetic diacritics with Perl and Postgresql
by Tux (Canon) on Jun 04, 2017 at 11:37 UTC

    Two points.

    • What you posted is not what you mean; the correct syntax includes a colon and has a different spelling: open my $fh, "<:encoding(utf-8)"
    • Use a CSV parser that handles UTF-8, like Text::CSV_XS: my $aoh = csv (in => "file.csv", encoding => "utf-8");
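A sketch of the second suggestion, the Text::CSV_XS `csv` function. The round trip here (write then read) is illustrative so the example runs as-is; `headers => "auto"` is an added option that turns the first row into hash keys, giving an array of hashes rather than an array of arrays:

```perl
use strict;
use warnings;
use Text::CSV_XS qw(csv);

# Illustrative round trip: write a tiny UTF-8 CSV, then read it back.
csv (in => [["name"], ["caf\x{e9}"]], out => "file.csv", encoding => "utf-8");

# headers => "auto" makes the first row into hash keys (array of hashes);
# without it csv () returns an array of arrays.
my $aoh = csv (in => "file.csv", encoding => "utf-8", headers => "auto");
print $aoh->[0]{name}, "\n";
```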

    Enjoy, Have FUN! H.Merijn
      You are right - I did code it correctly in the .pl, but not in the post here (just mis-typed it from memory). Re Text::CSV: that's what I did at first, but having switched to a plain open and read so I could debug my draft-code issues with clarity, there is no reason to switch back to Text::CSV, given that the CSV file used is predictable enough to remove the first and last chars and then split /\"\,\s*\"/. You could argue that this is a "not invented here" approach, but I am even more loath to use CPAN sledgehammers to crack tiny little nuts where a few characters are all that are needed to avoid loading a module. Think: performance! In some cases it is less obvious whether to use the CPAN module, but this one seems clear enough, although I will move it to a utility module where it can readily be replaced with a use of Text::CSV if circumstances change.

      One world, one people

        There is a speed comparison page available. Your split will FAIL on one of the easiest pitfalls used for the timing. This perfectly formatted CSV line will break any split pattern, and it does not even contain embedded newlines:

        hello,","," ",world,"!"
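To make the failure concrete, here is a sketch comparing a real parser against the strip-and-split approach on that exact line. Text::CSV_XS recovers the five intended fields, because the quoted "," and " " are data, not separators; the naive split cannot, since not every field on the line is quoted:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $line = 'hello,","," ",world,"!"';

# A real CSV parser sees five fields.
my $csv = Text::CSV_XS->new ({ binary => 1 });
$csv->parse ($line) or die scalar $csv->error_diag;
my @fields = $csv->fields;
# @fields is ("hello", ",", " ", "world", "!")

# The strip-first-and-last-char-then-split approach from above
# mangles this line, because it assumes every field is quoted.
my @naive = split /",\s*"/ => substr $line, 1, -1;
printf "parser: %d fields, naive split: %d pieces\n",
    scalar @fields, scalar @naive;
```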

        If you need correct CSV parsing of purely strict CSV, and that excludes space after the separator, use a module like Text::CSV::Easy_XS, which allows no deviation from the standard. If you require speed in addition to robustness, options (like space after the separator) and a usable interface, use Text::CSV_XS. If XS is not an option, use any of the _PP variants.

        The more data you have to parse, the happier a module will make you. Its loading time is far outweighed by the headaches it saves you in finding possible breakages.


        Enjoy, Have FUN! H.Merijn

        Think: performance!

        I'm pretty sure Tux has thought quite a bit about performance. Why don't you try benchmarking your routine against Text::CSV_XS? I think you may be surprised just how performant that CPAN sledgehammer is.
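A sketch of such a benchmark with the core Benchmark module, using a simple all-quoted line (the only shape the split handles at all); the iteration count is illustrative:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Text::CSV_XS;

my $line = '"one","two","three","four"';
my $csv  = Text::CSV_XS->new ({ binary => 1 });

# Compare the hand-rolled split against the XS parser on the
# same line; cmpthese prints a rate table for both.
cmpthese (10_000, {
    naive_split => sub {
        my @f = split /",\s*"/ => substr $line, 1, -1;
    },
    csv_xs => sub {
        $csv->parse ($line) or die;
        my @f = $csv->fields;
    },
});
```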


        The way forward always starts with a minimal test.
