Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: Matching alphabetic diacritics with Perl and Postgresql

by chacham (Prior)
on Jun 03, 2017 at 19:15 UTC ( #1192075=note: print w/replies, xml ) Need Help??


in reply to Matching alphabetic diacritics with Perl and Postgresql

Aside from using dynamic SQL and incurring its ills, running a query to check for duplicates when the actual insertion will be another query, allows a duplicate to be inserted between the two statements, unless the table is locked. Hence, it is generally recommended to use an atomic statement, that is, insert into...where not exists(); This guarantees the record is not there at the time of insertion itself.

Also, if you have a CSV file with millions of records, it would likely be best to import the data using something like COPY, and either put it directly into the table you want, or use a staging table with no constraints, and then a simple insert into target(cola, colb, ...) select cola, colb, ... from staging where not exists(select * from target...).

  • Comment on Re: Matching alphabetic diacritics with Perl and Postgresql

Replies are listed 'Best First'.
Re^2: Matching alphabetic diacritics with Perl and Postgresql
by (anonymized user) (Curate) on Jun 03, 2017 at 19:36 UTC
    Chicken and egg: I first have to ensure what I insert can be matched before I can allow other users in, at which point yes, I am then in a position to use a where not exists clause. Until then, the where not exists will not help me.

    Regarding COPY - I have constraints from other relations than the one with the issue:

    Location -> lelo <- legal_entity

    lelo_type -> lelo

    and some others that have to be dealt with after. So I have to generate unique location ids, unique le_ids and join them in lelo which has unique ids per type, location and entity. (at this point, type of lelo is registered office, although other types will come as the database develops). COPY cannot be used when building related data simultaneously and needing a single transaction for all related data in different tables especially where fk constraints are needed.

    One world, one people

      I see you have thought this one out. I would still like to suggest that you add a staging table and copy the data into it before processing, and then do all you processing in sql. This should avoid the issues as the database itself will be doing the comparisons, and with where not exists, at the same time as the insertion.

      If there are multiple steps, a stored procedure can be used, or a series stored procedures called from another one, should you want separate entry points.

        Staging table means all my efforts to clean the data have to be migrated from Perl to SQL. This is particularly awkward for location. To take an example, the registered office at Companies House might be as daft as:

        Country of Origin: Channel Islands

        RegAddressCountry: GY1 4PW

        RegAddressCounty: St. Peterport

        PostTown:

        Address Line1:MR FRED BLOGGS

        Address Line 2: The Company Name

        CompanyName: A variation of the company name

        Care of:

        Postcode:

        So the algorithm to clean the data is broadly: delete or translate some countries (Channel Islands is deleted because it isn't a country. But postcode GY1 implies the country is Guernsey. For my model, the county is also Guernsey as opposed to Alderney. If more than one level is on the same line they have to be split. If data is at too high a level, e.g. the registered country was mistakenly put in Country of Origin and then everything moved up one from where it should be, defaults have to be pushed on the stack moving the location lines down (largest location items being higher in the list, the way I am doing it). So I have hashes of common translations e.g. if Curacao has a sedilla whereas my country table does not have sedillas and BVI gets unabbreviated to British Virgin Islands. The number of ways Ireland gets spelled is particularly astounding ROI, Rep of. etc.. Just about every Republic of something needs different translations there are even several ways for China. Mis-spellings like United Kinmod etc. Scotnadl, Isalnds, Isles-> Islands all have to be corrected, there's a method that splits location lines where pattern matching is needed, a list of deletable countries including Channel Islands and British West Indies to force the next level to be promoted to a country, a list of pushable items where the country is missing already, like Leicester being entered as a country in the register and the facility to assume registered country is the same as the country of origin for some cases. I could go on forwever, but it's about 200 lines of OO Perl and would be about 100 pages in SQL if I load it in before processing.

        Update: Yes, I could call the Perl from Postgres after staging, at least I can do that in my dev. env., but I can't expect that to hold true when it gets ported to the hosting env.

        One world, one people

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1192075]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2021-01-25 01:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?