
Re^5: Matching alphabetic diacritics with Perl and Postgresql

by chacham (Prior)
on Jun 04, 2017 at 03:33 UTC ( #1192101=note )

in reply to Re^4: Matching alphabetic diacritics with Perl and Postgresql
in thread Matching alphabetic diacritics with Perl and Postgresql

I see. I'll just add my 2 cents for your perusal.

Batch processing with loaded data is almost always better in the database, simply because the database is made for it. It ends up being faster because you are not switching between systems, the database can operate on sets (as opposed to row by row), and character-encoding issues are unlikely to come up. So, before you even start, you're already ahead.

I'm relatively sure you can do it in a few statements too, if done correctly. I have done things like this before; on one job, I created a batch table to record (and time) the different steps, allowing multiple batches to run at the same time (because the steps do not depend on the batches, per se).

If there are common misspellings, I would likely use a translation table to change misspellings to the accepted spelling, or move a country to the next step. A small query would do that (perhaps as part of a larger query). Anything not found would be flagged for human perusal, which would happen less and less with each run. Having the translation table in the database is likely better for managing it anyway, and you can even keep fun numbers like how often each misspelling shows up.
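A minimal sketch of that approach in PostgreSQL. The table and column names (staging, country_fixup, countries) are hypothetical, stand-ins for whatever the actual load uses:

```sql
-- Hypothetical tables: "staging" holds the loaded rows,
-- "country_fixup" maps common misspellings to the accepted spelling.
CREATE TABLE country_fixup (
    misspelling text PRIMARY KEY,
    canonical   text NOT NULL,
    times_seen  bigint NOT NULL DEFAULT 0
);

-- Keep the "fun numbers": count how often each misspelling shows up.
UPDATE country_fixup f
SET    times_seen = f.times_seen + s.n
FROM  (SELECT country, count(*) AS n
       FROM staging GROUP BY country) s
WHERE  s.country = f.misspelling;

-- Fix the misspellings in one pass.
UPDATE staging s
SET    country = f.canonical
FROM   country_fixup f
WHERE  s.country = f.misspelling;

-- Anything still unmatched is flagged for human perusal.
SELECT DISTINCT s.country
FROM   staging s
LEFT JOIN countries c ON c.name = s.country
WHERE  c.name IS NULL;
```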

To remove spaces and new lines, Postgres has replace functions, including regex versions. That's one very simple query.
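For example, assuming the same hypothetical staging table, one statement does the whole cleanup:

```sql
-- regexp_replace with the 'g' flag collapses every run of
-- whitespace (spaces, tabs, newlines) into a single space;
-- btrim removes whatever is left at the ends.
UPDATE staging
SET    country = btrim(regexp_replace(country, '\s+', ' ', 'g'));
```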

In my experience, the sql code is smaller, not larger, if written correctly. However, if you already have it written in perl, there's no reason to reinvent the wheel, unless a large problem comes up, or you want/need processing to take (what might be) a fraction of the time.


Replies are listed 'Best First'.
Re^6: Matching alphabetic diacritics with Perl and Postgresql
by marinersk (Priest) on Jun 04, 2017 at 10:19 UTC

    Also, a thought:

    If throughput efficiency is a trade-off you can afford, using an INSERT into a staging table and then pulling it back out again might allow you to use Postgres as a 100% Postgres-compatible data filter, if it changes the data upon storage.
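    A sketch of that round trip (the scratch table name is made up). Whatever conversion Postgres applies at storage time, such as client-to-server encoding translation, is reflected when the value is read back:

    ```sql
    -- Use a throwaway table as a "Postgres-compatible filter":
    -- whatever the server does to the value on INSERT is what
    -- you get back on SELECT.
    CREATE TEMP TABLE scrub (raw text);
    INSERT INTO scrub (raw) VALUES ('Côte d''Ivoire');
    SELECT raw FROM scrub;  -- the value as Postgres actually stored it
    DROP TABLE scrub;
    ```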

    Everything in engineering is a trade-off; the idea of introducing an execution inefficiency is naturally anathema to an engineer's way of looking at a problem. But if this approach causes a data translation to occur, putting the onus on Postgres to perform the translation for you does have the merit of adjusting perfectly each and every time Postgres changes the way it handles the data.

    Effectiveness is often more important than efficiency; if no more efficient solution can be found and if this does the job, it could also be used as a work-around until a more graceful solution can be found.

    I'd document the heck out of it, though. Have mercy on the poor next soul who has to touch the code.

Re^6: Matching alphabetic diacritics with Perl and Postgresql
by (anonymized user) (Curate) on Jun 04, 2017 at 11:18 UTC
    Although this is an initial load scenario, data validation will eventually need to be performed from the browser. It might be advisable to move such code upward from Perl in the direction of JavaScript at some point, but moving it downward into the database is going in the wrong direction, creating not only unnecessary internet communication but unnecessary communication between application and DBMS. Ideally, data should be cleaned before it gets to the database. One popular late-90s solution in market data warehousing was to use a special language called Formula Engine to clean data, based on a C++ library, before it was inserted into a hybrid database (SQL/TSDBMS). But in today's paradigm, the functionality is destined eventually to move to the GUI so that the data is clean before it even hits the server.

    One world, one people

      moving it downward into the database is going in the wrong direction

      I disagree. If anything, since the data can be manipulated from more than one location, the best place is "downward" in the RDBMS. That is the one centralized location for your data. Anything else adds more code that must be kept in sync, reinvents the wheel, and allows for inconsistencies.

      creating not only unnecessary internet communication

      No matter what, you have to communicate with the database, and the communication at some point goes over the Internet anyway, so you have not saved anything. Even if you mean that the UI can handle some of the manipulation without communication, that would include none of the validations that require the database. I do not see how you are saving anything significant communication-wise.

      but unnecessary communication between application and DBMS.

      I fail to understand what you even mean here, sorry. Communication between the application and the database is the single most important factor in a data-driven application. In many cases, the drivers automatically compress the data for you as well.

      Ideally, data should be cleaned before it gets to the database.

      You seem to be viewing the database as a dumb data container instead of the complex data master that it is. The database is made for this. There are keywords and functions specifically designed to help you clean and manipulate your data. To do it elsewhere first and then put it in the database, is reinventing the database outside of it.
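      As one illustration of that point (hypothetical table name), validity rules can live in the schema itself, so bad rows are rejected once, centrally, rather than re-checked by every application:

      ```sql
      -- A CHECK constraint makes the database enforce "clean" values:
      -- no leading/trailing whitespace, no empty names.
      CREATE TABLE country (
          name text PRIMARY KEY
              CHECK (name = btrim(name) AND name <> '')
      );
      ```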

        It is a well-known performance issue; one should minimise the traffic between browser <-> internet <-> application <-> DBMS. I am not sure what else I can say to explain it. The need to clean the data in SQL arose because the original user interface did not do it. If I take your advice, then either I repeat that mistake, making the project pointless, or I complete transactions that load invalid data into the database before returning an error to the browser to get the user to fix it. Your belief that anything can be done in SQL surely cannot reasonably extend to pre-empting data entry mistakes? It certainly appears that way.

        One world, one people
