
Help on selecting and deleting many records in a DB.

by neilwatson (Priest)
on Apr 05, 2014 at 15:52 UTC ( #1081256=perlquestion: print w/replies, xml ) Need Help??
neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a database application with Perl and PostgreSQL, using DBI and Mojolicious. The main DB table is very long and I'm trying to come up with a way to reduce the data. At first I tried a pure SQL approach, but the results were disappointingly slow. Now I'm considering a Perl approach. If you have the patience, please read on.

Each line in the table is a log. There will be many rows that are the same except for the timestamp. As the data ages, I want to cull the rows. I want to group records that have the same combination of class, ip_address, and hostname, and keep the highest timestamp for each day from each group.

Table "public.agent_log"

     Column      |           Type           |                          Modifiers
-----------------+--------------------------+--------------------------------------------------------------
 class           | text                     |
 hostname        | text                     |
 ip_address      | text                     |
 promise_handle  | text                     |
 promiser        | text                     |
 promisee        | text                     |
 policy_server   | text                     |
 rowId           | integer                  | not null default nextval('"agent_log_rowId_seq"'::regclass)
 timestamp       | timestamp with time zone |
 promise_outcome | text                     |
Indexes:
    "primary_key" PRIMARY KEY, btree ("rowId")
    "client_by_timestamp" btree ("timestamp", class)

Example data

2014-01-22T13:44:00 any   ...
2014-01-22T14:44:00 any   ...   KEEP
2014-01-22T14:44:00 any   ...   KEEP
2014-01-22T13:44:00 any   ...
2014-01-23T13:44:00 any   ...
2014-01-23T14:44:00 any   ...   KEEP
2014-01-23T14:44:00 any   ...   KEEP
2014-01-23T13:44:00 any   ...
2014-01-22T13:44:00 cpu_1 ...
2014-01-22T14:44:00 cpu_1 ...   KEEP
2014-01-22T14:44:00 cpu_1 ...   KEEP
2014-01-22T13:44:00 cpu_1 ...
2014-01-23T13:44:00 cpu_1 ...
2014-01-23T14:44:00 cpu_1 ...   KEEP
2014-01-23T14:44:00 cpu_1 ...   KEEP
2014-01-23T13:44:00 cpu_1 ...

I tried aggregating the data with

SELECT class, max(timestamp) AS timestamp, hostname, ip_address,
       promise_handle, promiser, promisee, policy_server, promise_outcome
FROM agent_log
WHERE timestamp < now() - interval '7 days'
GROUP BY class, DATE_TRUNC('day', timestamp), hostname, ip_address,
         promise_handle, promiser, promisee, policy_server, promise_outcome

I wrapped that in a query that didn't quite work, but was essentially DELETE * FROM agent_log WHERE NOT IN... It was painfully slow. Now I'm wondering if I can do this with Perl instead. Using a cursor I could fetch rows, group them, and delete the lower-date ones. Is it likely that using Perl in any way could be faster than raw SQL? What would you do?
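Roughly something along these lines, keyed on rowId (a sketch, not my exact query, and it assumes rowId grows with timestamp within a group):

-- Sketch of the wrapper I attempted (not the exact query I ran):
DELETE FROM agent_log
WHERE timestamp < now() - interval '7 days'
  AND "rowId" NOT IN (
      SELECT max("rowId")
      FROM agent_log
      GROUP BY class, ip_address, hostname, DATE_TRUNC('day', timestamp)
  );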

Neil Watson

Replies are listed 'Best First'.
Re: Help on selecting and deleting many records in a DB.
by erix (Vicar) on Apr 05, 2014 at 16:23 UTC

    Perl will simply run SQL via DBI; it will not be faster than running the SQL directly (it will even be a very little slower). And you can run cursors in straight SQL too (well, in plpgsql); cursors are not specific to Perl.

    Perhaps perl gives you some more flexibility.

    Which version of postgresql?

    How large is the table (in GB and in rowcount)? And how large the part to be deleted?

    One approach to data removal is to set up the data as a partitioned table (partitioned by date/timestamptz), which then lets you regularly drop the oldest partition. See the fine manual (on that page, read also the partitioning caveats; mainly: don't use more than a few hundred partitions).
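    On pre-9.x postgres that means table inheritance plus CHECK constraints; a minimal sketch (the partition names here are made up):

    -- One child table per month, attached to the parent by inheritance:
    CREATE TABLE agent_log_2014_01 (
        CHECK ( "timestamp" >= '2014-01-01' AND "timestamp" < '2014-02-01' )
    ) INHERITS (agent_log);

    -- Aged-out data is then removed instantly, with no DELETE scan:
    DROP TABLE agent_log_2013_12;

    Routing inserts to the right child (via trigger or application logic) is the extra cost you pay for the cheap drops.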

      PG 8.4. Millions of rows. Several GB. I wish to delete a significant fraction of the table. How much depends on the size of application deployment.

      Neil Watson

        PG 8.4.

        I hope you realise that 8.4 goes out of support in 3 months. It's decidedly long in the tooth (to use another equine saying).

        What I would do is use a recent postgres version (9.3.4 and even 9.4devel (which is almost in feature-freeze anyway)), fix any database problems on those, and then (if your app really needs it) work back to fix it on legacy versions (i.e.: 9.2.8, 9.1.13, 9.0.17). It's time to drop support for 8.4 (I see you even still mention 8.3 on your pages...).

        And RichardK is right: get yourself acquainted with EXPLAIN ANALYZE output (at least to the point where you can tell whether an index is used, and useful).

        If you can't get it sorted out with an index, I still think partitioning is the best way to solve your problem. I think all the talk of stored procedures is a red herring.

        IRC freenode #postgresql is often full of people who relish this kind of problem, even if it is somewhat FAQ-ish (sometimes you have to be patient).

        Do let us know how you solve this. I'm interested. (If it wasn't 8.4 I'd probably have setup a little trial example...)

        update: changed order of paragraphs: first try appropriate indexing; if that fails, partition.

Re: Help on selecting and deleting many records in a DB.
by RichardK (Parson) on Apr 05, 2014 at 16:31 UTC

    Doing the work in the database is almost always faster. Did you try running postgresql's EXPLAIN on your query?

    Do you have enough indexes on your table?

      Since a unique record is based on a combination of most fields the current indexes do not work well for this operation. Would adding another field, that is a concatenation of the other fields, and indexing that be a good approach?

      Neil Watson

        No, you need to create an index that matches your query terms. See:

        You can try this :

        DELETE FROM agent_log t1
        WHERE timestamp < now() - interval '7 days'
          AND timestamp < (SELECT max(timestamp) FROM agent_log t2
                           WHERE t2.class = t1.class
                             AND t2.hostname = t1.hostname
                             AND t2.ip_address = t1.ip_address)

        I suppose you have an index on timestamp and the other fields? A surrogate key might help, or even using timestamp as primary key, if possible.

        If you can create a history table, then you could:

        SELECT class, max(timestamp) AS timestamp, hostname, ip_address,
               promise_handle, promiser, promisee, policy_server, promise_outcome
        INTO history_table
        FROM agent_log
        WHERE timestamp < now() - interval '7 days'
        GROUP BY class, DATE_TRUNC('day', timestamp), hostname, ip_address,
                 promise_handle, promiser, promisee, policy_server, promise_outcome;

        DELETE FROM agent_log WHERE timestamp < now() - interval '7 days';

        That would be a good start towards partitioning your data, which you might need to consider.


      I did some testing with a small sampling of data. Explain and analyze below.

      Seq Scan on agent_log  (cost=48184.35..52265496.54 rows=101705 width=122) (actual time=5288.903..175884.135 rows=203428 loops=1)
        Filter: (("timestamp" < (now() - '7 days'::interval)) AND (NOT (SubPlan 1)))
        SubPlan 1
          ->  Materialize  (cost=48184.35..48646.77 rows=20342 width=77) (actual time=0.022..0.317 rows=2604 loops=203428)
                ->  GroupAggregate  (cost=44040.21..47905.01 rows=20342 width=77) (actual time=4417.518..5286.975 rows=2604 loops=1)
                      ->  Sort  (cost=44040.21..44548.73 rows=203410 width=77) (actual time=4417.496..5223.285 rows=203428 loops=1)
                            Sort Key: (date_trunc('day'::text, public.agent_log."timestamp")), public.agent_log.class, public.agent_log.promiser, public.agent_log.hostname, public.agent_log.ip_address
                            Sort Method: external merge  Disk: 19648kB
                            ->  Seq Scan on agent_log  (cost=0.00..8024.51 rows=203410 width=77) (actual time=0.011..175.827 rows=203428 loops=1)
                                  Filter: ("timestamp" < (now() - '7 days'::interval))
      Total runtime: 175924.695 ms

      Neil Watson

Re: Help on selecting and deleting many records in a DB.
by sundialsvc4 (Abbot) on Apr 06, 2014 at 11:47 UTC

    If you have "millions of rows," the last thing you want to do is to send those millions of rows down a pipe to another computer, or even to another program on the same one. You will want to do this in a stored procedure that is executed by the server, on the server. Furthermore, you will probably want this stored procedure to execute a loop which runs through each record in a query such as SELECT class, ip_address, hostname, MAX(timestamp) .. GROUP BY .., and which then, for each record found (each one representing a single group and providing the timestamp that should be kept): starts a transaction, deletes the qualifying records (using a single DELETE), and then commits.

    However, to do that efficiently, you will need to create an index on (class, ip_address, hostname), and I would do that first. You may well find that, when you have done this, a pure-SQL solution using a NOT IN query will now yield acceptable performance. (Although it may be too big to run as a single transaction.) The absence of such an index is probably why your existing approach is running too slow ... and why every other one would run too slow, as well.
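    For instance (the index name is invented; the trailing timestamp column is optional, but it lets the per-group MAX be answered from the index):

    CREATE INDEX agent_log_group_idx
        ON agent_log (class, ip_address, hostname, "timestamp");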

Re: Help on selecting and deleting many records in a DB.
by Anonymous Monk on Apr 05, 2014 at 17:44 UTC
    "IS (NOT) IN" is probably the slowest operation in SQL.
Re: Help on selecting and deleting many records in a DB.
by Anonymous Monk on Apr 05, 2014 at 17:30 UTC
    The faster answer for complex queries is always to define a SQL function (this can be done in PL/PERL).

      Can you explain? I don't understand why "a SQL function" would do anything to make this faster, or even easier in any way.

      (Btw, in postgres, "SQL function" normally means a function written in SQL, i.e. a PL/SQL function [1] (as opposed to a PL/pgSQL function [2] (or PL/Perl [3], PL/Tcl, PL/Python, etc)).

      [0] Server Programming

      [1] PL/SQL

      [2] PL/pgSQL

      [3] PL/Perl


        Creating a function for your needs can be as simple as:

        1. Write this code in a file called myfun.test:

        CREATE OR REPLACE PROCEDURAL LANGUAGE plperlu;
        ALTER PROCEDURAL LANGUAGE plperlu OWNER TO postgres;

        CREATE OR REPLACE FUNCTION mytest(text) RETURNS text
        LANGUAGE plperlu
        AS $_X$
            my $data   = $_[0];
            my $string = $data . " -->from mytest\n";
            return $string;
        $_X$;

        (Of course you can have as many arguments, and whatever return type, you want in your function's definition.)

        2. Insert/replace this function in your db by issuing the command line:

        # \i /path/to/myfun.test

        3. Run the test (given a table called data and a text field called f1):

        # select mytest(f1) from data limit 10

        And that's it... Once your function is ready you can dump/restore it permanently into your db if you want.

        If you know exactly what you are doing, don't need to experiment a little, and love to type long command lines in SQL, then sure, everything can be done there. It's clear that with your level of awareness, such things don't bother you. There are other reasons too - for example, subsequent function calls can dispense with nested queries - but I have no time to go into detail. Cheers!
Re: Help on selecting and deleting many records in a DB.
by nikosv (Chaplain) on Apr 06, 2014 at 15:17 UTC
    Some thoughts :

    1. Is it possible to build an index on the columns that make up the GROUP BY? Make it a hash index, since you work on exact equality matches; the delete will then operate a lot faster.

    2. Try using NOT EXISTS instead of NOT IN (they are not semantically identical, but in most cases they match).

    3. Having a function, DATE_TRUNC, inside the GROUP BY slows the query a lot. Check the query plan to see whether that is true here.
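    For point 2, the cull can be recast with a correlated subquery instead of NOT IN - a sketch (NULL handling differs slightly, so verify on a sample first): a row older than 7 days is deletable whenever a newer row exists in the same group on the same day.

    DELETE FROM agent_log t1
    WHERE t1.timestamp < now() - interval '7 days'
      AND EXISTS (
          SELECT 1 FROM agent_log t2
          WHERE t2.class      = t1.class
            AND t2.hostname   = t1.hostname
            AND t2.ip_address = t1.ip_address
            AND DATE_TRUNC('day', t2.timestamp) = DATE_TRUNC('day', t1.timestamp)
            AND t2.timestamp > t1.timestamp
      );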

Node Type: perlquestion [id://1081256]
Front-paged by toolic