atemerev has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I have a fairly modest dataset of financial data (high-frequency currency quotes collected over several years), around 15 GB in size. I used to keep it in plain text files, but now I'd like to run some simple queries against it. Hence, I need a database. I tried MySQL, but 15 GB seems to be too much for it once the indices are in place, so I switched to PostgreSQL.
Now to the question. The usual scenario for this database is going to be 'give me the whole set of quotes for a certain few days, for later aggregation'. This produces a resultset of hundreds of megabytes, and it's obvious that I don't want to pull it all into memory at once for every aggregation request (even with caching, there can be several requests per minute). So, I need a cursor. And guess what? Cursors don't seem to be supported in DBD::Pg at all!
Can you enlighten me on the alternatives I have in this situation? Figure out how to use PL/Perl? Use another database? Drop the database altogether and stick to my beloved plain text data files? Something else? I'd really appreciate a solution.
Re: PostgreSQL cursors with Perl.
by erix (Prior) on Apr 17, 2008 at 14:00 UTC
Is there a reason you cannot use the example as documented in the DBD::Pg docs? Maybe you are using old versions of the server or driver?
The example in the DBD::Pg docs seems to run out of the box. Below I used the example from the documentation and added an actual SQL statement to get it running.
sub run_cursor {
    my ($dbh) = @_;
    my $sql = "
        select
              x                                    as column1
            , current_date - x * interval '1 day'  as column2
        from generate_series(1,25) f(x)
    ";
    # WITH HOLD is not needed if AutoCommit is off
    $dbh->do("DECLARE csr CURSOR WITH HOLD FOR $sql");
    while (1) {
        my $sth = $dbh->prepare("fetch 10 from csr");
        $sth->execute;
        last if 0 == $sth->rows;
        while (my $row = $sth->fetchrow_arrayref) {
            for (my $i = 0; $i < $sth->{NUM_OF_FIELDS}; $i++) {
                print $sth->{NAME_lc}->[$i], " = ", $row->[$i], "\t";
            }
            print "\n";
        }
    }
    $dbh->do("CLOSE csr");
}
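A minimal driver for the sub above might look like this (a sketch; the DSN, user name, and password are assumptions you will need to adjust):

use strict;
use warnings;
use DBI;

# AutoCommit is on here, which is why the cursor above is declared
# WITH HOLD: it must survive the implicit commit after each statement.
my $dbh = DBI->connect('dbi:Pg:dbname=quotes', 'postgres', '',
                       { AutoCommit => 1, RaiseError => 1 });
run_cursor($dbh);
$dbh->disconnect;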
This gives the following output (postgres 8.3.1, DBD::Pg 2.6.0):
column1 = 1 column2 = 2008-04-16 00:00:00
column1 = 2 column2 = 2008-04-15 00:00:00
column1 = 3 column2 = 2008-04-14 00:00:00
column1 = 4 column2 = 2008-04-13 00:00:00
column1 = 5 column2 = 2008-04-12 00:00:00
column1 = 6 column2 = 2008-04-11 00:00:00
column1 = 7 column2 = 2008-04-10 00:00:00
column1 = 8 column2 = 2008-04-09 00:00:00
column1 = 9 column2 = 2008-04-08 00:00:00
column1 = 10 column2 = 2008-04-07 00:00:00
column1 = 11 column2 = 2008-04-06 00:00:00
column1 = 12 column2 = 2008-04-05 00:00:00
column1 = 13 column2 = 2008-04-04 00:00:00
column1 = 14 column2 = 2008-04-03 00:00:00
column1 = 15 column2 = 2008-04-02 00:00:00
column1 = 16 column2 = 2008-04-01 00:00:00
column1 = 17 column2 = 2008-03-31 00:00:00
column1 = 18 column2 = 2008-03-30 00:00:00
column1 = 19 column2 = 2008-03-29 00:00:00
column1 = 20 column2 = 2008-03-28 00:00:00
column1 = 21 column2 = 2008-03-27 00:00:00
column1 = 22 column2 = 2008-03-26 00:00:00
column1 = 23 column2 = 2008-03-25 00:00:00
column1 = 24 column2 = 2008-03-24 00:00:00
column1 = 25 column2 = 2008-03-23 00:00:00
(edit) Such a generated counter table can also be used to join a real table's aggregates against (below, against a date column; you'll have to replace 'your_table' with your own table and column names):
select
    to_char(date_table.backcount, 'FMDay'),
    to_char(date_table.backcount, 'FMDD FMMon FMYYYY'),
    sum(case when yt is null then 0 else 1 end)
from (
    select date_trunc('day', current_timestamp) - x * interval '1 day' as backcount
    from generate_series(0, (select cast(extract(doy from cast('20081231' as date)) as integer))) as f(x)  -- count back one year
) as date_table
left join your_table yt
    on (date_trunc('day', yt.date::timestamptz) = date_table.backcount)
group by date_table.backcount
order by date_table.backcount desc
Re: PostgreSQL cursors with Perl.
by roboticus (Chancellor) on Apr 17, 2008 at 16:31 UTC
atemerev:
Usually, large resultsets scream "you're not using the server enough!" You may want to look over the operators SQL offers before you use PostgreSQL as a mere data server. Since the server is iterating over the records anyway, it's usually a good tradeoff to have it do your aggregation work as well, so it can (hopefully) return much smaller datasets.
Remember, with today's CPUs being so blindingly fast, you can do an awful lot of calculation in the time it takes to transmit a packet over the network. Obviously, I don't know just what sorts of operations you're performing on your data, so this may be inapplicable to your situation.
For example, if you're interested in the low, high and average quotes for various stocks on a selection of days, you could use:
select TickerName, to_char(quoteDate, 'YYYYMMDD') as quoteDay,
       min(quote), max(quote), avg(quote)
from quotes
where quoteDate between ? and ?
group by TickerName, to_char(quoteDate, 'YYYYMMDD')
order by TickerName, quoteDay
So if you have a dozen quotes a day for each of the stocks of interest, this query would return far less data than if you retrieved all the detail and computed the min/max/average in Perl.
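For completeness, a sketch of running that aggregate through DBI with placeholders ($dbh is a connected DBI handle as in the cursor example above; table and column names are the assumed ones from the query):

my $sth = $dbh->prepare(q{
    select TickerName, to_char(quoteDate, 'YYYYMMDD') as quoteDay,
           min(quote), max(quote), avg(quote)
    from quotes
    where quoteDate between ? and ?
    group by TickerName, to_char(quoteDate, 'YYYYMMDD')
    order by TickerName, quoteDay
});
$sth->execute('2008-04-01', '2008-04-17');
while (my ($ticker, $day, $min, $max, $avg) = $sth->fetchrow_array) {
    printf "%s %s min=%s max=%s avg=%s\n", $ticker, $day, $min, $max, $avg;
}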
...roboticus
Re: PostgreSQL cursors with Perl.
by dvryaboy (Sexton) on Apr 17, 2008 at 18:38 UTC
First, you can probably offload a lot of the aggregations you are doing onto the database, which will likely do them faster. But that's neither here nor there.
If you are having trouble getting to the right data ranges fast enough, and your queries are almost always for small sub-ranges of 'timestamp' (which, I assume, is indexed), look into range partitioning. There is an explanation of how it works here: http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html
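A minimal sketch of the inheritance-based partitioning that page describes, using the quotes schema from earlier in the thread (the table name, column names, and boundary values are illustrative):

create table quotes (
    ts  bigint,   -- epoch milliseconds
    bid numeric,
    ask numeric
);

-- One child table per month, with a CHECK constraint on the range:
create table quotes_2008_04 (
    check ( ts >= 1207008000000 and ts < 1209600000000 )
) inherits (quotes);

create index quotes_2008_04_ts_idx on quotes_2008_04 (ts);

-- With constraint_exclusion = on, a query against 'quotes' that
-- filters on ts only scans the child tables whose CHECK constraints
-- can match.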
Also, if you are versed in C, and PostgreSQL doesn't do the kind of aggregations you need out of the box, you should know that it supports user-defined aggregate functions: http://developer.postgresql.org/pgdocs/postgres/xaggr.html
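Simple aggregates don't even need C; a toy sketch in plain SQL (the names are hypothetical; this builds a 'last non-null value' aggregate):

-- State transition function: keep the new value unless it is null.
create function last_sfunc(anyelement, anyelement) returns anyelement
    as 'select coalesce($2, $1)'
    language sql immutable;

create aggregate last_nonnull (anyelement) (
    sfunc = last_sfunc,
    stype = anyelement
);

-- Usage (relies on the subquery's ordering):
-- select last_nonnull(bid) from (select bid from quotes order by ts) s;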
Good luck
Re: PostgreSQL cursors with Perl.
by ctilmes (Vicar) on Apr 17, 2008 at 13:46 UTC
I used to keep it in plain text files, but now I'd like to run some simple queries against it. Hence, I need a database.
What type of queries? What type of data structures/schema are you using?
FWIW, I've found BerkeleyDB very nice for simple data structures and very fast local queries. (I also find PostgreSQL very nice for more complicated data structures and queries...)
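A sketch of the BerkeleyDB approach via the core DB_File module, using a BTree keyed on zero-padded timestamps so that range scans come out in time order (the file name, key width, and record layout are assumptions):

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

my %quotes;
my $db = tie %quotes, 'DB_File', 'quotes.db',
    O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "tie: $!";

# Store one record, keyed by zero-padded epoch milliseconds:
$quotes{ sprintf '%020d', 1207008000000 } = '1.5803 1.5805 1000000 2000000';

# Range scan: position at the first key >= start, walk until end.
my ($start, $end) = (sprintf('%020d', 1207008000000),
                     sprintf('%020d', 1209600000000));
my ($k, $v) = ($start, '');
for (my $st = $db->seq($k, $v, R_CURSOR);
     $st == 0 && $k le $end;
     $st = $db->seq($k, $v, R_NEXT)) {
    print "$k => $v\n";
}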
The schema looks like:
id         => int primary key,
timestamp  => bigint,   # epoch timestamp in milliseconds
bid        => decimal,
ask        => decimal,
bid_amount => int,
ask_amount => int,
data       => varchar   # lots of additional data that need not
                        # be indexed or normalized
The queries look like:
SELECT * FROM table WHERE timestamp > ? AND timestamp < ?
Or:
SELECT MAX(bid), MIN(bid), MAX(ask), MIN(ask) FROM table
WHERE timestamp > ? AND timestamp < ?
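For reference, that schema as PostgreSQL DDL, with an index to keep the range queries above off a sequential scan (a sketch; the names are assumptions):

create table quotes (
    id          serial primary key,
    ts          bigint not null,   -- epoch timestamp in milliseconds
    bid         numeric,
    ask         numeric,
    bid_amount  integer,
    ask_amount  integer,
    data        varchar
);

create index quotes_ts_idx on quotes (ts);

-- Both example queries then become index range scans:
-- select ... from quotes where ts > ? and ts < ?;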
I don't know if SEC will work for 15 GB of files, but SEC monitors and parses large numbers of log files simultaneously and very efficiently. HTH.
http://www.estpak.ee/~risto/sec/
Thanks for the link. I probably won't use it for this particular task, but it's definitely worth looking at.
Re: PostgreSQL cursors with Perl.
by andreas1234567 (Vicar) on Apr 17, 2008 at 17:41 UTC
I tried MySQL, but 15 GB seems to be too much for it once the indices are in place.
I have seen MySQL handle data sets ten times that size gracefully. What were your findings?
--
When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do. [1]
About 4 seconds for SELECT COUNT(*) for a single day of data (about 100,000 rows). That's the worst case; query caching seems to be working, so repeated (or even slightly changed) queries of the same type take a lot less time.
Although the specifics are probably moot now that you've already moved to a different database, you should _always_ look into tuning your database when queries seem to run longer than they should.
I don't know enough about your database to say exactly what might have been going wrong, but I'd look at the following (examples below):
- Check which indexes are / aren't being used (in MySQL and PostgreSQL, 'EXPLAIN'; in Oracle, 'EXPLAIN PLAN'). For certain types of queries a full table scan can be more efficient, but not when you're doing a count(*).
- Have the database analyze the tables so it can decide when it should / shouldn't use indexes (in MySQL, 'ANALYZE TABLE'; in PostgreSQL, 'ANALYZE'; in Oracle, 'ANALYZE TABLE', plus 'dbms_stats.gather_table_stats' if your keys aren't evenly distributed).
- Make sure enough memory is allocated to the server to keep the necessary indexes pinned in memory and reduce disk I/O.
- Reduce / eliminate the complexity of table joins where possible (or force a specific type of join where appropriate for the situation).
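For instance, against the quotes table discussed above (assumed names), the first two checks might look like:

-- Does the range query use the timestamp index, or a full scan?
EXPLAIN SELECT count(*) FROM quotes
WHERE ts > 1207008000000 AND ts < 1207094400000;

-- Refresh the planner's statistics:
ANALYZE quotes;            -- PostgreSQL
-- ANALYZE TABLE quotes;   -- MySQL equivalent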
I often hear people say that MySQL is "slow", which often translates to sub-optimized or under-powered. Consider the following real-world example, which shows that MySQL works fine with enormous data sets (yes, that's almost 1 billion rows in a single table with 41 columns, 1 primary key, 1 unique key, and 9 non-unique keys):
mysql> select count(*) from t_prod;
+-----------+
| count(*) |
+-----------+
| 952174654 |
+-----------+
1 row in set (0.00 sec)
# ls -lh /var/lib/mysql/prod | grep t_prod
-rw-rw---- 1 mysql mysql 11K Jun 4 2007 t_prod.frm
-rw-rw---- 1 mysql mysql 222G Apr 18 07:35 t_prod.MYD
-rw-rw---- 1 mysql mysql 102G Apr 18 07:35 t_prod.MYI
And I can tell you that operations on that table are very fast as long as the indices are used properly.
Maintenance, however, is a nightmare. For example:
- When there's a power outage or voltage fluctuation and that 222G table's indices become corrupt and need to be rebuilt, the table is locked and inaccessible for the duration.
- When one needs to reduce the data set, the delete operation takes hours, during which the table is again locked and inaccessible. (This can be mitigated by using partitioned tables, as sketched below.)
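A sketch of that mitigation with MySQL 5.1 range partitioning (the column name is illustrative, and note that in 5.1 the partitioning key must be part of every unique key on the table):

-- Partition by day range on a date column, one partition per month:
ALTER TABLE t_prod
PARTITION BY RANGE ( TO_DAYS(quote_date) ) (
    PARTITION p200801 VALUES LESS THAN ( TO_DAYS('2008-02-01') ),
    PARTITION p200802 VALUES LESS THAN ( TO_DAYS('2008-03-01') ),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Dropping a month of data is then nearly instant, with no
-- hours-long table lock:
ALTER TABLE t_prod DROP PARTITION p200801;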
--
When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do. [1]
Re: PostgreSQL cursors with Perl.
by sundialsvc4 (Abbot) on Apr 18, 2008 at 20:27 UTC
At first blush, I think I'd say "go back to your plaintext files." Then see if you can find a more efficient way to manage them.
First of all, we know that all of this data is static: it's never going to change, since the quotes in question have already happened. Since we've already committed 15 gigabytes of disk space to “storing it,” which it is already happily doing, our real objective is ... finding it.
We know that “putting it all in memory” is not an option, simply because at this size all of our “memory” is in fact a disk file, and it's definitely not going to be a disk-file structure that's in any way conducive to what we want to do here. So... that's out.
Okay, so let's explore various alternatives and see which ones might hold some promise. We're not happy with approach #1, either with MySQL or an alternative, so what else might we do?
One idea that sounds very appealing to me is to keep the individual files just as they are, and to build some kind of index structure alongside them. For instance, a useful index might answer: “which files contain quotes for ‘British Pounds Sterling’ for ‘August 1997’?”
If the files are of reasonable size, it might be perfectly acceptable to open a handful of them and scan through them for our answer... once we've efficiently located the files.
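A sketch of building such a side index in Perl and saving it with Storable (the directory layout and record format here are pure assumptions):

use strict;
use warnings;
use Storable qw(nstore retrieve);

# Map "PAIR/YYYY-MM" to the set of files containing matching quotes.
my %index;
for my $file (glob 'quotes/*.txt') {
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        # Assumed record layout: "GBPUSD 1997-08-14T09:30:00 ..."
        my ($pair, $stamp) = split ' ', $line;
        my ($month) = $stamp =~ /^(\d{4}-\d{2})/;
        $index{"$pair/$month"}{$file} = 1 if defined $month;
    }
    close $fh;
}
nstore \%index, 'quotes.idx';

# Later:
# my $idx   = retrieve('quotes.idx');
# my @files = keys %{ $idx->{'GBPUSD/1997-08'} || {} };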
It might also be quite reasonable to take the data files, re-sort them (using a disk-based sort), and exploit this “sorted” property in a good ol' binary search. We could even turn the flat files into Berkeley B-tree files.
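Perl even ships a core module for exactly that binary search: Search::Dict. A sketch, assuming the file is sorted on a leading fixed-width timestamp field (the file name and field layout are assumptions):

use strict;
use warnings;
use Search::Dict;   # core module: binary search within a sorted text file

open my $fh, '<', 'quotes-sorted.txt' or die "open: $!";

# Position the filehandle at the first line whose key is >= the start
# of the range, then read sequentially until the end of the range.
look $fh, '1207008000000';
while (my $line = <$fh>) {
    my ($ts, @fields) = split ' ', $line;
    last if $ts > 1207094400000;
    # ... aggregate @fields here ...
}
close $fh;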
A large collection of known-static files calls for different measures ... more like those used by a physical library than those used by an SQL database. A database might prove to be a useful part of the picture, but it might be best applied in a different way.