atemerev has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I have a fairly modest dataset of financial data (high-frequency currency quotes collected over several years), around 15 GB in size. I used to keep it in plain text files, but now I'd like to run some simple queries against it. Hence, I need a database. I tried MySQL, but 15 GB seems to be too much for it once the indices are in place, so I switched to PostgreSQL.
Now to the question. The usual scenario for this database is going to be 'give me the whole set of quotes for a certain few days, for later aggregation'. This produces a resultset of hundreds of megabytes, and it's obvious that I don't want to pull it all into memory at once for every aggregation request (even with caching, there can be several requests per minute). So, I need a cursor. And guess what? Cursors don't seem to be supported in DBD::Pg at all!
Can you enlighten me on the alternatives I have in this situation? Figure out how to use PL/Perl? Use another database? Drop the database altogether and stick to my beloved plain text data files? Something else? I'd really appreciate a solution.
Re: PostgreSQL cursors with Perl.
by erix (Prior) on Apr 17, 2008 at 14:00 UTC
Is there a reason you cannot use the example as documented in the DBD::Pg docs? Maybe you are using old versions of the server or driver?
The example in the DBD::Pg docs seems to run out of the box. Below I used the example from the documentation and added an actual SQL statement to get it running.
sub run_cursor {
    my ($dbh) = @_;
    my $sql = "
        select
              x                                    as column1
            , current_date - x * interval '1 day'  as column2
        from generate_series(1,25) f(x)
    ";
    # WITH HOLD is not needed if AutoCommit is off
    $dbh->do("DECLARE csr CURSOR WITH HOLD FOR $sql");
    while (1) {
        my $sth = $dbh->prepare("fetch 10 from csr");
        $sth->execute;
        last if 0 == $sth->rows;
        while (my $row = $sth->fetchrow_arrayref) {
            for (my $i = 0; $i < $sth->{NUM_OF_FIELDS}; $i++) {
                print $sth->{NAME_lc}->[$i], " = ", $row->[$i], "\t";
            }
            print "\n";
        }
    }
    $dbh->do("CLOSE csr");
}
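A minimal driver for the sub above might look like this (a sketch; the DSN, user name, and password are assumptions you will need to adjust):

use strict;
use warnings;
use DBI;

# AutoCommit is on here, which is why the cursor above is declared
# WITH HOLD: it must survive the implicit commit after each statement.
my $dbh = DBI->connect('dbi:Pg:dbname=quotes', 'postgres', '',
                       { AutoCommit => 1, RaiseError => 1 });
run_cursor($dbh);
$dbh->disconnect;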
This gives the following output (postgres 8.3.1, DBD::Pg 2.6.0):
column1 = 1 column2 = 2008-04-16 00:00:00
column1 = 2 column2 = 2008-04-15 00:00:00
column1 = 3 column2 = 2008-04-14 00:00:00
column1 = 4 column2 = 2008-04-13 00:00:00
column1 = 5 column2 = 2008-04-12 00:00:00
column1 = 6 column2 = 2008-04-11 00:00:00
column1 = 7 column2 = 2008-04-10 00:00:00
column1 = 8 column2 = 2008-04-09 00:00:00
column1 = 9 column2 = 2008-04-08 00:00:00
column1 = 10 column2 = 2008-04-07 00:00:00
column1 = 11 column2 = 2008-04-06 00:00:00
column1 = 12 column2 = 2008-04-05 00:00:00
column1 = 13 column2 = 2008-04-04 00:00:00
column1 = 14 column2 = 2008-04-03 00:00:00
column1 = 15 column2 = 2008-04-02 00:00:00
column1 = 16 column2 = 2008-04-01 00:00:00
column1 = 17 column2 = 2008-03-31 00:00:00
column1 = 18 column2 = 2008-03-30 00:00:00
column1 = 19 column2 = 2008-03-29 00:00:00
column1 = 20 column2 = 2008-03-28 00:00:00
column1 = 21 column2 = 2008-03-27 00:00:00
column1 = 22 column2 = 2008-03-26 00:00:00
column1 = 23 column2 = 2008-03-25 00:00:00
column1 = 24 column2 = 2008-03-24 00:00:00
column1 = 25 column2 = 2008-03-23 00:00:00
(edit) Such a generated counter table can also be used to join a real table's aggregates against (below, against a date column; you'll have to replace 'your_table' with your own table and column names):
select
    to_char(date_table.backcount, 'FMDay'),
    to_char(date_table.backcount, 'FMDD FMMon FMYYYY'),
    sum(case when yt is null then 0 else 1 end)
from (
    select date_trunc('day', current_timestamp) - x * interval '1 day' as backcount
    from generate_series(0, (select cast(extract(doy from cast('20081231' as date)) as integer))) as f(x)  -- count back one year
) as date_table
left join your_table yt
    on (date_trunc('day', yt.date::timestamptz) = date_table.backcount)
group by date_table.backcount
order by date_table.backcount desc
Re: PostgreSQL cursors with Perl.
by roboticus (Chancellor) on Apr 17, 2008 at 16:31 UTC
atemerev:
Usually, large resultsets scream "you're not using the server enough!" You may want to look over the operators SQL offers before you use PostgreSQL as a mere data server. Since the server is iterating over the records anyway, it's usually a good tradeoff to have it do your aggregation work as well, so it can (hopefully) return much smaller datasets.
Remember, with today's CPUs being so blindingly fast, you can do an awful lot of calculation in the time it takes to transmit a packet over the network. Obviously, I don't know just what sorts of operations you're performing on your data, so this may be inapplicable to your situation.
For example, if you're interested in the low, high and average quotes for various stocks on a selection of days, you could use:
select TickerName, to_char(quoteDate, 'YYYYMMDD') as quoteDay,
       min(quote), max(quote), avg(quote)
from quotes
where quoteDate between ? and ?
group by TickerName, to_char(quoteDate, 'YYYYMMDD')
order by TickerName, quoteDay
So if you have a dozen quotes a day for each of the stocks of interest, this query would return far less data than if you retrieved all the detail and computed the min/max/average in Perl.
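For completeness, a sketch of running that aggregate through DBI with placeholders ($dbh is a connected DBI handle as in the cursor example above; table and column names are the assumed ones from the query):

my $sth = $dbh->prepare(q{
    select TickerName, to_char(quoteDate, 'YYYYMMDD') as quoteDay,
           min(quote), max(quote), avg(quote)
    from quotes
    where quoteDate between ? and ?
    group by TickerName, to_char(quoteDate, 'YYYYMMDD')
    order by TickerName, quoteDay
});
$sth->execute('2008-04-01', '2008-04-17');
while (my ($ticker, $day, $min, $max, $avg) = $sth->fetchrow_array) {
    printf "%s %s min=%s max=%s avg=%s\n", $ticker, $day, $min, $max, $avg;
}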
...roboticus
Re: PostgreSQL cursors with Perl.
by dvryaboy (Sexton) on Apr 17, 2008 at 18:38 UTC
First, you can probably offload a lot of the aggregations you are doing onto the database, which will likely do them faster. But that's neither here nor there.
If you are having trouble getting to the right data ranges fast enough, and your queries are almost always for small sub-ranges of 'timestamp' (which, I assume, is indexed), look into range partitioning. There is an explanation of how it works here: http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html
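A minimal sketch of the inheritance-based partitioning that page describes, using the quotes schema from earlier in the thread (the table name, column names, and boundary values are illustrative):

create table quotes (
    ts  bigint,   -- epoch milliseconds
    bid numeric,
    ask numeric
);

-- One child table per month, with a CHECK constraint on the range:
create table quotes_2008_04 (
    check ( ts >= 1207008000000 and ts < 1209600000000 )
) inherits (quotes);

create index quotes_2008_04_ts_idx on quotes_2008_04 (ts);

-- With constraint_exclusion = on, a query against 'quotes' that
-- filters on ts only scans the child tables whose CHECK constraints
-- can match.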
Also, if you are versed in C, and PostgreSQL doesn't do the kind of aggregations you need out of the box, you should know that it supports user-defined aggregate functions: http://developer.postgresql.org/pgdocs/postgres/xaggr.html
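Simple aggregates don't even need C; a toy sketch in plain SQL (the names are hypothetical; this builds a 'last non-null value' aggregate):

-- State transition function: keep the new value unless it is null.
create function last_sfunc(anyelement, anyelement) returns anyelement
    as 'select coalesce($2, $1)'
    language sql immutable;

create aggregate last_nonnull (anyelement) (
    sfunc = last_sfunc,
    stype = anyelement
);

-- Usage (relies on the subquery's ordering):
-- select last_nonnull(bid) from (select bid from quotes order by ts) s;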
Good luck
Re: PostgreSQL cursors with Perl.
by ctilmes (Vicar) on Apr 17, 2008 at 13:46 UTC
I used to keep it in plain text files, but now I'd like to run some simple queries against it. Hence, I need a database.
What type of queries? What type of data structures/schema are you using?
FWIW, I've found BerkeleyDB very nice for simple data structures and very fast local queries. (I also find PostgreSQL very nice for more complicated data structures and queries...)
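A sketch of the BerkeleyDB approach via the core DB_File module, using a BTree keyed on zero-padded timestamps so that range scans come out in time order (the file name, key width, and record layout are assumptions):

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

my %quotes;
my $db = tie %quotes, 'DB_File', 'quotes.db',
    O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "tie: $!";

# Store one record, keyed by zero-padded epoch milliseconds:
$quotes{ sprintf '%020d', 1207008000000 } = '1.5803 1.5805 1000000 2000000';

# Range scan: position at the first key >= start, walk until end.
my ($start, $end) = (sprintf('%020d', 1207008000000),
                     sprintf('%020d', 1209600000000));
my ($k, $v) = ($start, '');
for (my $st = $db->seq($k, $v, R_CURSOR);
     $st == 0 && $k le $end;
     $st = $db->seq($k, $v, R_NEXT)) {
    print "$k => $v\n";
}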
The schema looks like:
id         => int primary key,
timestamp  => bigint,   # epoch timestamp in milliseconds
bid        => decimal,
ask        => decimal,
bid_amount => int,
ask_amount => int,
data       => varchar   # lots of additional data that need not
                        # be indexed or normalized
The queries look like:
SELECT * FROM table WHERE timestamp > ? AND timestamp < ?
Or:
SELECT MAX(bid), MIN(bid), MAX(ask), MIN(ask) FROM table
WHERE timestamp > ? AND timestamp < ?
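For reference, that schema as PostgreSQL DDL, with an index to keep the range queries above off a sequential scan (a sketch; the names are assumptions):

create table quotes (
    id          serial primary key,
    ts          bigint not null,   -- epoch timestamp in milliseconds
    bid         numeric,
    ask         numeric,
    bid_amount  integer,
    ask_amount  integer,
    data        varchar
);

create index quotes_ts_idx on quotes (ts);

-- Both example queries then become index range scans:
-- select ... from quotes where ts > ? and ts < ?;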
I don't know if SEC will work for 15 GB of files, but SEC monitors and parses large numbers of log files simultaneously and very efficiently. HTH.
http://www.estpak.ee/~risto/sec/
Thanks for the link. I probably won't use it for this particular task, but it's definitely worth looking at.
Re: PostgreSQL cursors with Perl.
by andreas1234567 (Vicar) on Apr 17, 2008 at 17:41 UTC
I tried MySQL, but 15 GB seems to be too much for it once the indices are in place.
I have seen MySQL handle data sets ten times that size gracefully. What were your findings?
--
When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do. [1]
About 4 seconds for SELECT COUNT(*) for a single day of data (about 100,000 rows). That's the worst case; query caching seems to be working, so repeated (or even slightly changed) queries of the same type take a lot less time.
Although the specifics are probably moot now that you've already moved to a different database, you should _always_ look into tuning your database when queries seem to run longer than they should.
I don't know enough about your database to say exactly what might have been going wrong, but I'd look at the following (examples below):
- Check which indexes are / aren't being used (in MySQL and PostgreSQL, 'EXPLAIN'; in Oracle, 'EXPLAIN PLAN'). For certain types of queries a full table scan can be more efficient, but not when you're doing a count(*).
- Have the database analyze the tables so it can decide when it should / shouldn't use indexes (in MySQL, 'ANALYZE TABLE'; in PostgreSQL, 'ANALYZE'; in Oracle, 'ANALYZE TABLE', plus 'dbms_stats.gather_table_stats' if your keys aren't evenly distributed).
- Make sure enough memory is allocated to the server to keep the necessary indexes pinned in memory and reduce disk I/O.
- Reduce / eliminate the complexity of table joins where possible (or force a specific type of join where appropriate for the situation).
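For instance, against the quotes table discussed above (assumed names), the first two checks might look like:

-- Does the range query use the timestamp index, or a full scan?
EXPLAIN SELECT count(*) FROM quotes
WHERE ts > 1207008000000 AND ts < 1207094400000;

-- Refresh the planner's statistics:
ANALYZE quotes;            -- PostgreSQL
-- ANALYZE TABLE quotes;   -- MySQL equivalent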
I often hear people say that MySQL is "slow", which often translates to sub-optimized or under-powered. Consider the following real-world example, which shows that MySQL works fine with enormous data sets (yes, that's almost 1 billion rows in a single table with 41 columns, 1 primary key, 1 unique key, and 9 non-unique keys):
mysql> select count(*) from t_prod;
+-----------+
| count(*) |
+-----------+
| 952174654 |
+-----------+
1 row in set (0.00 sec)
# ls -lh /var/lib/mysql/prod | grep t_prod
-rw-rw---- 1 mysql mysql 11K Jun 4 2007 t_prod.frm
-rw-rw---- 1 mysql mysql 222G Apr 18 07:35 t_prod.MYD
-rw-rw---- 1 mysql mysql 102G Apr 18 07:35 t_prod.MYI
And I can tell you that operations on that table are very fast as long as the indices are used properly.
Maintenance, however, is a nightmare. For example:
- When there's a power outage or voltage fluctuation and that 222G table's indices become corrupt and need to be rebuilt, the table is locked and inaccessible for the duration.
- When one needs to reduce the data set, the delete operation takes hours, during which the table is again locked and inaccessible. (This can be mitigated by using partitioned tables, as sketched below.)
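A sketch of that mitigation with MySQL 5.1 range partitioning (the column name is illustrative, and note that in 5.1 the partitioning key must be part of every unique key on the table):

-- Partition by day range on a date column, one partition per month:
ALTER TABLE t_prod
PARTITION BY RANGE ( TO_DAYS(quote_date) ) (
    PARTITION p200801 VALUES LESS THAN ( TO_DAYS('2008-02-01') ),
    PARTITION p200802 VALUES LESS THAN ( TO_DAYS('2008-03-01') ),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Dropping a month of data is then nearly instant, with no
-- hours-long table lock:
ALTER TABLE t_prod DROP PARTITION p200801;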
--
When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do. [1]
Re: PostgreSQL cursors with Perl.
by sundialsvc4 (Abbot) on Apr 18, 2008 at 20:27 UTC
At first blush, I think I'd say "go back to your plaintext files." Then see if you can find a more efficient way to manage them.
First of all, we know that all of this data is static: it's never going to change, since the quotes in question have already happened. Since we've already committed 15 gigabytes of disk space to “storing it,” which it is already happily doing, our real objective is ... finding it.
We know that “putting it all in memory” is not an option, simply because at this size all of our “memory” is in fact a disk file, and it's definitely not going to be a disk-file structure that's in any way conducive to what we want to do here. So... that's out.
Okay, so let's explore various alternatives and see which ones might hold some promise. We're not happy with approach #1, either with MySQL or an alternative, so what else might we do?
One idea that sounds very appealing to me is to keep the individual files just as they are, and to build some kind of index structure alongside them. For instance, a useful index might answer: “which files contain quotes for ‘British Pounds Sterling’ for ‘August 1997’?”
If the files are of reasonable size, it might be perfectly acceptable to open a handful of them and scan through them for our answer... once we've efficiently located the files.
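A sketch of building such a side index in Perl and saving it with Storable (the directory layout and record format here are pure assumptions):

use strict;
use warnings;
use Storable qw(nstore retrieve);

# Map "PAIR/YYYY-MM" to the set of files containing matching quotes.
my %index;
for my $file (glob 'quotes/*.txt') {
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        # Assumed record layout: "GBPUSD 1997-08-14T09:30:00 ..."
        my ($pair, $stamp) = split ' ', $line;
        my ($month) = $stamp =~ /^(\d{4}-\d{2})/;
        $index{"$pair/$month"}{$file} = 1 if defined $month;
    }
    close $fh;
}
nstore \%index, 'quotes.idx';

# Later:
# my $idx   = retrieve('quotes.idx');
# my @files = keys %{ $idx->{'GBPUSD/1997-08'} || {} };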
It might also be quite reasonable to take the data files, re-sort them (using a disk-based sort), and exploit this “sorted” property in a good ol' binary search. We could even turn the flat files into Berkeley B-tree files.
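Perl even ships a core module for exactly that binary search: Search::Dict. A sketch, assuming the file is sorted on a leading fixed-width timestamp field (the file name and field layout are assumptions):

use strict;
use warnings;
use Search::Dict;   # core module: binary search within a sorted text file

open my $fh, '<', 'quotes-sorted.txt' or die "open: $!";

# Position the filehandle at the first line whose key is >= the start
# of the range, then read sequentially until the end of the range.
look $fh, '1207008000000';
while (my $line = <$fh>) {
    my ($ts, @fields) = split ' ', $line;
    last if $ts > 1207094400000;
    # ... aggregate @fields here ...
}
close $fh;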
A large collection of known-static files calls for different measures ... more like those used by a physical library than those used by an SQL database. A database might prove to be a useful part of the picture, but it might be best applied in a different way.