http://www.perlmonks.org?node_id=153714


in reply to Up for Critique

This has mainly been a Perl discussion, as it should be on the monks, but there have been a few suggestions here to turning off indexing, and what not. So I thought I would throw my two bits in regarding that subject.

I'm going to use a simple data structure here.
CREATE TABLE whois_e ( id_num int(11) DEFAULT '0' NOT NULL, email int, url int, query char(128), sent int(11) DEFAULT '0' NOT NULL, );
Now if you create indexes on id_num, email, url and sent like this
create index whosis_1 on whois_e(id_num); create index whosis_1 on whois_e(email); create index whosis_1 on whois_e(url); create index whosis_1 on whois_e(sent);
and do a query something like this
select * from whois_e where email = '$email' and url = '$url' and sent = '1'
then you aren't going to see a significant amount of speed, in fact things will probably be a bit slower.

Indexes should be designed sparingly and to the applications need. If this is a query I'm going to use often, the index should be set up more like this.
create index whosis_2 on whois_e(email, url, sent);
This will improve this query by a noticeable amount. Also any data file should have a primary key, I do this whether it needs it or not.
CREATE TABLE whois_e ( id_num int(11) unsigned DEFAULT '0' NOT NULL auto_increment primary + key, email int unsigned not null, url int unsigned not null, query char(128), sent int(11) DEFAULT '0' NOT NULL, );
Indexing will speed up a data base incredibly, but it needs to be done right and sparingly. I've seen some programs with indexes on every field, and that just makes two data tables, what's the sense in that? :-)

A table without an index of some sort is just a collection of unordered rows. To find a row or set of rows from such a table, the engine has to examine every single row in the table. Even if it finds it at the top, unless you've put a limit on your query, its going to search through the rest of them. This is wasted time and CPU power. Using an index however the values are sorted and keyed to the table. The engine knows if it finds the answer to stop looking, because it knows anything that follows will not equal the query set. That's a huge advantage.

Looking at your queries and finding some common keys in the "where" area, will help you build suitable indexes for those tables.

If you are going to use a join, any type of join, those fields should always be indexed, otherwise even on small datasets your programs going to crawl.

Now, the disadvantages are in the writing to the files with indexes. Here is where indexes are working against you, because they have to be inserted and updated as well. Some hints on indexing and loading tables follow, in no particular order, just off the top of my head.

Bulk loading is faster and better on indexed tables. Use LOAD DATA instead of INSERT when ever you can. Index flushing happens less often and the server needs to pares and interpret one statement, not several

If you must use INSERT then use the form that allows multiple rows to be specified in a single insert statement.
INSERT into whois_e VALUES(...), (...),(...), (...).....;
The more rows you can specify in the statement the better.

One trick is to create the table, load all your data and then add the indexes, this comes in handy. If you can't do that, drop the indexes, load the data and then recreated them. Do this only if you can separate your program into a single load area, and then the query area. Since your program is structured the way it is, then this could be an option, on other programs it may not be, and probably won't be, because other users are on the database at the same time, and you'll get a lot of phone calls from angry accountants and beavers. The main question to ask yourself is "can I do this only once" if the answer is no, then don't do it, it will slow your program down.

Hope this is some use to you.. there's a lot of good stuff in this node and some of it I'm writing down and playing with, so thanks to all the rest of you..

That's my two bits... what?... I'm over extended?.. damn.

---UPDATE --

Something I just read seemed applicatble thought I would post it in here. According to the O'reily Mastering Algorithms with Perl book -> " page: 26 Efficincy Tip: Hashes Versus Arrays: It's about 30% faster to store data in an array than in a hash. It's about 20% faster to retrieve data from an array than from a hash". -- end update --

Glenn H.