Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

(OT) Improving a SQL JOIN statement

by Masem (Monsignor)
on Apr 10, 2001 at 17:38 UTC ( [id://71323]=perlquestion: print w/replies, xml ) Need Help??

Masem has asked for the wisdom of the Perl Monks concerning the following question:

(Yes, I know this is OT for PM, but as was pointed out in this thread, as long as I indicated as such and it's sorta-related to perl, it's not a big problem...)

As part of an update to a website that I run, I'm trying to improve the efficiency of the database SQL statements that I'm using. The db is composed of several tables; the main one that will be searched against most often contains about 1500 items with a primary key. (Data will be added or removed from this table but only when I do said updates). Users of the site can rate such items in a number of catagories, but as to keep track of what users have rated already and any short comments they had made, I've stuck the ratings into a second table, with the item's id, the user's id, a unique id for the rating which is used to pull comments from a file, and other details. As of this time, each item has about 10 or so ratings, making around 10,000 items in the second table; this can increase dramatically with additional users rating the items on the site.

What I would like to do is to be able to search or sort on the averages or counts of these ratings when I join that table with the primary table, using the item id as the joining element. As I noted in my recent post DBI and JOINs, I found initially that JOIN operations were way too slow on my system, but others suggested indexing the tables may help. With other aspects of this same problem, I did do that and found that complex joins (ones that includes two joins) were occuring nice and fast after indexing. However, I cannot seem to improve the speed on joining the main table with the ratings table. Processing only 50 items on the main table took about a minute of CPU time, compared to less than a second for the other join I mentioned above. I can't seem to specify an index for this ratings table that will improve the method of joining.

Presently, I work around this by having fields in the primary table that stores the average and counts for the ratings, which is updated any time a new rating is entered. It works, but it seems mighty inefficient. And to relate this back to perl, the main table updating could be one of those things that I easy forget to do in one of the CGI scripts and therefore I'd like to have a more automatic method.

Can anyone suggest a possible way to improve the database join methods or indexing to get around this, or another way to store such data? Do note that I need each rating as a unique item as to keep track of whom has already rated as well as for commenting, so aggregating the results from the start won't work. I'm using Mysql as the database server in this case.


Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain

Replies are listed 'Best First'.
Re: (OT) Improving a SQL JOIN statement
by jorg (Friar) on Apr 10, 2001 at 18:02 UTC
    For future diagnostics : it would help a lot if people would includ an 'Explain Plan' output whenever there are database performance issues on certain queries. For Mysql just do a 'EXPLAIN select foobar from tableBar a, tableFoo b where a.foo=b.bar' and it will tell you how the mysql optimizer handles this query.

    From what you've told us it looks like mysql is ignoring the indexes :
    - make sure you choose your 'driving' table correctly, if you want to group all the ratings per item then your itemtable will be driving the query. (and thus put 'select blah from itemtable, ratingtable where...' instead of 'select blah from ratingtable, itemtable where...'
    - looking at the mysql docs: 'MySQL can't (yet) use indexes on columns efficiently if they are declared differently' ie make sure you've declared your join columns in the same way, don't declare one as a char(10) and the other as a char(15)
    - you can tell the optimizer to use an index by using hints on which index to use. Mysql docs have all the info on that as well.

    Jorg

    "Do or do not, there is no try" -- Yoda
Re:(Zigster) (OT) Improving a SQL JOIN statement
by zigster (Hermit) on Apr 10, 2001 at 17:48 UTC
    If you have a primary key then the join should be fully qualified. The main reason for slow joins is poor qualification. Do you have to have a DISTINCT in your sql? If so then your join qualification is not complete. You should look to be able to linke each row with exactly one other on the other table. If you have a primary key (and use this as a foriegn key all will be well. Unfortunatly without a lot more information it is difficult to know what to suggest.

    Suffice to say that with the data volumes you are suggesting a two table join should return witnin seconds not minutes so you are right to optimise. Try getting some SQL analysis software and run your sql throught it. Look for full table scans.

    Look in your sql for clauses that will defeat the indexes. Do you use TOUPPER LIKE or any similar sql. Rememeber it is all well having indexes only if the database can use them.

    I HTH
    --

    Zigster

      The this is, it's not a 1 to 1, it's a 1 ( from primary ) to many ( from secondary ), on the order of between 10 and 50 currently, but easily can go upwards.
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: (OT) Improving a SQL JOIN statement
by jeroenes (Priest) on Apr 10, 2001 at 18:08 UTC
    Hmmm... like perl, one can write SQL in dozens of ways. I would go like:
    CREATE VIEW avgrating SELECT itemid, AVG( rating ) AS the_avg FROM ratinglist WHERE catagory = ? GROUP BY itemid ORDER BY AVG( rating ); SELECT itemlist.itemid, the_avg, foo FROM itemlist, avgrating WHERE itemlist.itemid = avgrating.itemid;
    I don't know what that does to performance, neither if mysql supports views. But I guess that the view creation allows the RDBM to improve efficiency of the second SELECT. Moreover, I have understood that the implicit JOINing allows for speed optimization in most RDBMs.

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

Re: (OT) Improving a SQL JOIN statement
by suaveant (Parson) on Apr 10, 2001 at 18:22 UTC
    if mysql, if you have a table with a primary id and another table with ratings, and I wanted to get all the items that have a usability rating of 8 or more, I would do the following...
    Tables like... create table items ( itemid int(11) default 0 not null, other_data text, primary key (itemid); ); create table ratings ( itemid int(11) default 0 not null, ratingtype varchar(32) default '' not null, rating int(11) default 0 not null, primary key (itemid,ratingtype), key (itemid,ratingtype,rating) }; SELECT * FROM items AS I, ratings AS R WHERE R.itemid = I.itemid AND R.ratingtype = 'usability' AND R.rating >= 8;
    and that should work...

    I just used int for rating for simplicity, you'd probably use a decimal...
    Or are there multiple ratings for on item in the same ratingtype? like an entry for each person who rarted it? If that is the case I would suggest just storing the average if you can, or having a table of averages and a table of all entries. but maybe I misunderstand.
                    - Ant

Re: (OT) Improving a SQL JOIN statement
by Masem (Monsignor) on Apr 10, 2001 at 17:54 UTC
    As suggested by others, here's the typical SQL statement that I'm using for the joins, in case this itself may be inefficient:
    SELECT itemid, AVG( ratinglist.rating ) FROM itemlist LEFT JOIN rating +list USING (itemid) WHERE ratinglist.catagory = ? ORDER BY AVG( ratinglist.rating )

    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      It looks like you are missing a 'GROUP BY' clause. But my question is why are you even selecting from the itemlist table? If every itemid in ratinglist is in itemlist (Does your database do foreign key constraints?), then there's no reason to join to item list. And that alone could be killing your query because there is no really good way for a SQL optimizer to handle that query.

      Just select from ratinglist without the join, try the select once with just an index on ratinglist.itemid, once with an index on category, and once with just a composite index on categoryid, itemid (this seems to be the ideal index for this select, but might be unneccessary). Or create all the indexes, do an explain plan, and see which index is selected.

      For the composite index, you might have to include category in both the select and the group by clause. Then again, the order by may be killing your query in which case there's not a whole lot SQL wise you can do about it, though maybe from a database tuning perspective you can.

      Update: In the case that you do need some fields from itemlist, I'd consider making it a separate sql statement ('select ..stuff.. where itemid = ?') rather than a join, it may or may not improve things.

        I've omitted the SQL portions that grab several other fields from the item table at the same time as sorting on the second table, as to prepare all this for perl and HTML output. I know that if I just wanted the average, I don't even need to JOIN.
        Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: (OT) Improving a SQL JOIN statement
by Rhandom (Curate) on Apr 10, 2001 at 19:44 UTC
    Looking at the SQL, it looks like MySQL. But if it is or isn't there are somethings that are important to consider. Databases such as Oracle perform "WHERE" clauses from reverse. Often in my attempts to limit the number of records I would put the most limiting first and the most painful last. This was wrong (actually, under newer versions of Oracle, I think it may not matter as their optimizer takes care of this). The other thing to becareful of is to not do anything in your where clause that manipulates a column which is indexed (such as using like, translating a date, or adding numbers). This will prevent it from using the index. Lastly, do you need to do the "ORDER BY". The database can begin handing back results as soon as it finds one if you haven't done an "ORDER BY". These are all small things (well sometimes they are huge), but they can add up to poor performance.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://71323]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (7)
As of 2024-04-18 11:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found