Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Trimming hash based on element uniqueness

by mhearse (Chaplain)
on Jun 17, 2008 at 21:21 UTC ( #692588=perlquestion: print w/replies, xml ) Need Help??

mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash with several thousand elements. It basically contains items from the slow query log. To avoid duplicate or very similar queries, I'd like to trim down the hash based on the uniqueness of the query element. How can I do this? The query will probably differ only slightly. I guess I could start with an eq match. But that will not get them all. Thanks in advance.
my %sq_ds_uniq; for my $key (keys %sq_ds) { for my $key_uniq (keys %sq_ds_uniq) { ### This line is metacode last if $sq_ds{$key}{query} is like $sq_ds_uniq{$key_uniq}{que +ry}; } $sq_ds_uniq{$key} = $sq_ds{$key}; }
'11102' => { 'user' => 'sql_user', 'rows_sent' => '0', 'rows_exam' => '32348', 'query_time' => '55', 'log_time' => '13:11:13', 'query' => 'select something from somewhere', },

Replies are listed 'Best First'.
Re: Trimming hash based on element uniqueness
by BrowserUk (Pope) on Jun 17, 2008 at 22:16 UTC

    I think you are approaching the problem from the wrong end.

    Trying to determine the similarity of queries by comparing the SQL is problematic. For example, all these queries are textually similar, but produce quite different results sets:

    select thing from tableA; select thing from tableA limit 1000 offset 1000; select count( thing ) from tableA; select thing from tableB;

    Conversely, it not hard to dream up two or more radically different queries that produce identical results sets. Throw in a few joins and sub-selects and you would likely never determine their similarity.

    If your intention is to try and cache the results from long running queries and re-use them, I think you will have an inordinately big task.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      While this is true in the general case, in the concrete case, you usually have all queries generated by a fixed program, and most likely these queries differ only in their interpolated values (unless the application programmer had been smart and used placeholders, but then the whole point would be moot). In such a case, it can still make sense to check the queries for similarity by reversing the interpolation process and parsing the statement back into a placeholder query.

        Okay. I see what you mean. If there are a few dozen or hundred select * from tableX where id == '...'; with only the '...' varying, you might want to reduce them to one generic select and do a local hash lookup instead of issuing a query for each.

        Quite frankly, this is one of those situations where I'd print all the selects to a file, one per line, load them into my editor, sort them. And then step through them manually breaking them into groups. Even with a few thousand to process, doing this manually would probably be far quicker than deriving a heuristic, coding and algorithm and testing it. And, it does seem like a process one would have to repeat frequently enough (or at all for any given application, if you did it right the first time), to consider automating.

        I suppose if this were an application where the queries issued are derived and constructed from user input, then you might need to repeat the process a few times. Even then, doing a sort & group process manually, and then inspecting each group looking for caching opportunities and a heuristic to recognise each one is probably far simpler than trying to come up with a totally generic mechanism.

        Were I tackling this, in addition to the 1 per line list of the queries, I'd also try an obtain/generate a "sample output" from each one. That is, a representation of the output each select produces in terms of the table/field pairing and number of rows produced, rather than the actual values:

        (table.field1, table.field3, table.field7) x 60

        It might then be possible to prefix this information to queries themselves prior to sorting and use that to aid in the grouping. I'd still derive the generic query manually the first few times, and if the caching is successful, the numbers of long running queries should be very quickly reduced by the caching process, which should mean that it becomes redundant to automate it.

        Probably the hardest part would be recognising which queries (at runtime) can be redirected to the cache, but manually deriving a regex or two to recognise the pattern(s) of queries in each group replaced by a generic query, would be fairly easily done manually I think.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Trimming hash based on element uniqueness
by injunjoel (Priest) on Jun 17, 2008 at 21:47 UTC
    Greetings, you might try making the keys of your hash the query, or perhaps a hash of references to your data with the query as the key, that way you can't have duplicates (like an pseudo-index of your larger structure where each element is a reference to the actual data, keyed of course by the query). The point is to construct a key that will collapse your slight variants of your queries, while still being distinct enough to not collapse everything. Have a look at The Uniqueness of hashes. if you have time.

    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: Trimming hash based on element uniqueness
by Anonymous Monk on Jun 17, 2008 at 22:10 UTC
    Your code would not work correctly since the last would only break out of the inner for loop and the line
    $sq_ds_uniq{$key} = $sq_ds{$key};
    would always be executed. I would propose the following untested solution, where the function query_like would take the role of your "is like":
    my %sq_ds_uniq; my @uniq_queries = (); for my $key (keys %sq_ds) { my $query = $sq_ds{$key}{query}; if (!grep {query_like($_, $query)} @uniq_queries) { $sq_ds_uniq{$key} = $sq_ds{$key}; push @uniq_queries, $query; } }
      I just wanted to take responsibility for the last node, sorry, I was not logged in at the time... So if you want to cast a vote or comment, please do it here.
Re: Trimming hash based on element uniqueness
by stark (Pilgrim) on Jun 18, 2008 at 19:26 UTC
    Have a look at the tools mysqldumpslow and mysqlsla - both do a quite good job. And are written in Perl....

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://692588]
Approved by kyle
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2020-10-30 05:50 GMT
Find Nodes?
    Voting Booth?
    My favourite web site is:

    Results (278 votes). Check out past polls.