Re: Help with removing dupes from a string with perl

by jrsimmon (Hermit)
on Jul 09, 2009 at 12:00 UTC


in reply to Help with removing dupes from a string with perl

JavaFan has the right idea. This is more easily done with a properly formatted SQL SELECT statement:

use strict;
use warnings;
use DBI;

my @jobname    = (...);   # the job names you are checking (as in the original code)
my @allJobData = ();

my $dbh  = DBI->connect(...);
my $stmt = "SELECT JOBNAME, STATUS, TIMESTAMP, COMMENT
            FROM YOUR_TABLES
            WHERE JOBNAME = ? AND STATUS = ?
            ORDER BY TIMESTAMP DESC";
my $sth  = $dbh->prepare($stmt);

foreach my $job (@jobname){
    my $rc = $sth->execute($job, "HOLD");
    # either check for an error or let failing on fetch do that for you
    if(my @thisjobData = $sth->fetchrow_array){
        push(@allJobData, @thisjobData);
    }
}

You could actually write the SQL so that it pulls only the most current row for every job name in a single statement, but I don't know the syntax off the top of my head.

Replies are listed 'Best First'.
Re^2: Help with removing dupes from a string with perl
by tilly (Archbishop) on Jul 09, 2009 at 16:02 UTC
    There are several ways to do it, depending on your database. The most portable is:
    SELECT t.jobname
         , t.status
         , t.timestamp
         , t.comment
    FROM your_table t
    JOIN (
        SELECT jobname, MAX(timestamp) as timestamp
        FROM your_table
        GROUP BY jobname
    ) max_timestamp
        ON  max_timestamp.jobname   = t.jobname
        AND max_timestamp.timestamp = t.timestamp
    Of course this assumes that you don't have 2 entries with the same name and the same timestamp. It is usually less efficient than selecting back and filtering outside of the database.

    Google tells me that Sybase doesn't yet support the better SQL 2003 solution of analytic queries.
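    One common form of that analytic solution (shown here for a database that does support window functions, wrapped in DBI like the code above) uses ROW_NUMBER() to rank the rows within each jobname; the table and column names and the status filter are just carried over from the earlier examples and are only assumptions about the real schema:

    my $stmt = q{
        SELECT jobname, status, timestamp, comment
        FROM (
            SELECT jobname, status, timestamp, comment,
                   ROW_NUMBER() OVER (
                       PARTITION BY jobname
                       ORDER BY timestamp DESC
                   ) AS rn
            FROM your_table
            WHERE status = ?
        ) ranked
        WHERE rn = 1                  -- keep only the newest row per jobname
    };
    my $sth = $dbh->prepare($stmt);
    $sth->execute('HOLD');
    my $rows = $sth->fetchall_arrayref;   # one row per jobname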

    Also, a note about your solution: I would strongly recommend ordering on jobname and then timestamp. This will put all of the records for each jobname together, making it easy to figure out which row is first and which you should throw away. Otherwise you need to keep a potentially large hash of which jobs you've seen. (You skipped that step in your code.)
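    A minimal sketch of that order-then-filter approach, reusing the table, column, and status names from the code above (which are themselves placeholders for the real schema): one query pulls every HOLD row ordered by jobname and descending timestamp, and the loop keeps only the first (newest) row it sees for each jobname, so no hash of seen jobs is needed.

    my $stmt = "SELECT JOBNAME, STATUS, TIMESTAMP, COMMENT
                FROM YOUR_TABLES
                WHERE STATUS = ?
                ORDER BY JOBNAME, TIMESTAMP DESC";
    my $sth = $dbh->prepare($stmt);
    $sth->execute("HOLD");

    my $lastjob = '';
    my @allJobData;
    while (my @row = $sth->fetchrow_array) {
        next if $row[0] eq $lastjob;     # older row for a job we already handled
        $lastjob = $row[0];              # rows for each job are adjacent, newest first
        push @allJobData, [ @row ];      # keep the newest row per jobname
    }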

      Heh, I just rolled my own first() and last() aggregators in Postgresql yesterday because I wanted something like this. (Googling for analytic queries, since I was curious whether my solution was in line with SQL 2003's, turned up a somewhat related link that mentioned first() and last() aggregators.)

      The process was a bit more interesting and difficult than I expected. It convinced me that Postgresql doesn't have a general solution for user-defined polymorphic aggregators because it doesn't allow polymorphic user-defined data types (you can't use "anyelement" as part of a "create type" construct -- at least in the versions I was dealing with, 8.2 and 8.3).

      So I was able to create some polymorphic aggregators but can see cases where these approaches would not work (and the first route I tried failed).

      The following test query:

      select first(v), first_nonnull(v), last(v), last_nonnull(v)
      from (
          select NULL::text as v
          union all select 'one' as v
          union all select 'two' as v
      ) as q
      union all
      select first(v), first_nonnull(v), last(v), last_nonnull(v)
      from (
          select 'one'::text as v
          union all select 'two' as v
          union all select NULL as v
      ) as q;

      produces the following results:

       first | first_nonnull | last | last_nonnull
      -------+---------------+------+--------------
             | one           | two  | two
       one   | one           |      | two
      (2 rows)

      - tye        

        Sorry, no. Your solution is nothing like SQL 2003's analytic functions, which come from Oracle. If you want a brief introduction to them, read http://www.orafusion.com/art_anlytc.htm. The syntax is... heinous. To use them effectively you often want to query a subquery. But they are very powerful and much more efficient than older solutions.

        Actually, despite tilly's response, these are exactly like some of what Postgresql calls "window functions" which are exactly the same thing that Oracle calls "analytic functions" (Oracle uses "window functions" to refer to a subset of the "analytic functions").

        One difference between my user-defined aggregate functions above and the standard window functions first_value() and last_value() is that my aggregates can be used as simple aggregates (such as with just "GROUP BY") while first_value() and last_value() must have a window defined (via "OVER"). That is, my aggregates can be used with OVER or without OVER.

        But the main reason I am responding is to note that my third attempt at defining first() and last() was much simpler. It is just like the _nonnull versions except you drop the word 'strict'.

        create function first( one anyelement, two anyelement )
            returns anyelement as $$ begin return one; end $$ language plpgsql;
        create function last( one anyelement, two anyelement )
            returns anyelement as $$ begin return two; end $$ language plpgsql;

        create aggregate first( anyelement )(
            stype = anyelement
          , sfunc = first
        );
        create aggregate last( anyelement )(
            stype = anyelement
          , sfunc = last
        );

        Which makes me wonder what my first try (that didn't work) looked like.

        I'm now working with newer Postgresql that supports window functions. But this version's first_value() doesn't support the standard "IGNORE NULLS" option which looked like the best solution for a problem I was working on today. So I asked google where my implementations were so I could just use them.

        Swapping in these details again made me suspect that the first() and last() implementations could be as simple as the first_nonnull() and last_nonnull() implementations. Testing on this newer version of Postgresql showed that they could be.

        - tye        

      In Sybase you can use a group by and a having clause, even without using any aggregates in the main query.

      So something like

      select   jobname, status, timestamp, comment
      from     the_table
      group by jobname
      having   timestamp = max(timestamp)
      should return something reasonable without the need for a sub-query.

      Michael

        For the record, I hate that "feature". What it means is that if you accidentally leave a field that is in the SELECT out of the GROUP BY, you'll introduce a join. Depending on the query, this join can be very large and very poorly behaved. Particularly if you've left joined to a large table.

        When I worked with Sybase I think I made this mistake every few months (with the left out field from a very large table), and inevitably didn't notice until my query had been grinding away for half an hour and the database was feeling the pressure. The database it usually happened on was used for production purposes, but luckily the consequences were only internally visible, and very few people were affected. Still it was my second greatest annoyance with Sybase. (They've since fixed my greatest annoyance at the time, which was page level locking.)
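        To make that pitfall concrete, here is roughly the kind of statement involved (reusing the_table from the example above; the extra status column is made up for illustration):

        # Intended: one row per group, with every non-aggregated column in the GROUP BY.
        my $intended = q{
            select   jobname, status, max(timestamp)
            from     the_table
            group by jobname, status
        };

        # Accidentally dropping status from the GROUP BY is still legal in
        # Sybase's Transact-SQL extension, but it effectively joins the
        # grouped result back to the underlying rows to fill in status,
        # so the result can balloon to one row per detail row.
        my $oops = q{
            select   jobname, status, max(timestamp)
            from     the_table
            group by jobname
        };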

Re^2: Help with removing dupes from a string with perl
by Herkum (Parson) on Jul 09, 2009 at 13:11 UTC

    The SQL for looking up a unique column will look roughly like this (depending upon your database):

    SELECT DISTINCT(JOBNAME), STATUS, TIMESTAMP, COMMENT
    FROM   YOUR_TABLES
    WHERE  STATUS = ?
    ORDER BY TIMESTAMP DESC

    Now you can remove that loop for looking up individual job names.

      With DB2, using DISTINCT in that manner won't guarantee that you get the most recent result. DISTINCT will be evaluated before the ORDER BY clause and may not find the most recent record first. I'm not sure whether this behavior is standard among databases or not.

      From the db2 doc:
      When values match in the specified columns of two or more rows, the distinct operator retains only one of the duplicate rows in the result data set and discards the others. When duplicate rows are found, only the first row that is read during processing is retained. Because the order of rows is not guaranteed during processing, choose carefully when you select columns for the distinct operation.
      To ensure you get the most recent occurrence, you'd probably need to use a temp table in the statement (at least that's how I've seen it done).
