Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Please provide a hint for me to continue with the rest of my program

by pooyan (Initiate)
on Apr 23, 2013 at 20:37 UTC ( #1030233=perlquestion: print w/replies, xml ) Need Help??
pooyan has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I hope you are doing very well.

This is a typical Perl question that is being passed to candidates by recruiters and I cannot yet answer it despite having seen it 3 times. At this point I am determined to solve the problem correctly and grow up because of it. Please give me some hints, just hints as to how best approach the solution.

I am going to post what I know and what I don't know first, and then post the test instructions and the associated module, I think this is better:

I know how to use the subroutine from the perl module and know that need to connect to a database or to query using dbConnect and query. I have created my own mySQL database from XAMPP with a sample table based on the code. This is a one-column table that gets email addresses and wants to use a Perl program to update another table having count by domain. How do I process and output a count without using complex SQL queries/subroutines (as requested by the test instructions posted below - it says use Perl to process it)? I can use a select * to give me an array of all e-mails, I can go through each record and extract domain so to get an array of domains, then I can sort the array. I think I can go through each record and save it in a temp val (say temp = ""), put this in a hash table (whose key is domain and value is count). In the next pass, I check another record if this is == to temp (so if it is again, update the hashtable with using the key and by incrementing the value - which holds the count. Finally, then I would use another loop and insert statements to translate the hash in to the domain counting table in DB. How correct is this approach?

Another problem I have is that, it says find daily count. since there is no date column in the original table, I am not sure how to just bring up the emails added on a particular day.

For top 50, I just use my sorted hash in descending order and restrict my loop to 50 passes and print the values.

Given a table 'mailing': CREATE TABLE mailing ( addr VARCHAR(255) NOT NULL ); The mailing table will initially be empty. New addresses will be adde +d on a daily basis. It is expected that the table will store at leas +t 10,000,000 email addresses and 100,000 domains. Write a perl script that updates another table which holds a daily cou +nt of email addresses by their domain name. Use this table to report the top 50 domains by count sorted by percent +age growth of the last 30 days compared to the total. ** NOTE ** - You MUST use the provided for all database interaction, and yo +u must use it as it is ( cannot be modified except for the conne +ction settings). - The original mailing table should not be modified. - All processing must be done in Perl (eg. no complex queries or sub-q +ueries) - Submit a compressed file(tar/zip) with the files required to run you +r script.
Here is the
package GUI::DB; use strict; use DBI; use vars qw(@ISA @EXPORT); use Exporter; @ISA = qw(Exporter); @EXPORT = qw(dbConnect query); # # dbConnect - connect to the database, get the database handle # sub dbConnect { # Read database settings from config file: my $dsn = "DBI:mysql:database=test"; my $dbh = DBI->connect( $dsn, 'root', '', { RaiseError => 1 } ); return $dbh; } # # query - execute a query with parameters # query($dbh, $sql, @bindValues) # sub query { my $dbh = shift; my $sql = shift; my @bindValues = @_; # 0 or serveral parameters my @returnData = (); # issue query my $sth = $dbh->prepare($sql); if ( @bindValues ) { $sth->execute(@bindValues); } else { $sth->execute(); } if ( $sql =~ m/^select/i ) { while ( my $row = $sth->fetchrow_hashref ) { push @returnData, $row; } } # finish the sql statement $sth->finish(); return @returnData; } __END__

Replies are listed 'Best First'.
Re: Please provide a hint for me to continue with the rest of my program
by NetWallah (Canon) on Apr 24, 2013 at 04:16 UTC
    For greatest efficiency, and minimal resource usage, make mysql do the heavy lifting. Get the domain counts by:
    my @domain_counts = GUI::DB::query ( $dbh, "SELECT date(now()), substr(addr,locate('@',addr)+1) as maildomain +, count (*) as mailcount FROM mailing GROUP BY maildomain ORDER BY mailcount DESC" );
    After that, the storage, and top-50 selection becomes easy. Here is the relevant SQL...
    INSERT INTO dailydomaincounts (maldate,maildomain,mailcount) VALUES +(?,?,?); SELECT SUM(mailcount) as TOTAL from dailymailcounts WHERE maildate >= date(now()) - INTERVAL(30 days); SELECT maildomain, sum(mailcount) * 100.0 / $total as monthlymailpct + from dailymailcounts WHERE maildate >= date(now()) - INTERVAL(30 days) GROUP BY maildomain ORDER BY monthlymailpct DESC LIMIT 50;
    Of course, the e-mail splitting at the first '@' is not the worlds most robust implementation, but given the other artificial constraints imposed, it should suffice.
    SQL not tested.

    SQL "query complexity" is a rather vague, subjective term - I do not consider the above queries to be "complex".

                 "I'm fairly sure if they took porn off the Internet, there'd only be one website left, and it'd be called 'Bring Back the Porn!'"
            -- Dr. Cox, Scrubs

      About daily counts & NetWallah's proposal: Alas the (rather arbitrary) problem constraints include a table that has no time/date- it only has one column "addr" for email addresses. I assume that it grows continually.

      So for the daily count, have the program run once a day- the same program that counts the # of domains can also count how many rows are in the table, and store it (in a file, or in the database, whatever is allowed). Then on the next run it can subtract to find the number of new records. That matches the Anonymous Monk's earlier suggestion.

      In fact, you could make the program more efficient- though more complicated- by also storing the domain count, and re-reading it on startup, then skipping over the old records on the next run. Then you only need to add up the new domains, and add those to the old totals. That will only work if addresses are returned in the order they are created! If you go that way, document that assumption!

        The problem statement says:
        New addresses will be added on a daily basis.

        I assumed that to mean that the table will be emptied daily, and only new addresses would be added.

        They said the table would be clean initially. (If the person putting forth the constraints can impose conditions, so can I).

        With this assumption, no additional tracking is necessary.

                     "I'm fairly sure if they took porn off the Internet, there'd only be one website left, and it'd be called 'Bring Back the Porn!'"
                -- Dr. Cox, Scrubs

      Hi Yari. Thanks for your input, it is definitely a great use of SQL and more learning; however, the question say all processing needs to be done in Perl and that complex SQL and subqueries are not allowed. I think we need to get @returndata and go through it with loops in Perl.
        Hmm, you may have my answer mixed up with another. What I meant when I said "It could be simpler than you have it" is that, after you write your program as you initially described it (which should work), you can simplify it by removing some unnecessary steps, and it will still work, and be easier to understand.

        And you don't even need to worry about these simplifications now- write the code and you might even find them as you're writing it. You have a plan, make something that works first. Then post the code here and get some critiques, if you like.

Re: Please provide a hint for me to continue with the rest of my program
by Anonymous Monk on Apr 23, 2013 at 21:37 UTC

    One way of creating an implicit date range is to have cron run the script every day at midnight. The difference between yesterday's stats and today's stats is the amount done today.

    If you have a monotonically increasing ID on the entries you are counting, then you could also save the highest ID number at midnight each night, so that you know what day each ID was created.

Re: Please provide a hint for me to continue with the rest of my program
by Yary (Pilgrim) on Apr 24, 2013 at 12:30 UTC
    Your plan for counting domains will work, but it has some extra steps. It could be simpler than you have it. There's a way to make the count without creating an array for the domains. Hope that's enough of a hint, without being too obvious!

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1030233]
Approved by graff
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (6)
As of 2018-02-24 18:16 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (310 votes). Check out past polls.