mascip has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
this message is fairly long; just read the bold text if you want a rough idea of what it's about.
I'm fairly new to programming and I've read a lot in the last 9 months. It's good, I've got lots of theoretical ideas. But now that I've started using them, I realize that it's not so simple to make design choices.
As an indication, the last book I read (and found very inspiring) is "Growing Object-Oriented Software, Guided by Tests", about TDD (Test-Driven Development); those who have read it will know what my background is.
I'm going to present a very simple program that I'm trying to write now, and ask questions about design. I have no questions about the implementation (which CPAN modules to use, etc.); I'm wondering what structure this project should have (in terms of objects and roles, and their relationships).
It is quite a small project, so I could just fit it all into one big dirty script. But I've decided to make it a design exercise, in order to assimilate and get experience with the ideas I've read about.
I would really enjoy discussing this with other people, as I feel quite discombobulated right now: there are so many ways of doing the same thing!
With this program, I want to:
- find directories with spreadsheets of interest
- read all spreadsheets from a directory simultaneously
- calculate SOME_STUFF($num_line) on each line
- analyze and display the results
As I said, it's a fairly simple program. Maybe that's why there are so many design possibilities. I'm trying to make it as clean and elegant as possible, mostly following ideas from the TDD book I read.
A bit more information:
- I have maybe 1000 or 2000 such spreadsheets, so I wouldn't do this by hand.
- For each line, to calculate SOME_STUFF($num_line), I need information from all of the spreadsheets in one directory, and I need information from the lines around it: $num_line-2, $num_line-1 and $num_line+1 (in all spreadsheets from the same directory).
- The lines in each spreadsheet correspond: they are data at one point in time, and the time data is the same for all spreadsheets within a directory (obviously, I will test this).
- The spreadsheets are fairly big (1 MB), so I COULD read all of the spreadsheets FIRST, and then process the results. Easy design solution. But that would take up lots of memory, and thus probably be slow. Maybe it's not that much memory in fact?
But well, anyway, I would like to try to process the data "on the go" (while reading it) if possible, as it represents a kind of "design challenge".
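(For illustration, here is a minimal sketch of the kind of sliding window this implies; it is a guess at one possible shape, not a recommendation. The file name and the some_stuff() sub are made up.)

    use strict;
    use warnings;

    # Stream one file, keeping only a four-line window
    # [ $num_line-2, $num_line-1, $num_line, $num_line+1 ],
    # so the whole file is never held in memory at once.
    sub some_stuff {
        my @lines = @_;    # hypothetical calculation on the window
        print "processing line: $lines[2]\n";
    }

    open my $fh, '<', 'data.csv' or die "data.csv: $!";   # made-up file name
    my @window;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @window, $line;
        shift @window if @window > 4;
        some_stuff(@window) if @window == 4;   # "current" line is $window[2]
    }
    close $fh;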
I started by implementing a very simple program, to which I will add features one by one (which people call "incremental programming").
The first feature I implemented was reading information from a spreadsheet. Then I added a few more things: skipping the header lines, reading only certain lines and columns, renaming the fields (I personalize them).
I still haven't implemented any of the calculation stuff, but I already feel like it's time for some refactoring.
At the moment, I have a Main.pm object which does everything; I want to make it more lightweight, and create objects or roles to take on some of its responsibilities (I would like to follow the "one responsibility per class" design principle).
I had two naive ideas on how to do this:
- create a Read::My::Spreadsheet::Files role, which would encapsulate all the sugar CSV-reading methods
- create a My::Spreadsheet::Reader class, which would enable me to easily return the result from each spreadsheet one by one. But that wouldn't enable me to process the data "on the go".
I've done both, in fact, just to try and play. But I still don't know how either will fit with what I'm doing next.
The next feature I want to implement is calculating SOME_STUFF() for one particular directory.
I'm thinking of creating a My::SOME_STUFF::Calculator object (or maybe a role?) to do the calculations.
If I had a My::Spreadsheet::Reader class, I would first read and gather all the data for a directory in Main.pm, and then calculate everything.
But if I want to do it "on the go", I don't really know how to do it. Should My::SOME_STUFF::Calculator "do" the role Read::My::Spreadsheet::Files, and thus use sugar spreadsheet-reading methods to make the job easier (and more elegant)? This would mean that I would have a My::SOME_STUFF::Calculator object which takes "messages" from a role that "bridges" with the realm of spreadsheets. Right?
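(If it helps, the Moose shape of that idea would be roughly the following; the method names are invented and the bodies are stubs — a sketch of the composition, nothing more.)

    package Read::My::Spreadsheet::Files;
    use Moose::Role;

    # sugar spreadsheet-reading methods would live here
    sub get_next_line { ... }    # stub

    package My::SOME_STUFF::Calculator;
    use Moose;
    with 'Read::My::Spreadsheet::Files';   # the Calculator "does" the role

    sub calculate_for_directory { ... }    # stub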
Later I will have to process several directories. I'm guessing that I will search for them in Main.pm (or calculate_some_stuff.pl), and then use a My::SOME_STUFF::Calculator object for each. And finally, for analyzing and displaying results, I could put the methods in a My::SOME_STUFF::Result::Display object.
It is a simple project, but it takes long to explain.
Please give me some feedback.
I guess I will learn by playing with the code and trying different things, but asking experienced people can help a lot too. Hopefully different ideas will be shared, and an interesting discussion on design can emerge, so that more people learn together.
Re: Design elegance : How to best design this simple program ?
by afoken (Chancellor) on Jun 18, 2012 at 16:45 UTC
I have maybe 1000 or 2000 [...] spreadsheets
Not exactly what you asked for, but perhaps you should get rid of the spreadsheets and use an RDBMS instead.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Design elegance : How to best design this simple program ? (modular)
by tye (Sage) on Jun 18, 2012 at 18:49 UTC
But well, anyway, I would like to try to process the data "on the go" (while reading it) if possible, as it represents a kind of "design challenge".
So, you realize that processing the data "on the go" will make for a complicated design. For me, that means "bad design" not "design challenge".
You don't get better at design by flexing your design muscles trying to do a really good job at figuring out a design for an overly complex problem.
You get better at design by learning how to recognize the most simple and cohesive parts of a problem and how to factor them out so that the parts left to be designed are fewer and/or smaller until you get to the point that everything is simple to design.
When I see something that is complicated to design, my primary reaction is to assume that things have been factored badly, and that reconsidering how some things have been factored is the best next step, because factoring them "better" may leave me with a much simpler design problem.
Complex design problems that don't get factored into a bunch of simple design problems with minimal interconnectedness turn into complex designs and complex software and repeated failures.
That is, the correct design step when given the design challenge you gave to yourself is to say "Doing calculations 'on the fly' makes things complicated and makes it harder to separate concerns and so makes for less modular design. Is there a fundamental reason that calculations have to be done 'on the fly'? If not, we should just drop that idea from the design."
Congratulations, you will have successfully solved your design challenge when you drop it.
Hi all, thank you for the interesting, helpful answers.
I should have said what it is about before, but it wouldn't have helped with my problem, nor with my learning process. It's a program for processing data from experiments. I got the data from biologists who grow phytoplankton in controlled environments (I've got data from more than 100 experiments). There's data from different sensors, each of them producing one or more measurements. And I need to do fairly complex operations to determine parameters for mathematical models of population dynamics.
So first, I need to preprocess, in order to get "aggregated information" for each line (each line corresponds to a given moment/timestamp). Then I will estimate parameters. But I'm only speaking about the information aggregation here. Estimating parameters will come later, and I'm not sure I'll do it with Perl. I'll see later; that's another question.
Now, back to my design questions:
1. I don't make the spreadsheets myself. But I could fit all these data into a database, to get help from MySQL queries. I hadn't thought about this. What I don't like about it is that
- I need to do one extra operation (organise and transfer my data to the database)
- I need to learn how to use MySQL again. It's not complex, I know, but I haven't used it in 6 years now.
And I'm thinking that what MySQL can do, Perl can do. Am I wrong in thinking this?
2. I don't use globs for finding files, but the File::Find::Rule CPAN module. Are globs better?
3. Thank you for the inspiring comments:
- Design is about learning to recognize the most simple and cohesive parts of a problem. When the parts are smaller, everything is simpler to design.
- Write tests to... test your understanding of the problem. => That's what I discovered recently by writing my first tests before coding. Thank you for the phrasing, it was good to read.
- First, write a fast and dirty proof of concept (well, after having written a test for the feature I'm implementing): this is the best way to learn about the problem. Only then refactor and redesign.
4. I agree with your various comments: my design problem will be solved when I drop it (nice phrasing, tye). I can do it all with a script. For each experiment, I could just read each of the spreadsheets, store the data in a hash, and then process it all. That IS making things easier, and thus better design.
Nonetheless, yesterday I wrote things down on paper and came up with a rough idea of how to process the data "on the fly". I'm going to write "dirty code" here and now (not compiled, and it probably won't compile; I'm on holiday without my computer), copying directly from the paper to the forum. I'm not going into the details of the implementation, though: just the rough idea.
The key to "reading on the go" was to create data Readers, and then pass them all to a method that reads lines simultaneously.
The objects I will create are:
- Reader::Data, which reads data with a get_next_data() method for a given file (a rough sketch follows after this list)
- Experiment, which will know the directory path for the experiment, and the paths for its data files too
- possibly some objects for representing the data, or data sets, with methods to do some calculations on them. But that's another story.
And no roles needed, as I just don't need them for now.
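(A minimal sketch of what Reader::Data could look like, assuming Moose and Text::CSV; the attributes are invented, and only get_next_data() comes from the description above.)

    package Reader::Data;
    use Moose;
    use Text::CSV;

    has 'file' => ( is => 'ro', isa => 'Str', required => 1 );
    has '_fh'  => ( is => 'rw' );
    has '_csv' => ( is => 'rw' );

    sub BUILD {
        my $self = shift;
        my $csv = Text::CSV->new( { binary => 1 } )
            or die Text::CSV->error_diag;
        open my $fh, '<', $self->file or die $self->file . ": $!";
        $self->_csv($csv);
        $self->_fh($fh);
    }

    # Returns the next row as an arrayref, or undef at end of file.
    sub get_next_data {
        my $self = shift;
        return $self->_csv->getline( $self->_fh );
    }

    1;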
Here is what I would have done (but won't, thanks to what you all said), for those who are interested:
# - - - process_bio_data_in_directory.pl
# Responsibility: process global results for all experiments.

# Where to find the experiment directories and data files
my $DATA_DIR             = 'C:/bio_data/';
my $experiment_dir_regex = qr/Exp/;
my $data_file_regex      = qr/data_/;

# 1. Find all the experiment directories, and their related data files
my $list_experiments = find_experiments_and_their_data_files_in({
    dir                   => $DATA_DIR,
    with_experiment_regex => $experiment_dir_regex,
    with_data_file_regex  => $data_file_regex,
});
### Comment for the reader:
### an Experiment object will have a path, and a list of data_files.
### I feel that this class makes my code more readable, and I can use
### $data_file_regex here, and not have to think about it again.

# 2. Process the data for each experiment
my %global_results;
EXPERIMENT:
foreach my $experiment ( @{$list_experiments} ) {
    my $hash_aggr_infos = calculate_aggr_infos_for_experiment($experiment);
    process_global_results_with( $hash_aggr_infos, \%global_results );
}
Then, the “on the go” data processing is organized in calculate_aggr_infos_for_experiment().
# - - - Calculate/AggrInfos.pm
package Calculate::AggrInfos;
# Responsibility: calculate the aggregated information for one experiment.

sub calculate_aggr_infos_for_experiment {
    my $experiment = shift;

    # 1. Initialize Readers for each data file
    my $hash_data_reader_of_file =
        initialize_data_file_readers_for_experiment($experiment);

    # 2. Calculate aggregated information
    my $hash_aggr_infos =
        calculate_aggr_infos_with_readers($hash_data_reader_of_file);

    return $hash_aggr_infos;
} # - - - end sub calculate_aggr_infos_for_experiment()
And next, the calculate_aggr_infos_with_readers() subroutine, which reads the files in parallel and processes them.
# in the same file and package as before
sub calculate_aggr_infos_with_readers {
    my $hash_data_reader_of_file = shift;
    my @data_files = keys %{$hash_data_reader_of_file};

    DATA:
    while ( my $hash_data_now
        = get_next_data_for_all_readers($hash_data_reader_of_file) )
    {
        # check that the time value is the same for each set of data
        check_if_time_is_the_same_for_all_data($hash_data_now);

        # calculate aggregated information
        my $hash_aggr_infos = calculate_aggr_infos_from_data($hash_data_now);
        # it's a bit more complex, as I need a bit of "past data history"
        # to calculate the aggregated information
    } # end while (DATA)
} # - - - end sub calculate_aggr_infos_with_readers()
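(For completeness, get_next_data_for_all_readers() could look roughly like this — a sketch assuming each Reader's get_next_data() returns undef at end of file, and that all files have the same number of lines:)

    sub get_next_data_for_all_readers {
        my $hash_data_reader_of_file = shift;

        my %data_now;
        foreach my $file ( keys %{$hash_data_reader_of_file} ) {
            my $data = $hash_data_reader_of_file->{$file}->get_next_data();
            return if !defined $data;   # one file exhausted: end the DATA loop
            $data_now{$file} = $data;
        }
        return \%data_now;
    }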
That's it! I won't go into more detail.
Sorry for posting code that doesn't work and must contain many mistakes. It can't be nice to read.
Any comments are very welcome, if you had the courage to read all this. Even just on code layout, or the way I name my variables. I'd like to improve readability, too.
Re: Design elegance : How to best design this simple program ?
by daxim (Curate) on Jun 18, 2012 at 17:19 UTC
This is one of those "impossible" questions. Without the whole code, or at least precise specs and information about what else is involved, no one can give serious recommendations about the design. The outline in the paragraph starting with "With this program, I want to …" sounds like this could be done with a single script with a bunch of subroutines; a class would already be over-engineered.
I say: spend less time thinking about the code structure (classes/roles) itself, and more time documenting the edges between your units of code, i.e. subroutines and their parameters and output, pre- and post-conditions. These parts will be immensely useful for writing tests, and will live on even after refactoring, upgrading and other maintenance.
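To make that concrete, documenting one such edge might look like this (the subroutine and its contract are invented for illustration):

    =head2 calculate_aggr_infos_for_experiment( $experiment )

    Input : an Experiment object (directory path plus its data files).
    Output: hashref of aggregated values, keyed by timestamp.
    Pre   : all data files have the same line count and matching timestamps.
    Post  : input files are unmodified; dies if the timestamps disagree.

    =cut

and a test that pins the contract down:

    use Test::More;
    # $experiment would be built in the test setup
    my $aggr = calculate_aggr_infos_for_experiment($experiment);
    is( ref $aggr, 'HASH', 'returns a hashref of aggregated values' );
    done_testing();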
Re: Design elegance : How to best design this simple program ?
by cavac (Prior) on Jun 18, 2012 at 20:08 UTC
I think you are overthinking the problem. Don't just go and design a whole program before you have written a few simple tests to... well, test out your understanding of the problem.
Then, expand on what you have learned.
I know, that's not how they teach it in school. But so far (20+ years of software development), this approach has served me quite well. Of course, everyone has their own methods and processes for writing a program (no matter what company policy says).
Let's try it with a simple example. Maybe I can explain how I usually go about writing a program like this. I won't go too far into it, just the first few steps along the way.
Ok, say you have some CSV files with which you manage your family's central piggy bank (see note below). For every family member, you keep a separate file in which you record each deposit and withdrawal. Overdrawing is possible as long as there is money in the pig. You want to find out two things: first, the amount of money each family member has (or owes you), and second, the total amount of money left in the bank ("the Central PIG Money Storage Inc. (non-profit)").
There are four entries in the piggybank directory:
piggybank/brother.csv
piggybank/sister.csv
piggybank/mother.csv
piggybank/test.csv
brother.csv reads:
deposit;1
beer;-30
beer;-20
won bet;200
sister.csv reads:
deposit;1000
buy coffee;-20
buy lunch;-30
new shoes;-500
deposit;20
mother.csv reads:
deposit;50
deposit;30
deposit;300
small car accident;-550
test.csv is a directory that your brother made to break your program ;-)
The first step is of course finding the filenames and making sure they are in fact files (users can be accidentally-on-purpose very creative about these things). Just print out what we find and generate warnings about non-files:
#!/usr/bin/env perl
use strict;
use warnings;
# get a list of all CSV files in the piggybank directory
my @fnames = glob('piggybank/*.csv');
foreach my $fname (@fnames) {
if(!-f $fname) {
print STDERR "$fname is not a file!\n";
next;
}
print "Found data file $fname\n";
}
Ok, that works. For this example, we won't bother using a "real" parser module. You should for your program, but it would make this example too complex. In this step, we just want to output the content of each file, so we add a subroutine and call it for every file. Here's the modified code:
#!/usr/bin/env perl
use strict;
use warnings;
# get a list of all CSV files in the piggybank directory
my @fnames = glob('piggybank/*.csv');
foreach my $fname (@fnames) {
if(!-f $fname) {
print STDERR "$fname is not a file!\n";
next;
}
readAccount($fname);
}
sub readAccount {
my ($fname) = @_;
open(my $fh, "<", $fname) or die($!);
foreach my $line (<$fh>) {
chomp $line;
print $fname, ': ', $line, "\n";
}
close $fh;
}
The next part is doing the sum for each file and printing the result. We know that every valid line is in the format
sometext;value
where withdrawals are always negative numbers and deposits positive ones. So we do it very simply, matching the line with a regular expression and taking everything after the semicolon (a bit error-prone, but enough for this example). Treat that scalar as a number and just add it to the account's balance.
Ok, here we go, we only have to modify readAccount() for this.
sub readAccount {
my ($fname) = @_;
my $balance = 0;
open(my $fh, "<", $fname) or die($!);
foreach my $line (<$fh>) {
chomp $line;
if($line =~ /(.+)\;(.+)/) {
$balance += $2;
}
}
close $fh;
print "$fname balance: $balance\n";
}
Now, we're nearly there. All that's left to do is the total balance of our piggy bank. We already have the balance for each individual account. We just have to modify readAccount() to return it, sum it all up in the main loop, and then print it out.
#!/usr/bin/env perl
use strict;
use warnings;
# get a list of all CSV files in the piggybank directory
my @fnames = glob('piggybank/*.csv');
my $total = 0;
foreach my $fname (@fnames) {
if(!-f $fname) {
print STDERR "$fname is not a file!\n";
next;
}
$total += readAccount($fname);
}
print "Money left in the piggybank: $total\n";
sub readAccount {
my ($fname) = @_;
my $balance = 0;
open(my $fh, "<", $fname) or die($!);
foreach my $line (<$fh>) {
chomp $line;
if($line =~ /(.+)\;(.+)/) {
$balance += $2;
}
}
close $fh;
print "$fname balance: $balance\n";
return $balance;
}
This is the final output of the script (I called it bankman.pl):
piggybank/brother.csv balance: 151
piggybank/test.csv is not a file!
piggybank/mother.csv balance: -170
piggybank/sister.csv balance: 470
Money left in the piggybank: 451
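(As mentioned above, a real parser module is what you'd want outside of an example. With Text::CSV and a semicolon separator, readAccount() might look something like this — a sketch, not a drop-in:)

    use Text::CSV;

    sub readAccount {
        my ($fname) = @_;
        my $csv = Text::CSV->new( { binary => 1, sep_char => ';' } )
            or die Text::CSV->error_diag;
        my $balance = 0;
        open( my $fh, "<", $fname ) or die($!);
        while ( my $row = $csv->getline($fh) ) {
            $balance += $row->[1];    # second column holds the amount
        }
        close $fh;
        print "$fname balance: $balance\n";
        return $balance;
    }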
No need to write complicated designs (which won't work out exactly as planned most of the time anyway). Making an "overview" design sketch for bigger projects is a good thing, but don't get bogged down in the details.
Since you learn the most about the problem at hand by actually solving it hands-on, the best time to draw out the theoretical design for the program is after you have written it. That's why many experienced coders write a quick and dirty proof of concept (and maybe some test cases) first, then tackle designing an elegant, optimized solution...
...that is, if it's still required. From my personal experience, the more often you do this, the more often you will come up with a quick-hacked proof of concept that is good and fast enough to also be the final solution.
Another thing you'll find: most of the time you don't actually have to optimize and squeeze every bit of performance you can get out of your code. A thousand million bytes seems like a huge amount of data to a human. For a computer capable of doing more than 3 billion operations every second, not so much.
While I was running the example program (which took about a second, including loading the perl binary, compiling/parsing/running the program, accessing the disks, and printing the results to a graphical terminal), I was also running a YouTube video with audio through my USB headphones (which shuffles much more data around in memory than parsing a few megabytes of data files), and my CPU was barely used at all.
So, to conclude: just try solving your problem step by step. Once you have found a working solution, you can still decide if it's worth a rewrite with a clean, simple and elegant design, or if you want to work on the next, even more exciting problem that needs solving.
Note to self: Solving problems can be addictive. Remember to leave some for friends and coworkers.
Note on piggy banks (from Wikipedia): Piggy bank (sometimes penny bank or money box) is the traditional name of a coin accumulation and storage receptacle. Sorry, Wikipedia editors, could you describe that even less clearly?
"You have reached the Monastery. All our helpdesk monks are busy at the moment. Please press "1" to instantly donate 10 currency units for a good cause or press "2" to hang up. Or you can dial "12" to get connected directly to second level support."
Re: Design elegance : How to best design this simple program ?
by pvaldes (Chaplain) on Jun 18, 2012 at 18:37 UTC
- find directories with spreadsheets of interest
Please define "spreadsheets of interest" How do you think you could take this files apart and not another spreadsheets uninteresting? that's a point that you'll need to solve
i need information from ALL of the spreadsheets in one directory
read about glob and wilcards like *.ods, *.xls or so
extension (file type) of your spreadsheets?
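For example, glob accepts brace alternatives, so one pattern can cover several extensions:

    # match several spreadsheet extensions in one glob
    my @sheets = glob('some_dir/*.{ods,xls,xlsx,csv}');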
and I need information from the lines around it: $num_line-2, $num_line-1 and $num_line+1.
Didn't understand this part, sorry
- read all spreadsheets from a directory simultaneously
Why simultaneously? Read file by file in the list provided by glob. A 1 MB file shouldn't be real trouble to process.
- calculate SOME_STUFF($num_line) on each line
what type of stuff?
- analyze and display the results
Probably the easy part.
Re: Design elegance : How to best design this simple program ?
by Jenda (Abbot) on Jun 19, 2012 at 09:13 UTC
If I were to do something like this, I'd forget about trying to work directly with the spreadsheets. I'd look at the data in them, design a database schema, import the data from the files into a database, and compute whatever I needed using SQL. The point is that even though you may have a fairly well-defined task now, it's almost inevitable that sooner rather than later you will need to compute something different. And then it will be nice to already have the data in an easy-to-work-with format.
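For instance, with DBI and DBD::SQLite (table and column names invented for illustration), the import-then-query approach is only a few lines:

    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=experiments.db',
        '', '', { RaiseError => 1 } );

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS measurement (
            experiment TEXT, sensor TEXT, ts TEXT, value REAL
        )
    });

    my $ins = $dbh->prepare(
        'INSERT INTO measurement (experiment, sensor, ts, value)
         VALUES (?, ?, ?, ?)'
    );
    # ... $ins->execute($exp, $sensor, $ts, $value) for each parsed row ...

    # Later, aggregation becomes plain SQL instead of hand-rolled loops:
    my $rows = $dbh->selectall_arrayref(q{
        SELECT experiment, ts, AVG(value)
        FROM   measurement
        GROUP  BY experiment, ts
    });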
As a design exercise, if you decide not to use a database, it's IMHO best if you do it the best you can, learn from your mistakes, refactor the code, see what was hard to change and what was easy, redesign the program from scratch, write new code, see where it was better or worse than the original design, ...
Jenda
Enoch was right!
Enjoy the last years of Rome.
Re: Design elegance : How to best design this simple program ?
by Anonymous Monk on Jun 19, 2012 at 15:57 UTC
Another thing that you can do with ODBC in the Microsoft environment is to use a spreadsheet (page) as a data source, e.g. for Microsoft Access (or even for queries made within an Excel spreadsheet). The very first thing you need to learn about software design is that There Is More Than One Way To Do It; therefore, explore all of the ways.
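(A sketch of that idea with DBI and DBD::ODBC; the driver string and sheet name depend on what is installed on your machine, so treat these as placeholders:)

    use DBI;

    # Query a worksheet as if it were a table; [Sheet1$] refers to the
    # sheet named "Sheet1" in the workbook given by DBQ.
    my $dbh = DBI->connect(
        'dbi:ODBC:Driver={Microsoft Excel Driver (*.xls)};DBQ=C:\data\exp1.xls',
        '', '', { RaiseError => 1 }
    );
    my $rows = $dbh->selectall_arrayref('SELECT * FROM [Sheet1$]');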