
File Loader (load the content of a file and insert into DB)

by r34d0nl1 (Pilgrim)
on May 20, 2005 at 20:08 UTC ( #459113=perlquestion )

r34d0nl1 has asked for the wisdom of the Perl Monks concerning the following question:

I'm going to develop (with your help, I hope) a loader that will receive files dropped
into a directory X, read each file, do some verification of its content,
and insert the information from the text file into a database.
What do you think would be better: a script that runs every minute from crontab,
or a program running as a service?
It's going to run under HP-UX with Perl version 5.6.1.
I would also like to see some examples of loaders (this one will 'talk' to Oracle);
I've never written one before.
I wonder if you could provide me with some links and directions.

Thanks in advance for your help, masters.

Replies are listed 'Best First'.
Re: File Loader (load the content of a file and insert into DB)
by jpeg (Chaplain) on May 20, 2005 at 20:40 UTC
    In my experience, writing a program that stays open and monitors a directory is easier: you readdir the contents, sleep, and readdir again, comparing the arrays and firing off a sub to process the new files. Create a hash and get the mtimes from stat if you want to check whether files inside the dir have changed.
    If you wanted to write a script that exits and is restarted by crond, you could look at Storable to persist what you've seen between runs.
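    A minimal sketch of that polling approach (the sub and variable names here are my own invention, not from any particular module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the entries in $dir that are new or whose mtime changed since
# the last pass; %$seen persists state between calls (filename => mtime).
sub scan_dir {
    my ($dir, $seen) = @_;
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @entries = grep { !/^\.\.?$/ } readdir $dh;
    closedir $dh;

    my @new_or_changed;
    for my $name (@entries) {
        my $mtime = (stat("$dir/$name"))[9];
        if (!exists $seen->{$name} || $seen->{$name} != $mtime) {
            push @new_or_changed, $name;
            $seen->{$name} = $mtime;
        }
    }
    return @new_or_changed;
}

# The daemon loop would then be, roughly:
# my %seen;
# while (1) {
#     process_file($_) for scan_dir('/path/to/incoming', \%seen);
#     sleep 60;
# }
```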

    As far as talking to Oracle - it's easy enough; perldoc DBD::Oracle should tell you what you need to know. It's not much different than talking to any other DB.

    $dbh = DBI->connect("dbi:Oracle:host=$orasrvr;sid=$sid", "$uname", "$pass");
    should get you started.

    Loaders can be as simple or complicated as the business logic dictates. You process each line and decide what to do with it. I've had to pull a table from a db into a hash and check whether each line of a file existed in the hash, and I've had simple tasks like building a SQL INSERT query around the data in each line.
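    To make that concrete, here's a hedged sketch of a line-based loader. The DBI calls follow the interface documented in perldoc DBI; the table name, column list, and pipe-delimited field layout are all invented for illustration, and the connect/execute part is commented out because it needs a live Oracle instance:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one pipe-delimited input line (id|name|amount) into bind values;
# returns undef for malformed lines so the caller can skip or flag them.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\|/, $line;
    return undef unless @fields == 3 && $fields[0] =~ /^\d+$/;
    return \@fields;
}

# use DBI;
# my $dbh = DBI->connect("dbi:Oracle:host=$orasrvr;sid=$sid",
#                        $uname, $pass, { RaiseError => 1, AutoCommit => 0 });
# my $sth = $dbh->prepare(
#     'INSERT INTO loaded_data (id, name, amount) VALUES (?, ?, ?)');
# while (my $line = <$fh>) {
#     my $fields = parse_line($line) or next;    # or log the bad line
#     $sth->execute(@$fields);
# }
# $dbh->commit;
```

    Placeholders (the ? marks) avoid quoting problems and let Oracle reuse the statement plan across rows.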

Re: File Loader (load the content of a file and insert into DB)
by holli (Monsignor) on May 20, 2005 at 20:25 UTC
Re: File Loader (load the content of a file and insert into DB)
by jhourcle (Prior) on May 20, 2005 at 21:46 UTC
    Depending on how fast you're going to be pumping data into Oracle, you might be better served creating a text file from the new data (after verifying it, etc.) and then calling Oracle's sqlldr (aka SQL*Loader).
      If it's an Oracle 9i db you could use external tables, which is essentially sqlldr reading files as they are needed. It's quite fast as well. Just define the directory, the filename and the file format and you're done.

      But that's getting off topic. On the cron-vs-looping-daemon question, I'd go with cron. Just make sure you don't read the same file twice if the first invocation of your code doesn't complete before the next run...
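      If you do go the sqlldr route, driving it from Perl might look roughly like this. The table name, column list, and paths are invented for illustration; check the SQL*Loader documentation for the control-file syntax your Oracle version supports (the system call is commented out since it needs sqlldr installed):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write a minimal SQL*Loader control file describing a pipe-delimited
# data file. Table and columns here are hypothetical.
sub write_control_file {
    my ($ctl_path, $data_path) = @_;
    open my $fh, '>', $ctl_path or die "Can't write $ctl_path: $!";
    print $fh <<"EOF";
LOAD DATA
INFILE '$data_path'
APPEND INTO TABLE loaded_data
FIELDS TERMINATED BY '|'
(id, name, amount)
EOF
    close $fh;
}

# write_control_file('/tmp/load.ctl', '/tmp/batch.dat');
# system('sqlldr', "userid=$uname/$pass",
#        'control=/tmp/load.ctl', 'log=/tmp/load.log') == 0
#     or die "sqlldr failed: $?";
```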
Re: File Loader (load the content of a file and insert into DB)
by bgreenlee (Friar) on May 20, 2005 at 20:32 UTC

    As long as your app is ok with the one-minute delay, I'd just use the crontab. The advantages are:

    - don't have to worry about what happens if your service dies, or setting it up so that it starts automatically if the box reboots

    - you get email notification of errors for free (assuming that you're checking the email of that user's crontab)

    I'm not sure what you mean by "loaders". Do you mean a database API? You probably want DBD::Oracle.


      There are definitely pitfalls of using cron for this kind of thing for production code. Particularly when the cycles are short.

      You should consider what will happen if the task takes longer than the period you've scheduled it for in cron. Overlapping runs are usually bad, if for no other reason than the downward performance spiral when the system gets REALLY busy. Various locking schemes can avoid the problems of overlapping runs, but most of them are trickier than they seem at first, and there are race conditions to consider.

      But, you make good points about the simplicity of a crontab-based solution.
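      One common way to guard against overlapping cron runs is a non-blocking flock on a well-known lock file: if the lock is already held, the new invocation just exits and lets the next cron cycle try again. The lock file path below is an assumption:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX LOCK_NB);

# Try to take an exclusive, non-blocking lock. Returns the open handle
# on success (keep it in scope to hold the lock), or undef if another
# process already holds it.
sub acquire_lock {
    my ($path) = @_;
    open my $fh, '>', $path or die "Can't open $path: $!";
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# In the cron job itself:
# my $lock = acquire_lock('/tmp/loader.lock')
#     or exit 0;    # a previous run is still going; try again next minute
# ... do the actual loading; the lock is released when $lock goes away
```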

Re: File Loader (load the content of a file and insert into DB)
by Anonymous Monk on May 23, 2005 at 17:14 UTC
    One thing you have to watch out for here is that the file has actually been fully written. I recently wrote something like this using SGI::FAM, and every once in a while I would get blank contents... I would highly advise using flock to make sure that the new file has finished being copied/moved/written before you start processing it.
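    Note that flock only helps when the writing process also takes the lock. When you don't control the writer (e.g. files arriving by FTP), a common workaround is to poll the file size and only process the file once it has stopped growing. A sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns true if $path's size did not change over a $wait-second window.
# Not bulletproof (a writer could stall mid-transfer), but it catches the
# common case of a file still being copied in.
sub is_stable {
    my ($path, $wait) = @_;
    my $before = -s $path;
    sleep $wait;
    my $after = -s $path;
    return defined $before && defined $after && $before == $after;
}

# process_file($f) if is_stable($f, 3);   # e.g. require 3 quiet seconds
```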
Re: File Loader (load the content of a file and insert into DB)
by r34d0nl1 (Pilgrim) on Jun 14, 2005 at 18:10 UTC
    Well, I created a little monster that monitors a directory and loads each file (it puts the file into
    a hash and starts to load it if it does not change for 3 seconds
    - the way I found to tell whether the transfer of the file has finished).
    Then it creates a child that actually loads the file and reports OK or FAIL.
    And the parent loader checks the result and moves the file.
    The status of the file (ready to load, loading, or loaded) is stored in a hash until the file is loaded and moved.
    Basically the code (which has been running without problems for more than 3 weeks - and I can't believe it :d) is below. Please comment on it:
#!/usr/bin/perl
use strict;
use Data::Dumper;
use POSIX 'setsid';
use File::Copy;

$| = 1;    # unbuffered print

my $basedir           = '/home/emf003/loader';   # base loader dir
my $sourceFile        = "$basedir/raw";          # dir to check
my $logsDir           = "$basedir/logs";         # where lock and log files will be
my $loaderDcProgram   = "$basedir/";             # program to run
my $moveFile2Dir      = "$basedir/arch";         # where files will be moved after being processed
my $errorFilesDir     = "$basedir/errorFiles";   # where files that could not be loaded will go
my $sleep             = 5;                       # how long to sleep for
my $maxProcess        = 10;                      # max number of processes in the stack
my $processCount      = 0;
my $timeToKill        = 180;                     # 3 minutes (180s) before killing a child process
my $maxInternalErrors = 100;                     # max number of errors before this daemon kills itself
my $internalErrors    = 0;                       # internal error counter, to avoid endless loops
my %files             = ();

$SIG{INT}  = \&catchKiller;   # deal with ctrl+c
$SIG{TERM} = \&catchKiller;   # deal with kill
$SIG{KILL} = \&catchKiller;   # (note: KILL cannot actually be caught)
$SIG{CHLD} = 'IGNORE';        # reap zombies :d like the Resident Evil girl does :d

sub daemonize {
    chdir '/'                 or die "Can't chdir to /: $!";
    open STDIN,  '/dev/null'  or die "Can't read /dev/null: $!";
    open STDOUT, '>/dev/null' or die "Can't write to /dev/null: $!";
    defined(my $pid = fork)   or die "Can't fork: $!";
    exit if $pid;
    setsid                    or die "Can't start a new session: $!";
    open STDERR, '>&STDOUT'   or die "Can't dup stdout: $!";
}

&daemonize();
&check_env();    # very basic env check

print STATUS "\nloader started at " . localtime() . "\n";

while (1) {
    &createInfoFile();
    opendir(BASE, $sourceFile) or die "Can't open dir $sourceFile: $!\n";
    # any file, minus . and ..
    my @dirContent = grep { !/^\.{1,2}$/ } readdir BASE;
    closedir BASE;

    # main idea: get the file and its size, wait some seconds, and check
    # whether its size has changed. If not, then let's load it.
    # STATUS: A (active - will be checked to see if the file size changed)
    #         R (ready to be loaded)
    #         L (loading - loader is processing the file)
    #         X (file has been loaded and process finished - can be deleted)
    #         F (file has been processed but returned errors)
    for (@dirContent) {
        if (!exists $files{$_}) {
            open(FH, ">>$sourceFile/$_");
            $files{$_}{FILESIZE}  = -s FH;
            $files{$_}{TIMESTAMP} = time;
            $files{$_}{STATUS}    = 'A';
            close FH;
        }
    }
    sleep $sleep;
    &mark2Load(\%files);

    foreach my $file (keys %files) {
        if ($files{$file}{STATUS} eq 'R') {
            if ($processCount < $maxProcess) {
                if (my $pid = fork) {
                    $files{$file}{STATUS}       = 'L';
                    $files{$file}{PROCESS}      = $pid;
                    $files{$file}{STARTPROCESS} = time;
                    print STATUS "\n PID $pid processing file $file";
                    $processCount++;
                }
                else {
                    exec("$loaderDcProgram $sourceFile/$file > $logsDir/$$.lock");
                    exit(0);
                }
            }
        }
        elsif ( $files{$file}{STATUS} eq 'L'
            && (time - $files{$file}{STARTPROCESS}) <= $timeToKill )
        {
            # check if the lock file still exists
            eval {
                open TEST, "< $logsDir/$files{$file}{PROCESS}.lock"
                    or die "not found ";
            };
            # update status if the file was not found
            if ($@ =~ /not found/) {
                $files{$file}{STATUS} = 'X';
            }
            else {
                my @lines = <TEST>;
                if ($lines[0] =~ /^OK$/) {
                    unlink("$logsDir/$files{$file}{PROCESS}.lock")
                        or makeLog("\n Could not unlink file $logsDir/$files{$file}{PROCESS}.lock $!");
                }
                elsif ($lines[0] =~ /^FAIL$/) {
                    $files{$file}{STATUS} = 'F';
                }
                # if the file has not been loaded correctly (OK), it should
                # have FAIL in the 1st line. Anyway, more statuses could be
                # implemented in a switch-like way.
            }
            close TEST;
        }
        elsif ( $files{$file}{STATUS} eq 'L'
            && (time - $files{$file}{STARTPROCESS}) > $timeToKill )
        {
            # Kill it. Show no mercy.
            kill 'TERM', $files{$file}{PROCESS};
        }
        elsif ($files{$file}{STATUS} eq 'X') {
            move("$sourceFile/$file", "$moveFile2Dir/$file");
            delete($files{$file});
            $processCount--;
        }
        elsif ($files{$file}{STATUS} eq 'F') {
            move("$sourceFile/$file", "$errorFilesDir/$file");
            move("$logsDir/$files{$file}{PROCESS}.lock", "$errorFilesDir/$file.err");
            delete($files{$file});
            $processCount--;
        }
    }
}

print Dumper(\%files);

#------------------------------------------------------------
sub mark2Load {
    my $files = shift;    # receives the hash as a reference
    foreach my $file (keys %{$files}) {
        # check whether time has passed and the file size hasn't changed.
        # if so, then we can try to load it.
        if ($files->{$file}{STATUS} eq 'A') {
            open(FH, ">>$sourceFile/$file");
            my $fileSize = -s FH;
            if (   (time - $files->{$file}{TIMESTAMP}) > 0
                && ($files->{$file}{FILESIZE} == $fileSize) )
            {
                $files->{$file}{STATUS} = 'R';
            }
            close FH;
        }
    }
}

#---------------------------------------------------------
sub catchKiller {
    my $sigName = shift;
    makeLog("\nReceived kill signal $sigName at " . localtime());
    die "\n I've been killed $sigName";
}

#--------------------------------------------------------
sub check_env {
    open TEST1, $basedir         or die "\n Can't open $basedir $!";
    open TEST2, $sourceFile      or die "\n Can't open $sourceFile $!";
    open TEST3, $logsDir         or die "\n Can't open $logsDir $!";
    open TEST4, $loaderDcProgram or die "\n Can't open $loaderDcProgram $!";
    die " $loaderDcProgram has no exec permission" unless (-x TEST4);
    close TEST4;
    close TEST1;
    close TEST2;
    close TEST3;
    open(STATUS, ">> $logsDir/LOADERSTATUS");
}

#---------------------------------------------------------
sub makeLog {
    print STATUS shift;
    $internalErrors++;
    print STATUS "\nErrors $internalErrors process: $processCount";
    if ($internalErrors >= $maxInternalErrors) {
        print STATUS "\nkilling myself ($$). too many errors. DEBUG info: ";
        print STATUS Dumper(\%files);
        die;
    }
}

sub END {
    print STATUS "\n closing loader " . localtime();
    close STATUS;
}

#--------------------------------------------------------
sub createInfoFile {
    open(STATUS, "> $logsDir/STATUS");
    print STATUS <<EOF;
Last Update: @{[scalar localtime()]}
process: $$
basedir: $basedir
sourcefile: $sourceFile
logs: $logsDir
program: $loaderDcProgram
move good: $moveFile2Dir
move error: $errorFilesDir
sleep time: $sleep
max process: $maxProcess
process count: $processCount
time to kill child: $timeToKill
Max errors: $maxInternalErrors
internal errors: $internalErrors
@{[Dumper(\%files)]}
EOF
    close STATUS;
}

Node Type: perlquestion [id://459113]
Approved by ww
Front-paged by ww