http://www.perlmonks.org?node_id=542862

hmbscully has asked for the wisdom of the Perl Monks concerning the following question:

I support an HTML form that I inherited six years ago. The form is around nine years old. At the time of its creation, submissions were low enough (a few thousand a year) that the choice was made to have the form (with an insane amount of client-side JavaScript validation) use a simple Perl script and email each submission as a discrete message to be dealt with, essentially by hand, on the receiving end. The alternative would have been to build a Java web application with an Oracle backend, but that was judged prohibitively expensive at the time given the volume of submissions.

Fast forward to now: the form remains essentially unchanged, except for a flatfile backup that I put in a year or so ago because we were losing some emails. Submissions for the first six months of this fiscal year alone are over half a million, still being processed in the same old way! The client has finally realized that this cannot continue. IT has promised an Oracle-backed, web-based application for several years now, but that solution still appears to be several years off.

I should mention that I am not part of the IT dept. I work in our web department as the web technician, and maintaining Perl scripts is only part of my job. Because I am not part of the IT bureaucracy, I am often asked by clients to come up with creative solutions, not involving massive IT projects, to help them fill the gaps until larger solutions come into place. This is one such issue.

What the client wants is for me to change the script so it doesn't send emails anymore but instead writes each request to a fixed-length text file that will accumulate the requests and then be passed to them once a day. They will import it into Access (or something else; I'm not entirely sure of, or responsible for, the import part) and do what they need to do with the data. I guess the key point is that I do not have access to a database and I'm working in flatfiles.

I'm not worried about writing the properly formatted file. My concerns arise when I consider the volume of requests this form will handle and a new requirement that the file, which has sensitive personal information in it, must be encrypted at all times.

I've started reading up on GPG and the modules that exist to support that need. But before I completely commit to this work (not surprisingly, this isn't the only project I'm working on), I'm trying to figure out whether this is even a valid task to attempt. It seems to me that if the form is getting several thousand requests a day, then locking the file, decrypting it, writing the new data, re-encrypting, and unlocking the file for every single submission may simply not work, and that I'm going to lose data.

I'd put my personal level of Perl expertise somewhere around advanced beginner. I have a degree in comp sci, I'm entirely self-taught in Perl, and I'm the only person in my company who works in Perl. I love the language, but because it's not the only thing I do, I don't have the time to spend advancing my skills unless it's for a specific project that requires me to do so. I'm saying this because I realize that a lot of my code might not be written in the best way to optimize performance. I wonder if I'm hesitating on this project because of my lack of knowledge of how to write optimized code, or if my spidey sense is tingling because this is a Bad. Idea., regardless of my level of expertise, and I need to tell the client that the HTML form/Perl script/flatfile/no database solution has finally come to the end of its usefulness and they have to deal with this another way.

My questions are:

  1. Is this a bad idea in the first place?
  2. If so, what's a good way to explain this to the client in layman's terms?
  3. Or is it only a bad idea when putting the GPG issue into the mix?
  4. If it's not a bad idea, what GPG modules are suggested for use? I've been looking and there seem to be a lot. How do I figure out which ones are the better ones?
  5. If it's not a bad idea, do I still need to flock the files before/after I encrypt/decrypt?

many thanks!


Replies are listed 'Best First'.
Re: Bad Idea? Questions of performance issues, file locking, and GPG
by BrowserUk (Patriarch) on Apr 12, 2006 at 16:18 UTC

    There will probably be better suggestions from the guys here with web experience, but it strikes me that you are tackling this, or rather you have been asked to tackle this at the wrong end.

    Rather than changing the data collection end, which already works, and getting into locking and all the stuff, it would be much simpler to utilise the natural and effective serialisation that the email queueing does for you. Leave the current mechanism in place and write a script to service the email account via pop3 or whatever, and place the data into the flatfile with whatever encryption is appropriate.

    Depending on your version of Windows, you could probably just set the encryption property on the flatfile, run the email-to-flatfile Perl program under a secure account, and the system would take care of encrypting it for you.

    If the flatfile to Access program is run under that same account, the decryption will be taken care of transparently also.

    Just a notion.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I like the idea, but what I failed to mention is that the email data collection actually doesn't work all the time. We lose hundreds of messages in gaps from time to time in some problem that IT has never been able to fix. This is part of the push to go to a flatfile only system and get out of the email business.

        Is this in a LAN or WAN or internet connected environment?


Re: Bad Idea? Questions of performance issues, file locking, and GPG
by kvale (Monsignor) on Apr 12, 2006 at 16:18 UTC
    I have written pre-forked server applications that have many children reading and writing to a flat-file database using file locking, and they can add tens of thousands of entries a day on a poky Pentium III laptop. So I don't think the file locking will be a bottleneck.
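
    As a rough illustration of the locking pattern being discussed, here is a minimal sketch of appending one record under an exclusive lock. The name append_record and the one-record-per-line layout are assumptions made for the example, not anything taken from the original script.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one record to a shared file while holding an exclusive lock.
# append_record and the newline-terminated layout are hypothetical.
sub append_record {
    my ($path, $record) = @_;

    open(my $fh, '>>', $path) or die "Cannot open $path: $!";

    # Blocks until every other writer has released its lock.
    flock($fh, LOCK_EX) or die "Cannot lock $path: $!";

    print {$fh} $record, "\n" or die "Cannot write to $path: $!";

    # close() flushes the buffer and releases the lock.
    close($fh) or die "Cannot close $path: $!";
    return 1;
}
```

    Each CGI process would call append_record once per submission. A writer that arrives during a surge simply waits in flock until the file is free, so nothing is dropped.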

    The encryption/decryption cycling may be a little slow, depending on the size of your database. One alternative is to write each entry as its own encrypted little file, and write another perl program to read all these little files in at the end of the day to produce a big encrypted file.
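
    A minimal sketch of the one-file-per-entry idea might look like the following. The names save_submission and pending_submissions, the sub_ prefix, and the .gpg suffix are all made up for illustration. The encryption step is passed in as a code ref ($encrypt) because which GPG wrapper to use is still an open question in this thread; the sketch does not assume any particular module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one submission to its own uniquely named file in $dir.
# $encrypt is a placeholder code ref: plaintext in, ciphertext out.
sub save_submission {
    my ($dir, $plaintext, $encrypt) = @_;

    my $ciphertext = $encrypt->($plaintext);

    # tempfile() creates a new file with a unique name, so concurrent
    # submissions never share a file and no locking is needed here.
    my ($fh, $name) = tempfile(
        'sub_XXXXXXXX',
        DIR    => $dir,
        SUFFIX => '.gpg',
        UNLINK => 0,
    );
    binmode $fh;
    print {$fh} $ciphertext or die "Cannot write $name: $!";
    close($fh) or die "Cannot close $name: $!";
    return $name;
}

# End-of-day job: list the per-submission files, sorted by file name
# (the names are random, so this is not arrival order).
sub pending_submissions {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot read $dir: $!";
    my @files = sort grep { /^sub_.*\.gpg$/ } readdir $dh;
    closedir $dh;
    return map { "$dir/$_" } @files;
}
```

    The end-of-day program would decrypt each file returned by pending_submissions, build the combined file, encrypt that once, and then delete the small files. Because the plaintext never touches disk in this sketch, the "encrypted at all times" requirement is met per submission.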

    -Mark

Re: Bad Idea? Questions of performance issues, file locking, and GPG
by jdtoronto (Prior) on Apr 12, 2006 at 16:25 UTC
    I agree with BrowserUk, you are being asked to deal with this totally backwards!

    If the volume of submissions is that high then they should contemplate involving IT and have the form submit into a database, then the data can be dealt with there through another interface, or it could be periodically downloaded to the client application.

    Typically I have handled this by exporting the data using another small script either into an SQL dump, or a CSV file, zip it up with a password and send it (either by direct download or by email). But have the client contemplate that the better solution - and a less time intensive one for their staff - may well be to approach the problem top down, rather than trying to implement a patchwork solution.

    jdtoronto

Re: Bad Idea? Questions of performance issues, file locking, and GPG
by ptum (Priest) on Apr 12, 2006 at 16:40 UTC

    Half a million records for a six-month period suggests that you are bringing in less than 3000 submissions a day, on average. This is not a particularly large number from a flat file perspective -- I wouldn't worry too much about optimization unless you have a practice of really going out of your way to make your code inefficient. :)

    I know you've said you don't have access to a database, but sometimes people say that when they mean "I don't have access to an instance of Oracle" or whatever the company uses for production data. Have you considered MySQL? It is pretty easy to install, and has most of the power of a 'full-fledged' commercial relational database. Encryption/decryption aside (because I know very little about such things), having your data stored in a relational database makes the extraction of that data in interesting ways much easier.

    Rather than blindly hand over the CSV file or whatever to your customers, I would go a step further and ask, "What do you do with this information?" I find that many times there is some simple task I can do while I still have the data that makes life much easier for my customers.


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
      Yes, we've considered MySQL... basically we've begged on our knees to be given any database, since we aren't allowed access to our Oracle system, but IT says no dice. They don't know MySQL and will not put it on our system no matter what proof of its benefits we provide. I don't have the admin rights on our systems to do it myself. I'm looking to do this stop-gap solution for now. I appreciate the replies and outside-the-box suggestions, but I want to assure you all that I am, unfortunately, extremely familiar with the data and what the user is doing with it all. I am glad to know that the file locking isn't an issue. Honestly, I never really thought it was, but enough non-Perl people have been questioning me on it that I began to doubt my previous thoughts.
Re: Bad Idea? Questions of performance issues, file locking, and GPG
by wfsp (Abbot) on Apr 12, 2006 at 17:20 UTC
    Perhaps have a look at the excellent DBM::Deep?

    The docs discuss locking and encryption issues and there is an export method. It is also pure Perl so it is very easy to set up.

    There is a discussion on "speed" at the end which includes the statement "At 3,000 keys, avg. speed is 1,982 keys/sec". As the author says, it is "pretty fast". :-)

    Hope that helps

Re: Bad Idea? Questions of performance issues, file locking, and GPG
by LanceDeeply (Chaplain) on Apr 12, 2006 at 16:38 UTC
    Just curious, what happened to the: ... flatfile backup that I put in a year or so because we were losing some emails ... ?

    Is that essentially the same as the file you are trying to write out now, except not encrypted?

    I think kvale's suggestion of storing each submission in a separate encrypted file and then running an end-of-day job to join them up and send them will be easiest to implement.
      It's basically the same file, but yes, the encryption thing is the sticking point. I have no qualms about creating a flatfile. I'm going to investigate the idea of separate encrypted files and the Deep module. Thanks!
Re: Bad Idea? Questions of performance issues, file locking, and GPG
by eric256 (Parson) on Apr 12, 2006 at 16:55 UTC

    To add to what the others have said: try first, then if there are bottlenecks, deal with them. It sounds like you're probably worried that locking, decrypting, appending, encrypting, and unlocking is going to take too long. If it turns out that it does, then you could always just encrypt the portion you are adding and append that. At the end of the day a script processes the file, decrypting each section to build the final CSV, which is then encrypted all at once. There are lots and lots of options available to you, but it's usually best to try it before deciding it's a bottleneck ;)

    BTW, the point of locking the file is that you won't lose data no matter how many people submit. It's just that if you have a surge, some of those people will have to wait for earlier processes to finish with the file. If your process took a full second, then if 30 people all hit submit at once, the last one waits 30 seconds. That sounds bad, except your process probably won't take a full second, and even if it does you will probably never have 30 people hit submit in the same second. Some quick (and probably wrong) math indicates that you receive about 0.03 submissions per second, or about 1 submission every 30 seconds, so I think you'll be okay ;)
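
    The quick math can be checked with a few lines of Perl. The 500,000 figure is the "over half a million" from the original question; the 182-day length for six months is an assumption, and the result is an average that says nothing about peak hours.

```perl
use strict;
use warnings;

my $submissions = 500_000;   # "over half a million" in six months
my $days        = 182;       # assumed length of the six-month period

my $per_day     = $submissions / $days;
my $per_second  = $per_day / 86_400;   # 86,400 seconds in a day
my $gap_seconds = 1 / $per_second;     # average time between submissions

printf "%.0f per day, %.3f per second, one every %.1f seconds\n",
    $per_day, $per_second, $gap_seconds;
```

    That works out to a bit under 3,000 submissions a day and roughly one every half minute on average, which agrees with the estimates given in this thread.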


    ___________
    Eric Hodges