http://www.perlmonks.org?node_id=57110

Stamp_Guy has asked for the wisdom of the Perl Monks concerning the following question:

Here's one that's been puzzling me for a while:
I have a 10,100 item flat-file database. It is pipe deliminated. The first item is a catalog number. I need people to be able to pass two arguements to the script that tell the first and last catalog number that they want to see and have the script print out everything in between - often several hundred results. The main problem is that many entries have the same catalog number but different other fields. I need all the records that contain the first catalog and last catalog to be displayed with the rest.
So basically what I've done so far is used a control number system - numbered the lines of the database and used those numbers instead of the catalog numbers as arguements. This is extremely cumbersome when I have to add or delete an item from the list. Is there any easier way that you guys can think of to do this?

Replies are listed 'Best First'.
Re: An odd flat file database question
by eg (Friar) on Feb 08, 2001 at 06:46 UTC

    I've read through this post a couple of times, and I don't quite see what your question is.

    If you've got a 10k line flat-file, I think it's time for you to consider moving to a more flexible and robust database system. Mysql, PostgreSQL and the venerable Berkeley DB system are popular free (at least as in beer) options for you.

    You "control number system" is essentially the unique primary key in your database. It's a pretty standard practice. In what way do you find it "cumbersome". If it's a problem with re-keying your database every time you add or remove an item, my advice is don't. When you remove an item, that primary key is gone forever, when you add an item, just increment your max primary key. Save a copy of the max primary key somewhere else so you don't need to run through your entire database to find out what it is. Since your primary query is by catalog number, keep your database sorted by that. (If I've guessed your problem wrong, then I apologize. :)

    It also sounds like you have a database normalization problem that you'll want to address eventually, but I guess that can wait.

    Update: sounds like you want to learn how to tie an array ... maybe check out DB_File.

      Thanks for your information about the database normalization. I'm reading that right now. I am a total novice in database programming. I have never even messed with any SQL stuff before and have dealt solely with flat file databases. I would really like to learn how to use SQL databases but I really don't know where to begin learning. Can you guys point me in the right direction? Also, the server this site I'm working with is hosted on has no useful documentation. How in the world do I even find out whether there is an SQL server installed on it. Pardon all my ignorance I'm kind of a newbie.
        Chances are, if this is a closed system and you have no idea if there is a SQL server somewhere hidden, then there probably is no SQL server. If you can, download MySQL and the docs will be installed along with it. You might start also here in our Tutorial section on DBI. I performed a simple web search on google for "MySQL tutorial" and came up with this helpful link. With so many entries, it's time to upgrade, otherwise you're not only wasting your own time, but precious processing time.
        AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
Re: An odd flat file database question
by merlyn (Sage) on Feb 08, 2001 at 06:30 UTC
    I'm trying to think of a way in which you would not get the endpoints of a range. Perhaps if you showed some code, we could help more.

    -- Randal L. Schwartz, Perl hacker

      If there are 10 records with the final catalog number, I can easily see naive code only printing one of them.
      Hey Merlyn, My code accepts two arguements: First and Last. It just goes through the database and when it finds the first, it starts a loop that prints and increments a counter until it reaches the last number. It then prints the footer and exits. Really simple, really bulky, really sloppy, and a pain in the rear end to update. Also, I'm reading in the entire file into an array each time I do this. The file is about 250k. Do you think that it is pushing it to do that? Is there any quicker, perhaps more memory efficient way to do it? I'd give you the code, but I really don't think it will do any good after explaining what I did above. Let me know.

        There are a couple, but they all involve changing the format in which your data is stored. As others in this thread have mentioned, you can use DB_File and its ilk, or (my recommendation) an RDBMS like MySQL or Postgres (both available for the cost of a download, with plenty of online help). Perl's DBI interface is robust and mature, so you shouldn't have too much trouble making the transition. It can't hurt to know some SQL =)

        Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: An odd flat file database question
by coolmichael (Deacon) on Feb 08, 2001 at 08:27 UTC
    if you assume that $first and $last are the first and last catalog numbers the person wants, you could do this:
    open FILE, "somefile.txt" || die "Blah: $!\n"; while(<FILE>){ m/^(\d+)\|/; print if(($first <= $1) and ($1 <= $last)); }
    I'm not sure the regex is syntatically right. It's supposed to match all digits from the start of the line to the first pipe. it isn't tested, but I think it will work.

    Michael.

Re: An odd flat file database question
by Anonymous Monk on Feb 08, 2001 at 19:21 UTC
    This is not a nice way of doing it, but would ensure that every correct catalog number is caught. The trick is to run a loop going through every line (yeah ... all 10,100) and comparing the number to the range wanted. The loop wouldn't exit at the first out of range catalog number. It's not very neat but it gets the job done.

    Alternatively if the catalog is in order (ie: the file is ordered by ascending catalog numbers), you could exit the loop when the catalog number is equal to $last+1. Thereby catching all the references to $last.

Re: An odd flat file database question
by ftforger (Sexton) on Feb 09, 2001 at 00:38 UTC
    While your db may have other problems (which others have made suggestions about) I'll answer with one method of doing what you are asking for. Load your file into @array. Then start from the top of the array looking for a match on your first key. The first match is the starting index for your output. Next, use a reverse for loop, ie start at the last line of the array and decrement by one, until you get a match. That is the last line you want to output. Then simply print out everything from your first index to your second index. A quick fix that will hold off the dogs while you are learning to program a real dbi.
Re: An odd flat file database question
by Anonymous Monk on Feb 09, 2001 at 01:41 UTC
    If you have no access to any kind of database (eg:MySql) you could use hash tables. Otherwise: 1.Create two hash tables A & B. Let A contain Catalog# as its key and frequency as its value. Let B contains "Catalog#|seq#" as its key and pipe delimited fields as its value (String). seq# <- The frequency (incremented value) in hash "A" . 2. When a user enters the range of catalog#s: You may use the sorted list of key values from hash "A" to get all unique catalog#s and its corresponding frequency. And utilize a small loop to fetch all the records user wanted from hash "B". (Note: It may take time to load your data into hash table.) Uma. (umasuga@home.com)
Re: An odd flat file database question
by 2501 (Pilgrim) on Feb 08, 2001 at 07:17 UTC
    I am a little loopy tonight
    are you looking for something like:
    while(<catalog>){ print "$_\n" if(/$catnum/ ... /$catnum2/); }
      See Re (tilly) 2: An odd flat file database question for an explanation of the bug in your code. In any case the problem is a data structure issue. Keeping a sorted list may be convenient for spitting data out, but is inconvenient when it comes to inserts and deletes. The simplest answer in this case is to use a BTree with DB_File setting the R_DUP flag on. You cannot get the features you need through the tied interface, but the OO interface gives you the ability to quickly access by catalog ID, insert and delete, and have duplicate entries.

      Of course if the complexity of the problem grows a more sophisticated database would be a better fit...

Re: An odd flat file database question
by Stamp_Guy (Monk) on Feb 10, 2001 at 23:31 UTC
    Seems that everyone is saying "GO SQL!", so I've started to do some research into it, but I've got a few questions:

    • Is is possible to set it up so I could develop the program on my Win98 desktop and have it completely portable to a Linux server with minimal changes? Basically: is it cross-platform?
    • How hard is it to learn SQL?
    • Should I get a really good handle on Perl and Linux before I start messing with any SQL stuff?
    • Is there any natively supported database system that Perl uses? I've been reading a book called Teach Yourself Perl 5 in 24 Hours (yea right, 24 hours!) and it had a little example of a program that uses DBMs. Where could I find some more information in newbie language for that?
    As you can see, this newbie's got tons of questions. Perhaps some of these questions are stupid, but I really want to learn. I appreciate all the help I'm getting from the people here! Thanks guys!
      First and foremost, do not confuse SQL as a language or database format. While it has the word "language" in its name, it's really a standard for obtaining, and futzing with data across many different development tools (languages) and database formats/Products (MySQL, Oracle, Sybase, PostGres, Excess, xBASE, Paradox, yadda, yadda, yadda).

      SQL is a standard for manipulating and defining databases. It has two types of statements:

      • Data Manipulation Language (DML): SELECT, UPDATE, INSERT, and DELETE.
      • Data Definition Language (DDL): CREATE, DROP, ALTER, GRANT.

      Each statement is aptly named and operates against an object that generally makes sense.

      A decent overview will help. (See below)

      In your case, you're looking for something along these lines:

      select c.* from Catalog c where ( c."CatalogId" < $lovalue ) AND ( c."CatalogID" > $hivalue )

      This returns all records from the Catalog table that have Catalog ID's in the range specified by the $lovalue and $hivalue variables. (The specific syntax will depend greatly on the database format you choose. For example MySQL has slight syntax variations over Oracle and ODBC.)

      With most DBI drivers, this will give you an array of matched records, that you can then print very quickly using whatever approach you're most comfortable with.

      To gain the cross platform success you're looking for, you need to choose your database format with care. I recommend starting with mySQL, for though it doesn't completely offer the features of a complete client/remote server database format, it is cross platform, easy to install, and free. Should you eventually outgrow it, you should be able to convert your SQL to your new format with little to no trouble.

      Like chess, the basics of SQL are easy to learn. Those eight statements I listed earlier are (arguably) pretty much the heart of SQL itself. It gets a little harder when you start learning the extensions and features of your database format. For example, many remote database servers support stored procedures and triggers, which are (essentially) chunks of code run by the database server to validate (and respond to) your data values and what the users does with them. This can be quite daunting, but it can be very powerful. Learn it as you can.

      Personally, I wouldn't worry about mastering Linux before tackling SQL. Get the Cheetah book; it'll help a great deal. It goes into the basics, offers a decent SQL tutorial, and walks you through the process very carefully. Also, it provides a great reference to various DBI drivers, which will help you make the most informed decision regarding the file format.

      As far as the latest AM advice is concerned, she (assumed) was offering advice on how to similate the SELECT statement without using SQL.

      Assuming that you're storing your data in a hash, she suggested building a second hash that sorted all records in order of your desired sort order. This hash contains two values: the value you're sorting on as the key and a pointer (or reference) to the "real" database record in the original hash. (Note that if you have multiple matches to the same catalog number, the second hash element should be an array of references to each unique record.) That way, you can quickly determine which set of records you're after, simply by examining the keys you're dealing with.

      --f