
My first package - need help getting started

by Limbic~Region (Chancellor)
on Feb 27, 2003 at 01:30 UTC (#238963=perlquestion)
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Warning! This is going to be a long post.

I recently read this post by Elian and two things hit me like a ton of bricks:

  • If you're going to use it in more than two programs, make it a module
  • If doing it even once made your head hurt, throw it in a module

    I have to work with really nasty data. A typical record looks like the following:

    key : /C=US/A=BOGUS/P=ABC+DEF/O=CONN/OU=VALUE1/S=Region/G=Limbi
    	c/I=_/
    type : UR
    flags : DIRM ADMINM ADIC ADIM
    Alias-2 : wL_Region
    Alias-3 : Limbic Region
    Alias-4 :
    Alias-5 : Limbic._.Region
    Alias-6 :
    Alias-7 : ORG
    Alias-8 : CMP
    Alias-10 : O=ORG/OU=Some Big Division/CN=Limbic _. Region
    Alias-11 : Region
    Alias-12 : Region, Limbic _
    Alias-14 : Limbic _. Region at WT-CONN
    Alias-15 : ex:/o=BLANK/ou=ORG/cn=Recipients/cn=Mailboxes/cn=LRegion
    Alias-16 : WT:Limbic_Region
    Alias-17 :
    Alias-18 : /o=A.B.C.D./ou=Vermont Ave/cn=Recipients/cn=wt/cn=Limbic_Regi
    	on
    Full Name : Region, Limbic _.
    Post Office : WT-CONN
    Description : 999-555-1212 Some Big Company
    Tel : 999-555-1212
    Dept : Some Big Division
    Location : EMWS
    Address : 123 nowhere street
    City : everywhere
    State : MD
    Zip Code : 20874
    Building : BLAH
    Building-Code : MD-ABC
    owner : CONN

    I am typically batch processing records, not obtaining a single record. The problem is that the only tools we currently have for this are shell scripts with a lot of VERY ugly sed. Even then, there are a great many limitations. I have written a few custom scripts in Perl, but they really aren't reusable, as each task is different.
    This seems like a perfect place to use a module.

    Here are some issues I can see up front:

  • Some entries can span multiple lines, but a continuation line will always be indented by a tab
  • Some entries are not guaranteed to be unique. Only the key is guaranteed to be present and unique
  • The key is / delimited, but there may be an embedded, escaped \/ that is not a delimiter
  • I will need the key both in concatenated format and broken down into each piece
  • Should I make this OO?
  • These are the main things I will need to do:

  • Skip the record because it does not match criteria
  • Manipulate the record based off criteria
  • Print the record
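
The first few issues in the list above (tab-indented continuation lines, and a key that must be split on unescaped slashes only) could be handled along these lines. This is only a sketch: the helper names unfold and split_key are made up for illustration, and it assumes the escape is a single backslash directly before the slash and that a backslash is never itself escaped.

```perl
use strict;
use warnings;

# Join continuation lines: a newline followed by a tab continues the
# previous line (as described in the list above).
sub unfold {
    my ($text) = @_;
    $text =~ s/\n\t//g;
    return $text;
}

# Split a key such as /C=US/O=CONN/ on unescaped slashes only.
# Returns a list of [name, value] pairs in their original order.
sub split_key {
    my ($key) = @_;
    my @pairs;
    for my $piece (split m{(?<!\\)/}, $key) {
        next unless length $piece;              # skip the empty leading piece
        my ($name, $value) = split /=/, $piece, 2;
        $value = '' unless defined $value;
        $value =~ s{\\/}{/}g;                   # unescape embedded slashes
        push @pairs, [ $name, $value ];
    }
    return @pairs;
}

# Made-up sample key with one escaped slash and one continuation line.
my @pairs = split_key(unfold("/C=US/OU=A\\/B/S=Reg\n\tion/"));
print join(', ', map { "$_->[0]=$_->[1]" } @pairs), "\n";
```

Keeping the pieces as an ordered list of pairs means the concatenated form of the key can be rebuilt in the same order it was read.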

    Here is what I envision:

    #!/usr/bin/perl -w
    use strict;
    use DBParser; # My new module I haven't built yet

    open (OUTPUT,"> /tmp/somefile") or die $!;
    select OUTPUT;
    $|++;
    $/="";
    while (<>) {
        my $key = DBParser::KeyGrab($_);
        my %record = DBParser::DBParse($_);
        next unless ($record{$key}->{Key}->{Surname} eq "Region");
        $record{key}->{Tel} = "(123) 456-7890";
        print DBParser::PrintRecord;
    }

    I certainly don't want anyone to write this module for me - the whole point of the project is to learn. I do however want pointers, advice, snippets of code, etc. Feel free to reply as though this was a meditation, if all you have to offer is methodologies and not actual technical solutions.

    Thanks in advance - L~R

    Replies are listed 'Best First'.
    Re: My first package - need help getting started
    by hv (Parson) on Feb 27, 2003 at 02:52 UTC

      This may seem like a red herring, but the first thing that worries me is your choice of name, 'DBParser'. The thing about OO is that it is about objects, so the first question to ask is "what is the object here?". My first guess, looking at the data, is that each record represents a person (or perhaps something more general - an "entity", perhaps), in which case I'd be inclined to call the object "Person" (though I'd probably avoid namespace problems by using a prefix that represented the company name, or perhaps my name or the project name, depending on the scope). An alternative approach, if this record format can be used to describe a variety of things, is to name the class instead after the name of the record format; in that case, it might be worth having subclasses for each of the major different types of thing that the format can represent.

      Now, what does the data in a typical file look like: is it multiple records each starting with the 'Key' attribute? If so, I could imagine wanting to write the code like:

      use Person;

      for my $person (Person->parse_from_file('/tmp/somefile')) {
          next unless $person->surname eq 'Region';
          $person->tel('(123) 456-7890');
          print $person->text;
      }

      Let me be clear: this is just how I like to write my code, and other people (including yourself) will doubtless have different prejudices. The above code assumes that Person::parse_from_file() knows how to read a sequence of records from a file, turn each one into a "Person" object, and return the resulting list.

      It also assumes that these objects are opaque, so that all access is via methods: you can choose to make them transparent hashrefs with documented keys, but then (for example) you always need to do the work to split up the 'Full Name' key so that the 'Surname' key will be there in case someone looks at it, and it probably means you can't allow modification both by way of the 'Surname' field and directly in the 'Full Name' field, because by the time you need to write the record back out you won't know which value is correct.

      I tend to like what are sometimes called "polymorphic get/set accessors", which means that you can use the same method either without arguments to fetch the value, or with an argument to set it to a new value. Some others prefer to split such functionality into two methods, eg tel() and set_tel().

      I'm sure there are many other aspects worth talking about, but these are just some initial thoughts.

        Thank you for your valuable input.
        I see that my lack of OO understanding has been blatantly displayed. I also seem to have used keywords that were clear to me, but not to everyone. What I am trying to accomplish is the following:

      • Turn a record (bunch of lines) into a complex data structure that can be treated as a single entity.
      • Have the ability to manipulate that complex data structure
      • Access that complex data structure for printing in the same ugly format that I created it with.

        This appears to be what you have gleaned from my poor attempt at explaining this. As far as I am concerned, I do not have a preference on how the code should look as I am completely inexperienced at this. I appreciate the information, but I really do not understand how to code the opaque objects as you suggest. I know that the full key will always be static, even if the broken out pieces change as it will be printed externally. If you could show me some code to illustrate this - I would be very appreciative. If not, what you have already done is appreciated.

        You do not have to use my data to create the opaque object - just show me a template to see the methodology. I am a fairly adept student.

        Cheers - L~R

          Ok, let's assume that the opaque object is implemented internally as a hashref, and that the fullname has a simple format of "surname, initials". Here's a simplistic approach:

          package Person;

          # get/set accessor for the stored field
          sub fullname {
              my $self = shift;
              if (@_) {
                  $self->{fullname} = shift;
              }
              return $self->{fullname};
          }

          # derived from fullname; setting it rewrites fullname
          sub initials {
              my $self = shift;
              if (@_) {
                  $self->fullname(join ', ', $self->surname, shift);
              }
              (split /, /, $self->fullname, 2)[1];
          }

          sub surname {
              my $self = shift;
              if (@_) {
                  $self->fullname(join ', ', shift, $self->initials);
              }
              (split /, /, $self->fullname, 2)[0];
          }

          In practice, I'd write it a bit differently: I'd probably have many methods very similar to fullname(), and might well generate them rather than write each one out explicitly. Also, I'd probably cache the derived information like surname and initials, to avoid recalculating them each time, in which case I'd need to be careful to decache that information when the source (fullname in this case) changed.
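
The two ideas in the paragraph above (generating the near-identical accessors, and caching derived values until the source changes) might look roughly like this. It is a sketch, not the poster's code: the field list (tel, dept, city), the new constructor and the _parts cache key are all assumptions made for illustration, and surname and initials are read-only here to keep it short.

```perl
package Person;
use strict;
use warnings;

# Hypothetical constructor taking field => value pairs.
sub new {
    my ($class, %fields) = @_;
    return bless { %fields }, $class;
}

# Generate one get/set accessor per plain field.
for my $field (qw(tel dept city)) {
    no strict 'refs';
    *{"Person::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;
        return $self->{$field};
    };
}

# fullname is the source of truth; setting it drops the cached parts.
sub fullname {
    my $self = shift;
    if (@_) {
        $self->{fullname} = shift;
        delete $self->{_parts};
    }
    return $self->{fullname};
}

# Split "surname, initials" once and remember the result.
sub _parts {
    my $self = shift;
    $self->{_parts} ||= [ split /, /, $self->fullname, 2 ];
    return $self->{_parts};
}

sub surname  { $_[0]->_parts->[0] }
sub initials { $_[0]->_parts->[1] }

package main;

my $p = Person->new(fullname => 'Region, L', tel => '999-555-1212');
$p->fullname('Monk, P');    # invalidates the cached surname/initials
print $p->surname, "\n";
```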

          I'm surprised that you don't want the module to parse the data for you, since that seems to be a chunk of code that you'd otherwise need to repeat everywhere you deal with these records. But likely I've misunderstood what you're trying to do.

          I guess the most important thing, which I should have said before, is that documentation is the key, particularly in Perl: the docs for your class will say how you're allowed to use the object, and what you're allowed to assume about it. And in general, anything that the docs don't say is something you are not allowed to do or assume when using the class or its objects in other code.

    Re: My first package - need help getting started
    by djantzen (Priest) on Feb 27, 2003 at 02:08 UTC

      This seems like a reasonable place for a module. One thing to note is that in your sample interface you're basically splitting the logic between the module and the calling code. That the caller controls opening and reading the file, pulling out a hash for each entry whose structure the caller must know in advance in order to read, indicates that this isn't a complete modularization of responsibilities.

      To my mind, the clearest way to fix these issues is to go down the OO path, in which an instance of DBParser opens a file to read, controls iteration internally, and returns results that match your search criteria for you to print from the calling context or to pass to another module specifically for formatting output. Doing this gives a clean interface, separation of duties, and the ability to create further refined subclasses of both the parsing and printing components.

      The difficult part in doing this will be specifying the search criteria since the data is pretty hairy, so it would be good to start with a review of all the ways current scripts access it, and see if there's a method to the madness that you can tease out and formalize.

      "The dead do not recognize context" -- Kai, Lexx
        After reading a few of these replies, I realize how much I really don't know about what I am getting into. I do not want the module to do the parsing, I just want it to create an object that I can then manipulate. Each program's needs are going to be different. In my very fictitious/contrived example I obviously used the wrong key word, DBParser. It is really supposed to take a record and build an object. You have given me some food for thought, as have others.

        Thank you - L~R

    Re: My first package - need help getting started
    by pg (Canon) on Feb 27, 2003 at 03:18 UTC
      Nice thinking; here are some thoughts I have.

      I see three classes here:
      1. Parser
      2. Filter
      3. Formatter
      The data would flow in this direction: Parser => Filter => Formatter.

      1. The Parser takes a stream of characters, and parses it into structured data.

        The Parser would have methods that allow you to provide the input, that is, the entity to be processed. It might be a file, or it might be a string...

        The Parser would parse the input into records (which could be the same as lines), and each record into fields. You would allow the user to specify some criteria that define how the records are extracted, and then how to separate each record into fields. Those criteria might be regexps.

        For example, if we look at the sample data you gave, you might want to make each line into a record, and within each record, the part before ':' is one field and the part after is another.

        The Parser should also have methods that allow you to fetch records and fields, which would be used by the Filter.

      2. The Filter would accept the structured data that comes out of the Parser.

        You would allow the user to define the criteria for what is thrown away and what can pass through the Filter. Again, regexps might be a good fit here.

        The Filter does not modify the structure of the input data, but the number of output records could be less than the number of input records if some records are thrown away by the Filter.

      3. The Formatter would take the structured data and format it back into a stream. Of course, that stream is formatted and well presented.

        A good way is to allow the user to define a callback function, so that the callback formats the records rather than your module; your module might still provide a default format method, used if the user does not provide one.
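
A minimal sketch of the Formatter idea described above: the caller may pass a callback, and a default formatter is used otherwise. The function names and the representation of a record as a list of [name, value] pairs are assumptions made for this illustration, not part of any existing module.

```perl
use strict;
use warnings;

# Default formatter: one "name : value" line per field, in the order given.
sub default_format {
    my ($record) = @_;
    return join '', map { "$_->[0] : $_->[1]\n" } @$record;
}

# Format a record with the caller's callback if one is supplied,
# otherwise fall back to the default.
sub format_record {
    my ($record, $callback) = @_;
    $callback ||= \&default_format;
    return $callback->($record);
}

my $record = [ [ type => 'UR' ], [ owner => 'CONN' ] ];

print format_record($record);                   # default layout
print format_record($record, sub {              # caller-supplied layout
    join(',', map { $_->[1] } @{ $_[0] }) . "\n";
});
```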

      I am thinking it would be really nice if you found a way to wrap those well-known HTML and XML parsers and make them available to your Filter.

      One thing you may want to do is have a generic Filter class as the root, with some generic methods defined. Based on this, you can then have more specific Filters; for example, you may have one filter that understands the output from a certain XML parser.

      You said that you didn't want someone to do it for you, but only wanted some ideas. The fact is nobody can do this for you ;-), quick and good.

        Spot on! - except the filter (or at least I think).

        I do not think (I could be wrong here) that the filter method should be part of the module. As in:

      • Stream of data is turned into object by module
      • Object is tested by program and possibly rejected
      • Object is possibly manipulated by program
      • Object is converted back to stream by module

        This allows the greatest flexibility over the filtration process, as I do not know all the ways it is currently being used, let alone all the ways that it might be filtered in the future.

        I really like the idea of having a default format method, but allowing it to be dynamic.

        This has really given me something to think about - would you mind critiquing some very bad code as soon as I get started? I have never built an object before, so I know my first attempt will be bad. If not - that is ok too.

        Thanks again and cheers - L~R

          ;-) A lot of the time, you will see more than one design fly, and each of them is good. There is no black-and-white answer, and this is why computer science is both science and art.

          I agree that you can start with the filter as part of your program instead of a separate module; later, if you see that the functionality needs to be reused, abstract/extract a class out of your existing code.

          Traditional software engineering requires you to have everything laid out at the beginning, in the design phase, and there is only one design phase. Modern software engineering allows you to create your software cycle by cycle; each cycle is a whole traditional software life cycle and has its own design phase. In each new cycle, new functionality is added, and the design is modified in a constructive way.

          This change of methodology is mainly because:

          1. People found that there is no way the traditional methodology would work for big systems/projects. It is simply impossible for people to get everything straight and right, once and forever.
          2. From a business view, companies sometimes want to be first to market. They have to prototype things and quickly make their products available, and worry about more functions later.
          For sure, I would like to be one of the people who does code review for you. By doing that, we can learn from each other.
    Re: My first package - need help getting started
    by zengargoyle (Deacon) on Feb 27, 2003 at 04:24 UTC

      yes, an object for your chunk-o-data. but if your stream-o-data isn't likely to change i would say no object for the parser.

      just have your object's creation method take a whole chunk-o-data.

      package ChunkOData;

      sub from {
          my ($class, $chunk) = @_;
          $chunk =~ s/\n\t//g; # continuation lines are easy
          my %self;
          # parse $chunk like you already know how
          # shove it into a hash
          return bless \%self, $class;
      }

      # write some accessors
      # write some common useful junk

      package main;

      local $/ = '';
      while (my $chunk_text = <>) {
          my $chunk = ChunkOData->from($chunk_text);
          next unless $chunk->type eq 'UR';
          $chunk->owner('me');
          if ($chunk->is_a_certain_type) {
              $chunk->do_some_standard_thing;
              $chunk->do_something_else($with_my_info);
              subroutines_are_good($chunk);
          }
          $chunk->print;
      }

      if your stream-o-data is blank-line separated (or other $/ -able format) this is a simple way to get started.
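
One way the "parse it and shove it into a hash" step might be filled in is sketched below. Everything beyond the from constructor is an assumption: the generic field accessor (used because names like "Full Name" are not valid method names), the text method, and the choice to remember field order in an array. It also assumes field names are unique within a record and does not re-fold long lines on output.

```perl
package ChunkOData;
use strict;
use warnings;

sub from {
    my ($class, $chunk) = @_;
    $chunk =~ s/\n\t//g;                        # join continuation lines
    my $self = bless { order => [], value => {} }, $class;
    for my $line (split /\n/, $chunk) {
        # Split on the first colon only; values may contain colons.
        my ($name, $value) = $line =~ /^\s*(.*?)\s*:\s?(.*)$/ or next;
        push @{ $self->{order} }, $name unless exists $self->{value}{$name};
        $self->{value}{$name} = $value;
    }
    return $self;
}

# One generic get/set accessor for any field name.
sub field {
    my ($self, $name, @new) = @_;
    if (@new) {
        push @{ $self->{order} }, $name unless exists $self->{value}{$name};
        $self->{value}{$name} = $new[0];
    }
    return $self->{value}{$name};
}

# Write the record back out in the original field order.
sub text {
    my ($self) = @_;
    return join '', map { "$_ : $self->{value}{$_}\n" } @{ $self->{order} };
}

package main;

my $chunk = ChunkOData->from(
    "key : /C=US/S=Reg\n\tion/\ntype : UR\nTel : 999-555-1212\n"
);
$chunk->field(Tel => '(123) 456-7890');
print $chunk->text if $chunk->field('type') eq 'UR';
```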

      you might also use one of the Order-keeping Hash modules from the CPAN in an object for your key field. then you could do something like:

      my $key = $chunk->key;
      next unless $key->{OU} eq 'VALUE1';
      my $otherkey = $chunk->key_as_string; # X=foo/Y=bar/..
        This is a great start - but I want to make sure I grok it before I try to use it, so I may have more questions.

        Thanks a million!

        Cheers - L~R

    Re: My first package - need help getting started
    by toma (Vicar) on Feb 27, 2003 at 04:29 UTC
      Your data appears to be LDAP data. I searched for LDAP on cpan and found 179 modules.

      Probably you don't need to write any new objects if you can use a few of these modules.

      Your LDAP data appears to be in LDIF format, which is covered in RFC 2849. There is Net::LDAP::LDIF which may do exactly what you need, which is to turn LDIF text into a perl LDAP object.

      It should work perfectly the first time! - toma

        Thanks, but no dice. It is a flat file export of a very proprietary database that has no public APIs. I will take a look at your references to see if they provide any insight into my own dilemma though.

        Cheers - L~R

          All I had to do was remove the whitespace before the : characters and change the 'key:' field to 'dn:'. Instant perl objects!

          This code is almost identical to the code in the synopsis for Net::LDAP::LDIF. I just added a call to Data::Dumper to print the object and its structure.

          use strict;
          use warnings;
          use diagnostics;
          use Data::Dumper;
          use Net::LDAP::LDIF;

          my $ldif = Net::LDAP::LDIF->new( "file.ldif", "r", onerror => 'undef' );
          while ( not $ldif->eof() ) {
              my $entry = $ldif->read_entry();
              if ( $ldif->error() ) {
                  print "Error msg: ", $ldif->error(), "\n";
                  print "Error lines:\n", $ldif->error_lines(), "\n";
              }
              else {
                  print Dumper($entry); # dump the parsed entry, not the reader
              }
          }
          $ldif->done();
          Here is the modified input file:
          dn: /C=US/A=BOGUS/P=ABC+DEF/O=CONN/OU=VALUE1/S=Region/G=Limbic/I=_/
          type: UR
          flags: DIRM ADMINM ADIC ADIM
          Alias-2: wL_Region
          Alias-3: Limbic Region
          Alias-4:
          Alias-5: Limbic._.Region
          Alias-6:
          Alias-7: ORG
          Alias-8: CMP
          Alias-10: O=ORG/OU=Some Big Division/CN=Limbic _. Region
          Alias-11: Region
          Alias-12: Region, Limbic _
          Alias-14: Limbic _. Region at WT-CONN
          Alias-15: ex:/o=BLANK/ou=ORG/cn=Recipients/cn=Mailboxes/cn=LRegion
          Alias-16: WT:Limbic_Region
          Alias-17:
          Alias-18: /o=A.B.C.D./ou=Vermont Ave/cn=Recipients/cn=wt/cn=Limbic_Region
          Full Name: Region, Limbic _.
          Post Office: WT-CONN
          Description: 999-555-1212 Some Big Company
          Tel: 999-555-1212
          Dept: Some Big Division
          Location: EMWS
          Address: 123 nowhere street
          City: everywhere
          State: MD
          Zip Code: 20874
          Building: BLAH
          Building-Code: MD-ABC
          owner: CONN
          It should work perfectly the first time! - toma
          That sample input looks almost exactly like LDAP. It may be an offshoot of DAP. Net::LDAP can almost access that data; it just needs a little more coaxing. Well, ok, not Net::LDAP, but the LDIF modules included with it. I really think this route should be investigated thoroughly.
    Re: My first package - need help getting started
    by jonadab (Parson) on Feb 27, 2003 at 09:14 UTC

      I have a couple of questions and, depending on your answers, a suggestion for how to simplify the problem.

      First, you said that the only thing guaranteed to be unique was the key, but you were talking about uniqueness among all the records. In your example, the field names are all unique within the record. Is that the case for every record? If so, it seems to me that a record can be conveniently represented as a hash.

      Second, it sounds to me from the description, though you don't really expressly say this, that you generally only need to look at one record at a time.

      If I'm understanding right here, then creating an object per se may be an unnecessary complication. It sounds to me like all you need is two functions: one that takes an open filehandle (as a glob maybe), reads off the next record, and returns a reference to a hash, and one that takes a reference to a hash and returns a string. Depending on what you need to do, another routine or several might be in order for testing records (e.g., a routine that takes a hashref and a string and returns the number of Alias fields in the hash whose values match the string).

      I know it's heresy to some to suggest not using OO where it's possible to use OO, but it just seems unnecessary here, to me.

      The only thing that makes me think I might be wrong, and that OO might in fact be a Good Idea, is that you didn't show what delimits records in the files you're reading. If there's no delimiter, then you are going to be reading until you get the key for the next record, which you then have to save for when you read that record. It is of course possible to do this without real OO, but it's awkward, since it involves a persistent variable (the one-line buffer) that needs to be associated with the specific file in question. If you never have more than one of these files open at the same time you could get by with a magic global ($main::MY_DB_PARSING_PERSISTENT_LINE_BUFFER or whatnot), but that's a kludge, and if you ever need to work through more than one of these files at the same time it will break. It is possible to get around that too, by using the filehandle as a key into a magic global hash, but now we're doing something arguably almost as complex as OO, so I'm not sure this really saves anything.

      But it is an option to consider. If your records are delimited by some magic marker in the files (e.g., a blank line), then this problem goes away, and you can just have a couple of routines, as I said.
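
For the blank-line-delimited case, the "couple of routines" described above might be sketched as follows. The function names are made up for illustration, and the sketch assumes field names are unique within a record, as discussed in the first question. A plain hash does not remember field order, so the output routine sorts by name.

```perl
use strict;
use warnings;

# Read the next blank-line-delimited record from an open filehandle and
# return it as a hashref, or nothing at end of file.
sub read_record {
    my ($fh) = @_;
    local $/ = '';                              # paragraph mode
    my $text = <$fh>;
    return unless defined $text;
    $text =~ s/\n\t//g;                         # join continuation lines
    my %record;
    for my $line (split /\n/, $text) {
        my ($name, $value) = $line =~ /^\s*(.*?)\s*:\s?(.*)$/ or next;
        $record{$name} = $value;
    }
    return \%record;
}

# Turn a hashref back into text, followed by a blank line.
sub record_to_string {
    my ($record) = @_;
    return join('', map { "$_ : $record->{$_}\n" } sort keys %$record) . "\n";
}

# Count the Alias fields whose values contain the given string.
sub count_alias_matches {
    my ($record, $string) = @_;
    return scalar grep {
        /^Alias-/ && index($record->{$_}, $string) >= 0
    } keys %$record;
}

# Demo on an in-memory filehandle with two made-up records.
my $data = "key : /A=1/\nAlias-3 : Limbic Region\nAlias-11 : Region\n\n"
         . "key : /A=2/\nAlias-3 : Other\n";
open my $fh, '<', \$data or die $!;
while (my $record = read_record($fh)) {
    print $record->{key}, ': ', count_alias_matches($record, 'Region'), "\n";
}
```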

      for(unpack("C*",'GGGG?GGGG?O__\?WccW?{GCw?Wcc{?Wcc~?Wcc{?~cc' .'W?')){$j=$_-63;++$a;for$p(0..7){$h[$p][$a]=$j%2;$j/=2}}for$ p(0..7){for$a(1..45){$_=($h[$p-1][$a])?'#':' ';print}print$/}
    Re: My first package - need help getting started
    by tachyon (Chancellor) on Feb 27, 2003 at 10:54 UTC
    Re: My first package - need help getting started
    by zengargoyle (Deacon) on Feb 27, 2003 at 17:05 UTC

      another non-OO way of doing things popped into my head. there's a module for processing NetFlow records ( module CFlow out of the flow-tools package, not on CPAN ) that does things like this:

      sub match_func {
          return 0 unless $bytes > 5000;
          return 0 unless $src_port == 80;
          # do_something with matched
          return 1;
      }

      CFlow::loop( \&match_func, $filehandle );
      print "matched $CFlow::match_count records\n";

      since you generally work with a single record, if the fields of the record are unique... forget all of the OO stuff and use globals. the 'loop' routine takes a coderef to be run after each record is parsed (and shoved into the global variables) and a filehandle ( if filehandle is undef, read STDIN, if filehandle is a string, open and use that file.). the coderef returns 0 if the record wasn't interesting, else it does whatever and returns 1 (so the module can keep track of how many records matched).

      while not OO, it does do an excellent job of hiding the details from the user and eliminates all of the dereferencing ( $chunk->type() just becomes $type), which makes it easy to write quick one-off scripts.

      sub fix_building {
          return 0 unless $building eq 'FOO';
          $building = 'BAR';   # assignment, not comparison
          print_rec();
          return 1;
      }

      DBParserThingy::loop( \&fix_building );
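
A rough sketch of what such a loop routine could look like on the module side. The package name RecordLoop is hypothetical, and the sketch departs from the description above in one respect: because field names such as "Full Name" and "Building-Code" are not valid variable names, the current record is exposed through a single package hash (%RecordLoop::field) rather than one global scalar per field. Records are assumed to be blank-line separated.

```perl
package RecordLoop;    # hypothetical module name
use strict;
use warnings;

our (%field, $match_count);

# Call $callback once per record read from $fh (STDIN if undef,
# opened for reading if given a file name). Returns the match count.
sub loop {
    my ($callback, $fh) = @_;
    if (!defined $fh) {
        $fh = \*STDIN;
    }
    elsif (!ref $fh && ref \$fh ne 'GLOB') {
        open my $in, '<', $fh or die "Cannot open $fh: $!";
        $fh = $in;
    }
    local $/ = '';                              # paragraph mode
    $match_count = 0;
    while (my $text = <$fh>) {
        $text =~ s/\n\t//g;                     # join continuation lines
        %field = ();
        for my $line (split /\n/, $text) {
            my ($name, $value) = $line =~ /^\s*(.*?)\s*:\s?(.*)$/ or next;
            $field{$name} = $value;
        }
        $match_count++ if $callback->();
    }
    return $match_count;
}

package main;

# Demo on an in-memory filehandle with two made-up records.
my $data = "key : /A=1/\nBuilding : FOO\n\nkey : /A=2/\nBuilding : BAZ\n";
open my $fh, '<', \$data or die $!;

sub fix_building {
    return 0 unless $RecordLoop::field{Building} eq 'FOO';
    $RecordLoop::field{Building} = 'BAR';
    print "$_ : $RecordLoop::field{$_}\n" for sort keys %RecordLoop::field;
    return 1;
}

RecordLoop::loop(\&fix_building, $fh);
print "matched $RecordLoop::match_count records\n";
```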

    Node Type: perlquestion [id://238963]
    Approved by Paladin
    Front-paged by Enlil