opposite of format?


laziness, impatience, and hubris
	PerlMonks

opposite of format?

by xChauncey (Scribe)

on Jun 27, 2003 at 22:08 UTC ( [id://269760]=perlquestion: print w/replies, xml )

Need Help??

xChauncey has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to move data from a large number of text files into a database. Each text file has the same format, with fieldnames and values, 21 lines long, where each line is unique (no nice columns or anything like that), for a total of about 80 pieces of data that I have to sift out and write to the db.

So here is the question...Is there some kind of function like the opposite of format that could just slurp up the file and assign all of the values to the appropriate variables? Or should I use a bunch of unpacks and just undef the data that I don't need (the fieldnames)?
Am I totally in the weeds here?

I'm pretty new to Perl (and programming), and don't want to accidentially write hundreds of lines of code to do something that could be done in much less time.

Feeling like I may have bit off more than I can chew,
xChauncey

Comment on opposite of format?

Replies are listed 'Best First'.

Re: opposite of format?
by dataDrone (Acolyte) on Jun 27, 2003 at 22:31 UTC

I think you're on the right track with unpack. Remember that you can use x in the templet to throw characters away so no need to undef them later. A couple questions to clarify, is you file arranged with one field name/value pair per line, and one file per record? Do the field-name / value pairs have a defined separator character or string? Can you show use an example of a file ( or the first couple lines or so ). Perhaps loading the file into an array (if the files are small in size) and then using split to break apart the name value pairs. Don't forget to filter the data for illegal characters that would cause problems when loading into the database or by sure to quote them properly.

Re: Re: opposite of format?

by xChauncey (Scribe) on Jun 27, 2003 at 23:26 UTC

LastName, FirstName  Home <hm phone>   field <value>
Address1             Work <wk phone>   field <value>
Address2                               field <value>
City, State, zip     field <value> field <value>           

field <value>      field <value>        field <value>
field <value>      field <value>        field <value>
field <value>      field <value>        field <value>

...etc...
[download]

Re: Re: Re: opposite of format?

by BrowserUk (Patriarch) on Jun 28, 2003 at 04:29 UTC

If the lines in your file are all the same length (even the blank ones) and each field is also fixed length then you can use the fact that unpack formats ignore whitespace in conjuction with the @nnn absolute position specifier to lay out the format in a manner that is relatively easy to write and later read.

By fixing the start of each line using @nnn and then using relative lengths and xxx to position stuff within each line greatly simplifies working out your format. In the example below, I've left the original 'specification' lines interleaved for readability and then used a regex to strip them out before using the format. There are several ways of doing this, and the much-maligned HEREDOC might actually be a better way. You could also omit them if you preferred.

Using qw// in conjunction with a hash slice to name the individual fields is a nice way of laying out the field names in a clear and concise manner. You can the assign a ref to the hash into another hash indexed by personnel id (or lastname etc.), giving you a HoH which is probably the best structure for this kind of work.

my $unpack_format = 
'
@0  A8      xxA9       xxxxxxxxA8      xxxxxxxxxxxa5
    LastName, FirstName  Home <hm phone>   field <value>
@53 A19                xxxxxxxxA8      xxxxxxxxxxxa5    
    Address1             Work <wk phone>   field <value>
@105A38                                   xxxxxxxxA5
    Address2                               field <value>
@157A4  xxA5   xxA8     xxxxxxxxA5   xxxxxxxxxA5    
    City, State, zip     field <value> field <value>    

@261xxxxxxxA5   xxxxxxxxxxxxxxA5   xxxxxxxxxxxxxxxxA5
    field <value>      field <value>        field <value>
@313xxxxxxxA5   xxxxxxxxxxxxxxA5   xxxxxxxxxxxxxxxxA5
    field <value>      field <value>        field <value>
@365xxxxxxxA5   xxxxxxxxxxxxxxA5   xxxxxxxxxxxxxxxxA5
    field <value>      field <value>        field <value>
';

## remove the "documentation" lines.
$unpack_format = s[^\s.*?\n][]mg; 

my %personnel;

# Assuming that the Personnel number is an integral part of the filena
+me
 
for my $filename ( <PID*.rec> ) {
    my %record;
    @record{ qw[
        lastname firstname homephone field1
        address1 workphone field2
        address2 field3
        city state zip field4 field5 field6
        field7  field8  field9
        field10 field10 field11
        field12 field13 field14
    ] } = unpack $unpack_format, slurp_file( $filename );
    
# Slurp file could be a function or do{ local (*ARGV, $/) = $filename,
+ <> };
# Search for [Juerd]s "Cheap idioms" node for details
    
    $personnel{ $filename =~ m[PID(\d+)\.rec] } = \%record;
}
[download]

However, I strongly suspect that lines like

City, State, zip field <value> field <value>

nn your specification don't indicate that City is a 4-byte field or State 5-bytes, but that within the overall 21-byte field allocated to them, the parts are variable length seperated by commas.

unpack has no mechanism for dealing with this variable-width-within-a-fixed-length-field type of data. You would need to unpack the 3 as a single field and then use split or a regex to subdivide it later, which is a pain.

In this case, I would probably go for a big-regex, though that doesn't mean it has to be complicated or hard to write. And using the /x modifier and embedded comments, it can become self documenting. Using your record specification as a starting point, and then interleaving the elements of the regex lined up as best you can with those lines makes for a reasonably readable layout. The need to add \s(*|+) between the elements when using /x has the effect of making the regex look 'noisy', but it's a trade off against the self-documentation provided by the embedded comments and alignment. Overall I find this quite readable, but YMMV.

my $re_record = qr[

#LastName, FirstName  Home <hm phone>   field <value>
([^,]+)  ,([^ ]+)   \sHome <([^>]+) >\s+field\s<([^>]+)> \s*\n

#Address1             Work <wk phone>   field\s<value>
(.*)                  Work <([^>]+) >\s+field\s<([^>]+)> \s*\n

#Address2                               field <value>
(.*)                                    field\s<([^>]+)> \s*\n

#City, State, zip     field <value> field <value>          
([^,]+),([^,]+),(.*)  field\s<([^>]+)>\sfield\s<([^>]+)> \s*\n

\s*                                                         \n    

#field <value>      field <value>        field <value>
field\s<([^>]+)> \s+field\s<([^>]+)> \s+ field\s<([^>]+)> \s*\n

#field <value>      field <value>        field <value>
field\s<([^>]+)> \s+field\s<([^>]+)> \s+ field\s<([^>]+)> \s*\n

#field <value>      field <value>        field <value>
field\s<([^>]+)> \s+field\s<([^>]+)> \s+ field\s<([^>]+)> \s*\n

]x;

my $record = slurp_file( $filename );

my %record; @record{ qw[ .... ] } = ( $record =~ $re_record );

$personnel{ .... } = \%record;
[download]

You could use a repeat count for the last 3-lines/9-fields, but if the data has a fixed number of fields, I think that using cut&paste makes things clearer in this instance.

One tip: If you decide to go the 'big regex' route, start by commenting out everything but the first line, capture to an array and print the results. Once you have that capturing the right things, uncomment the second and repeat. Simple advice, but it took me a while to work it out.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

[reply]
[d/l]
[select]

Re: Re: Re: Re: opposite of format?

by xChauncey (Scribe) on Jun 29, 2003 at 06:58 UTC

Re: Re: Re: opposite of format?

by dash2 (Hermit) on Jun 28, 2003 at 10:42 UTC

my %fields = split(/\s+/, $line);
[download]

for each line. For the earlier bits, if you know that whitespace will be in the right place, you could do:

my ($last_name, $first_name, undef, $home_phone, %fields) = split(/\s+
+/, $line);
[download]

A massive flamewar beneath your chosen depth has not been shown here

[reply]
[d/l]
[select]

Back to Seekers of Perl Wisdom

Log In^?

Domain Nodelet^?

www.com | www.net | www.org

Node Status^?

node history
Node Type: perlquestion [id://269760]
Approved by BazB
help

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others chilling in the Monastery: (5)

As of 2024-04-23 13:24 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found