Re^3: Reading tab/whitespace delimited text file

by BrowserUk (Pope)
on Oct 22, 2012 at 06:31 UTC (#1000265)


in reply to Re^2: Reading tab/whitespace delimited text file
in thread Reading tab/whitespace delimited text file

Yuck! I thought (hoped) that this type of file format -- mixed, fixed-format records -- had died long ago; but they seem to keep reinventing it :)

For your first example, the trick is to define a regex that will match the fields in the header line:

my $reHeader = '(\b\w+\s*)?' x 10; ## Adjust the repeat value to cover the maximum no of fields

and use that to construct an unpack template to parse the following values line.
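As a minimal sketch of just that step: the helper name header_template and the sample header/values pair below are made up for illustration, and the column spacing is assumed. It turns the capture offsets in @- and @+ into field widths and joins them into an unpack template.

```perl
use strict;
use warnings;

my $reHeader = '(\b\w+\s*)?' x 10;

# Hypothetical helper: build an unpack template from the header field widths.
sub header_template {
    my ($header) = @_;
    return '' unless $header =~ $reHeader;

    # @- and @+ hold the start and end offsets of each capture group;
    # groups that took no part in the match have undefined offsets.
    my @fields;
    for my $i ( 1 .. $#+ ) {
        next unless defined $-[$i];
        my $width = $+[$i] - $-[$i];
        push @fields, "a$width";
    }
    return join ' ', @fields;
}

# Assumed sample pair: each value starts in the same column as its header.
my $tmpl = header_template('CELLR    DIR     CAND  CS');
my @vals = unpack $tmpl, 'LUC083A  MUTUAL  BOTH  NO';
s/\s+$// for @vals;    # trim the padding that belongs to each field

print "$tmpl\n";
print join( '|', @vals ), "\n";
```

Each width includes the whitespace that follows the field name, so consecutive fields tile the whole line with no gaps.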

This is not 'nice code', but it demonstrates the technique:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my $reHeader = '(\b\w+\s*)?' x 10;

my %data;
until( eof( DATA ) ) {
    ## Read the header line and remove the newline
    chomp( my $header = <DATA> );

    ## parse the fields using the regex, ignoring undefined fields
    my @keys = grep defined, $header =~ $reHeader;

    ## trim the trailing whitespace from the keys
    s[\s*$][] for @keys;

    ## Use the capture position arrays (@- & @+)
    ## to work out the field widths and construct a template
    my $tmpl = join ' ', map{
        defined( $-[$_] ) ? do{ my $n = $+[$_] - $-[$_]; "a$n" } : ()
    } 1 .. $#+;

    ## read and chomp the values line
    chomp( my $vals = <DATA> );

    ## Extract the value fields using the template
    my @vals = unpack $tmpl, $vals;

    ## trim leading & trailing whitespace
    s[^\s*][],s[\s*$][] for @vals;

    ## Add the key/value pairs to the hash
    @data{ @keys } = @vals;

    ## discard the blank line between the grouped pairs of lines.
    <DATA>;
}

pp \%data; ## display the hash constructed

__DATA__
TRHYST  TROFFSETP  TROFFSETN  AWOFFSET  BQOFFSET
2       0                     5         3

HIHYST  LOHYST  OFFSETP  OFFSETN  BQOFFSETAFR
5       3       0                 3

CELLR    DIR     CAND  CS
LUC083A  MUTUAL  BOTH  NO

Outputs:

C:\test>junk79
{
  AWOFFSET    => 5,
  BQOFFSET    => 3,
  BQOFFSETAFR => 3,
  CAND        => "BOTH",
  CELLR       => "LUC083A",
  CS          => "NO",
  DIR         => "MUTUAL",
  HIHYST      => 5,
  LOHYST      => 3,
  OFFSETN     => "",
  OFFSETP     => 0,
  TRHYST      => 2,
  TROFFSETN   => "",
  TROFFSETP   => 0,
}

Extending that to apply it to all your other sections will require a little ingenuity and a lot of painstaking testing.

I do hope for your sake that the number and ordering of the different sections is well-defined, else you've got an even worse task on your hands.

Note: This assumes that field names do not contain spaces. If they do, you are in shit street.
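To see why, here is a small sketch with a hypothetical header in which the first field is meant to be named "CELL NAME". The header regex has no way to tell the space inside the name from a column separator, so it reports three keys where two were intended:

```perl
use strict;
use warnings;

my $reHeader = '(\b\w+\s*)?' x 10;

# Hypothetical header: the first field name is meant to be "CELL NAME".
my $header = 'CELL NAME  DIR';

# Same key extraction as in the code above.
my @keys = grep defined, $header =~ $reHeader;
s[\s*$][] for @keys;

print scalar(@keys), " keys: ", join( '|', @keys ), "\n";
```

The field widths derived from that match would be wrong in the same way, so the values line would be split in the wrong places too.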


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^4: Reading tab/whitespace delimited text file
by reaper9187 (Scribe) on Oct 22, 2012 at 06:55 UTC
    Hi,
    I can't thank you enough for the help. I know it looks pretty messy, but the good part is that I don't need to read every section (thank god for that!). I would have been in deep shit otherwise. Anyway, thanks for the heads up.
Re^4: Reading tab/whitespace delimited text file
by reaper9187 (Scribe) on Nov 01, 2012 at 12:38 UTC
    Why is the code not able to read the following?
    CELL
    LUC325C

    CELLR    DIR     CAND  CS
    LUC325B  MUTUAL  BOTH  NO

    KHYST  KOFFSETP  KOFFSETN  LHYST  LOFFSETP  LOFFSETN
    3      0                   3      0

    TRHYST  TROFFSETP  TROFFSETN  AWOFFSET  BQOFFSET
    2       0                     5         3

    HIHYST  LOHYST  OFFSETP  OFFSETN  BQOFFSETAFR
    5       3       0                 3
    The value for the CELL key should be LUC325C, but I keep getting LUC3 and that's it. Help appreciated!

      Because, in this crazy format, the code relies on the width of each header field to determine the width of the corresponding value field.

      But, (uniquely) in the case of:

      CELL LUC325C

      The header field is shorter than the value field.

      And as that record pair was not a part of the sample you showed when you originally asked this question, the code does not cater for it.

      Adding this line to the code above will allow it to handle that part of the data format:

      $tmpl =~ s[\d+$][*];
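A minimal sketch of what that substitution does, assuming it is applied after the template has been built and before the unpack call. The template strings are the ones the header widths would produce for "CELL" and for a hypothetical four-field header:

```perl
use strict;
use warnings;

# Template derived from the one-field header "CELL" (width 4).
my $tmpl = 'a4';
my $vals = 'LUC325C';

# Without the fix, the value is cut to the width of the header.
my ($before) = unpack $tmpl, $vals;

# The fix: replace the count on the last field with '*',
# so the last field takes everything left on the line.
$tmpl =~ s[\d+$][*];
my ($after) = unpack $tmpl, $vals;

print "$before -> $after (template is now '$tmpl')\n";

# Only the final field is affected in a multi-field template.
my $multi = 'a9 a8 a6 a2';
$multi =~ s[\d+$][*];
print "$multi\n";
```

Because only the last count is replaced, earlier fields still rely on the header widths; a value wider than its header anywhere but the end of the line would still be cut.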


        Thank you so much for helping. I'm a newbie to Perl and figuring out everything as I go.
        The code works perfectly! However, while this works for the sample data, it behaves weirdly when I use it to parse the actual text file. This is what I get for the sample data:
        {
          AWOFFSET    => 5,
          BQOFFSET    => 3,
          BQOFFSETAFR => 3,
          CAND        => "BOTH",
          CELL        => "LUC325C",
          CELLR       => "LUC232A",
          CS          => "NO",
          DIR         => "MUTUAL",
          HIHYST      => 5,
          KHYST       => 3,
          KOFFSETN    => "",
          KOFFSETP    => 0,
          LHYST       => 3,
          LOFFSETN    => "",
          LOFFSETP    => 0,
          LOHYST      => 3,
          OFFSETN     => "",
          OFFSETP     => 0,
          TRHYST      => 2,
          TROFFSETN   => "",
          TROFFSETP   => 0,
        }
        Press any key to continue . . .

        And this is what I get when I run it on the text file:
        { CELL => "LUC325C" }
        { BOTH => "", LUC325B => "", MUTUAL => "", NO => "" }
        {}
        {
          BQOFFSETAFR => "",
          HIHYST => 5,
          LOHYST => "3 0",
          OFFSETN => "",
          OFFSETP => 3,
        }
        { BOTH => "", LUC116A => "", MUTUAL => "", NO => "" }
        {}
        {
          BQOFFSETAFR => "",
          HIHYST => 5,
          LOHYST => "3 0",
          OFFSETN => "",
          OFFSETP => 3,
        }
        { BOTH => "", LUC204A => "", MUTUAL => "", NO => "" }
        {}
        Press any key to continue . . .

        Any idea why this behaves differently for the same set of data?
