PerlMonks  

Re: Reading tab/whitespace delimited text file

by BrowserUk (Pope)
on Oct 21, 2012 at 20:46 UTC (#1000244)


in reply to Reading tab/whitespace delimited text file

A space delimited file with spaces as fillers and absent fields? If so, you've got a nasty problem on your hands.

If on the other hand, the fields are tab separated and space filled, that is a much simpler proposition. (But that's not what I see when I c&p your sample.)

Is that the entire file or just one section? If the latter, you really need to show us at least 2 or 3 sections so we can see what separates them.
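If the fields do turn out to be tab separated and space filled, a minimal sketch of that simpler case might look like the following. The sample lines and the `split_tabbed` helper are hypothetical, made up for illustration; they are not taken from the original file.

```perl
use strict;
use warnings;

# Hypothetical sample (not from the OP's file): a header line and a values
# line, tab separated, space padded, with the TROFFSETN value absent.
my $header = "TRHYST \tTROFFSETP\tTROFFSETN\tAWOFFSET";
my $values = "2      \t0        \t         \t5       ";

# Hypothetical helper: split one line on tabs and strip the space padding.
sub split_tabbed {
    my( $line ) = @_;
    # A limit of -1 keeps trailing empty fields instead of discarding them
    my @fields = split /\t/, $line, -1;
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

my %row;
@row{ split_tabbed( $header ) } = split_tabbed( $values );

# Absent fields come back as empty strings
printf "%-10s => '%s'\n", $_, $row{ $_ } for sort keys %row;
```

Because the tab marks every field boundary, an absent value is simply an empty string between two tabs, and no width bookkeeping is needed.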


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^2: Reading tab/whitespace delimited text file
by reaper9187 (Scribe) on Oct 22, 2012 at 05:31 UTC
    Thanks a lot for helping me.
    As I said earlier, the above code is only one section of the entire file. There are multiple such sections:

    SCTYPE SSDESDL QDESDL LCOMPDL QCOMPDL UL 90 30 5 55 BSPWRMINP BSPWRMINN 20
    . .
    CELL SCTYPE LWACH1A ACTIVE CHTYPE CHRATE SPV LVA ACL NCH YES BCCH 1 A3 1 SDCCH 0 A3 15 TCH FR 1 0 A3 13 TCH FR 2 0 A3 13 TCH FR 3 0 A3 13 TCH HR 1 0 A3 26 TCH HR 3 0 A3 26 CBCH 0 A3 1
    . .
    CELL LOL LOLHYST TAOL TAOLHYST LUC082A 120 3 61 0 DTCBP DTCBN DTCBHYST NDIST NNCELLS 4 2 10 1
    . .
    ACTIVE CHTYPE CHRATE SPV LVA ACL NCH YES BCCH 0 A3 0 SDCCH 0 A3 0 TCH FR 1 16 A3 32 TCH FR 2 0 A3 32


    Again, the file is pretty large and I cannot list all of the formats. I just need an idea of how to do it; I can then extend it over the entire file.

      Yuck! I thought (hoped) that this type of file format -- mixed, fixed-format records -- had died long ago; but they seem to keep reinventing it :)

      For your first example, the trick is to define a regex that will match the fields in the header line:

      my $reHeader = '(\b\w+\s*)?' x 10; ## Adjust the repeat value to cover the maximum no of fields

      and use that to construct an unpack template to parse the following values line.

      This is not 'nice code', but it demonstrates the technique:

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];

      my $reHeader = '(\b\w+\s*)?' x 10;

      my %data;
      until( eof( DATA ) ) {
          ## Read the header line and remove the newline
          chomp( my $header = <DATA> );

          ## parse the fields using the regex, ignoring undefined fields
          my @keys = grep defined, $header =~ $reHeader;

          ## trim the trailing whitespace from the keys
          s[\s*$][] for @keys;

          ## Use the capture position arrays (@- & @+)
          ## to work out the field widths and construct a template
          my $tmpl = join ' ', map{
              defined( $-[$_] ) ? do{ my $n = $+[$_] - $-[$_]; "a$n" } : ()
          } 1 .. $#+;

          ## read and chomp the values line
          chomp( my $vals = <DATA> );

          ## Extract the value fields using the template
          my @vals = unpack $tmpl, $vals;

          ## trim leading & trailing whitespace
          s[^\s*][],s[\s*$][] for @vals;

          ## Add the key/value pairs to the hash
          @data{ @keys } = @vals;

          ## discard the blank line between the grouped pairs of lines.
          <DATA>;
      }

      pp \%data; ## display the hash constructed

      __DATA__
      TRHYST TROFFSETP TROFFSETN AWOFFSET BQOFFSET
      2      0                   5        3

      HIHYST LOHYST OFFSETP OFFSETN BQOFFSETAFR
      5      3      0               3

      CELLR    DIR     CAND  CS
      LUC083A  MUTUAL  BOTH  NO

      Outputs:

      C:\test>junk79
      {
        AWOFFSET    => 5,
        BQOFFSET    => 3,
        BQOFFSETAFR => 3,
        CAND        => "BOTH",
        CELLR       => "LUC083A",
        CS          => "NO",
        DIR         => "MUTUAL",
        HIHYST      => 5,
        LOHYST      => 3,
        OFFSETN     => "",
        OFFSETP     => 0,
        TRHYST      => 2,
        TROFFSETN   => "",
        TROFFSETP   => 0,
      }

      Extending that to apply it to all your other sections will require a little ingenuity and a lot of painstaking testing.

      I do hope for your sake that the number and ordering of the different sections is well-defined, else you've got an even worse task on your hands.

      Note: This assumes that field names do not contain spaces. If they do, you are in shit street.



        Hi,
        I can't thank you enough for the help. I know it looks pretty messy, but the good part is I don't need to read every section (thank god for that!). I would have been in deep shit otherwise. Anyway, thanks for the heads up.
        Why is the code not able to read the following?
        CELL LUC325C CELLR DIR CAND CS LUC325B MUTUAL BOTH NO KHYST KOFFSETP KOFFSETN LHYST LOFFSETP LOFFSETN 3 0 3 0 TRHYST TROFFSETP TROFFSETN AWOFFSET BQOFFSET 2 0 5 3 HIHYST LOHYST OFFSETP OFFSETN BQOFFSETAFR 5 3 0 3
        The value for the CELL key should be LUC325C, but I keep getting LUC3 and that's it. Help appreciated!
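One likely explanation, assuming the posted code is being run unchanged: the header line `CELL` has nothing after it, so the regex captures only 4 characters, the template becomes `a4`, and `unpack` cuts `LUC325C` down to `LUC3`. The same would happen to the last field of any group whose value is wider than its name. A minimal sketch of one workaround follows; `template_for` is a hypothetical helper, not something from the thread, and it simply swaps the last field's width for `a*` so that field takes the rest of the line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build the unpack template from a
# header line as in the posted code, but let the last field run to end of line.
sub template_for {
    my( $header ) = @_;
    my $reHeader = '(\b\w+\s*)?' x 10;
    $header =~ $reHeader;

    # Read @- and @+ straight away, before any other match can reset them
    my @parts;
    for my $i ( 1 .. $#+ ) {
        next unless defined $-[$i];
        push @parts, 'a' . ( $+[$i] - $-[$i] );
    }
    return '' unless @parts;

    $parts[-1] = 'a*';    # assumption: the last value may be wider than its name
    return join ' ', @parts;
}

print unpack( 'a4', 'LUC325C' ), "\n";                     # prints LUC3
print unpack( template_for( 'CELL' ), 'LUC325C' ), "\n";   # prints LUC325C
```

The values would still need the whitespace trim from the original code, since `a*` keeps any trailing padding.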
