PerlMonks  

breaking a text file into a data structure -- best way?

by punkish (Priest)
on Apr 09, 2010 at 14:29 UTC (#833823=perlquestion)
punkish has asked for the wisdom of the Perl Monks concerning the following question:

Update0: My best buddy tells me this class of problems is called a "state machine." Googling for "Perl state machine" returns a bunch of hits that I am now in the process of digesting. In the meantime, I look forward to your help.

Update1: Seems like http://www.perl.com/pub/a/2004/09/23/fsms.html might have the answer for me.

I have a longish text file like the one below. The gutter annotation is not part of the text file; it is only there to aid my question.

    a> some random text
       ----------------
    b>
    b> a few random
    b> lines
    b>
    b> of more
    b> random
    b>
    b> text
       ****************
    c> some more
    c>
    c> random
    c> text
    c>
    a> some random text
       ----------------
    b>
    b> a few random
    b> lines
    b>
    b> of more
    b> random
    b>
    b> text
       ****************
    c> some more
    c>
    c> random
    c> text
    c>

I want to split the file into an array of hashes like so

    @foo = (
        {
            a => 'some random text',
            b => '
    a few random
    lines

    of more
    random

    text
    ',
            c => 'some more

    random
    text
    '
        },
        {
            a => 'some random text',
            b => '
    a few random
    lines

    of more
    random

    text
    ',
            c => 'some more

    random
    text
    '
        },
        .. and so on ..
    );

In other words, each hash is made up of the snippet of text starting from the line that is followed by '--------------' up to, but not including, the next line that is followed by '--------------'.

I have two questions. One, how do I do the above? I was hitting my head against a wall the entire day yesterday, so I come to you today. I have nothing to show you because everything I did was wrong. My approach was mostly to start from the beginning and go to the end, trying to keep flags on when one hash element began and when it ended, and so on. Which brings me to my second question.

What is the canonical design pattern for such a problem? I come across such problems all the time, and I always slow down in trying to solve them. A pattern that is visible to the eye becomes very difficult to program. Yesterday I had another such problem which I managed to solve, if I may say so myself, rather innovatively. The text file looked like so

    bri red grn blu
    0   0   0   0
    1   0   0   0
    2   0   0   0
    ..
    99  0   0   0
    100 0   255 255
    101 0   250 255
    102 0   246 255
    ..

The above had to be converted to

    CLASS
      EXPRESSION ([pixel] >= 242 AND [pixel] <= 242
      STYLE
        COLOR 200 72 127
      END
    END
    CLASS
      EXPRESSION ([pixel] >= 175 AND [pixel] <= 175
      STYLE
        COLOR 191 236 0
      END
    END
    ..

That is, group the brightness values by color triplets. After struggling with it for a while with the usual, line by line, flag as you go approach, I decided to turn the color triplets into keys of a hash. The problem was solved in a couple of lines, and elegantly. Here is the code for that

    while (<INFILE>) {
        # remove newline from end and leading whitespace,
        # then split the row on whitespace
        chomp;
        s/^\s+//;
        my @r = split /\s+/;

        # create a key in lut hash using rgb vals
        push @{ $lut{"$r[1].$r[2].$r[3]"} }, $r[0];
    }

    while (my ($k, $v) = each %lut) {
        $k =~ s/\./ /g;                  # replace . in hash key with space
        my @v = sort { $a <=> $b } @$v;  # sort the brightness values numerically
                                         # to get min/max values
        print OUTFILE "CLASS\n" .
                      "  EXPRESSION ([pixel] >= $v[0] AND [pixel] <= $v[$#v]\n" .
                      "  STYLE\n" .
                      "    COLOR $k\n" .
                      "  END\n" .
                      "END\n";
    }

I was able to solve the above because of the uniqueness requirement; otherwise it would have been the usual slog. So, is there a generic approach to this? And is there a way I can validate the output, that is, ensure that the output is what I really want, given very long input text files?

--

when small people start casting long shadows, it is time to go to bed

Re: breaking a text file into a data structure -- best way?
by sierpinski (Hermit) on Apr 09, 2010 at 14:52 UTC
    One of the many answers to #1 would be to:

    read a line - store it
    read the next line - compare it; if it matches your ----- or ****, then save the first line as your key.
    Read the next two lines.
    If either of them is your ----- or ****, then save the previous one as the next key, and the ones before it as the previous hash's values.

    Another possible solution:
    Start by reading the whole file into an array, one line per entry. Find the ----- lines, split one position before each of them, and use each resulting section to create a hash.
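
    The array-based idea above could be sketched roughly as follows. This is only an illustration: the helper name split_records is made up, and it assumes that separator lines consist solely of dashes or solely of asterisks and that no ordinary text line looks like one.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the file's lines and returns a list of hashes.
# Assumes separator lines are made only of '-' or only of '*'.
sub split_records {
    my @lines = @_;
    chomp @lines;

    # Indices of the dashed lines; each record starts one line earlier.
    my @starts = map { $_ - 1 } grep { $lines[$_] =~ /^-+$/ } 1 .. $#lines;

    my @records;
    for my $i (0 .. $#starts) {
        # A record runs up to the line before the next record's title.
        my $end   = $i < $#starts ? $starts[ $i + 1 ] - 1 : $#lines;
        my @chunk = @lines[ $starts[$i] .. $end ];

        my $title = shift @chunk;    # the line before the dashes
        shift @chunk;                # the dashes themselves

        # Everything before the starred line goes to b, the rest to c.
        my ( @b, @c );
        my $target = \@b;
        for my $line (@chunk) {
            if ( $line =~ /^\*+$/ ) { $target = \@c; next }
            push @$target, $line;
        }

        push @records, {
            a => $title,
            b => join( '', map { "$_\n" } @b ),
            c => join( '', map { "$_\n" } @c ),
        };
    }
    return @records;
}
```

    With the whole file slurped into @lines, my @foo = split_records(@lines); would give the array of hashes, at the cost of holding the entire file in memory.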

    It might not be the best way, but these are the first couple that come to mind.
Re: breaking a text file into a data structure -- best way?
by rubasov (Friar) on Apr 09, 2010 at 16:59 UTC
    Let me try to help:

    First analyze the structure of your input and name its parts. This input consists of lines; each line consists of three fields: a prefix, a separator and a text field. Consecutive prefixes form a block and consecutive blocks form a prefix alphabet.

    Second, answer this question: is the input parsable line by line, or do you have to look around (at a certain point) to decide what is what? The former typically results in more efficient programs (but it is not possible for all types of input); the latter is generally easier to code, but requires holding more of your input in memory. (I chose the line-by-line approach, storing only the previous prefix besides the current line.)

    Then constrain yourself to go through your input line by line and ask yourself: what are the states (or state transitions) that determine what I should do?

    1. at the start of a new (consecutive) prefix-block
    2. in the middle of a prefix-block
    3. prefix alphabet starting over

    How to map these states to relations between lines? By comparing the prefix of the current and the previous line.

    What is the tool to express these relations between the lines? Alphabetical comparison. The mapping is (cf. the previous listing):

    1. $prefix gt $prev_prefix
    2. $prefix eq $prev_prefix
    3. $prefix lt $prev_prefix

    What should I do at each state transition?

    1. add $prefix => $text to the current hash
    2. append $text to the current $hash{$prefix}
    3. push a new hash ref to your array: { $prefix => $text }

    Now try to write it again and if you're stuck, come back and look at this:
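
    A minimal sketch of the three transitions listed above, not necessarily the code originally posted here. It assumes every content line really starts with a one-letter prefix followed by '>' (which, as the follow-up below points out, is not true of the original poster's file); lines without such a prefix are simply skipped, and the name parse_prefixed is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turns lines such as "b> lines" into an array of hashes
# by comparing each line's prefix with the previous line's prefix.
sub parse_prefixed {
    my @lines = @_;
    my @records;
    my $prev_prefix = '';

    for my $line (@lines) {
        # Skip anything without a prefix (e.g. the separator lines).
        my ( $prefix, $text ) = $line =~ /^(\w)>\s?(.*)$/ or next;

        if ( !@records || $prefix lt $prev_prefix ) {
            # 3. prefix alphabet starting over: begin a new hash
            push @records, { $prefix => "$text\n" };
        }
        elsif ( $prefix eq $prev_prefix ) {
            # 2. in the middle of a prefix block: append
            $records[-1]{$prefix} .= "$text\n";
        }
        else {
            # 1. start of a new prefix block: add a key to the current hash
            $records[-1]{$prefix} = "$text\n";
        }
        $prev_prefix = $prefix;
    }
    return @records;
}
```

    The only state carried from one line to the next is $prev_prefix, which is what keeps this a strictly line-by-line parse.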

    Of course this is only one approach, but clarifying the concepts and thinking methodically about the mechanical way to solve a problem has always helped me.

    And in general: practice and practice more. Read books, read the code of others (not just glance over, but change them, understand them), read the problems of others and try to solve them without looking at the solution posted by others.

    Cheers
      Thanks for the response, but you misunderstood my task. The 'a>', 'b>', 'c>' are not really present in the text file. I included them as "line numbers" to illustrate where I wanted the text split up. In the specific case I presented, the text is split up at the line *before* the line that starts with '----------------'.

      In any case, I am curious about a general approach to such problems, and at first glance, it seems that a state machine approach would help me. However, I got stuck with that as well, especially since my splitting markers are not *in* the line where I want to split the text, but *after* the line on which I want to split.

      --

      when small people start casting long shadows, it is time to go to bed
        "but you misunderstood my task"
        Indeed. In the last few days I've been doing really stupid things, sorry.
Re: breaking a text file into a data structure -- best way?
by ikegami (Pope) on Apr 10, 2010 at 04:18 UTC
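    # Read the first header and discard the dashed line that follows it.
    my $hdr = <>;
    <>;

    my @part1;
    my @part2;
    my $part = \@part1;    # the part currently being filled

    while (<>) {
        if ($_ eq "----------------\n") {
            # The line just before the dashes is the next record's header.
            my $next_hdr = pop(@$part);
            process_rec($hdr, \@part1, \@part2);
            $hdr   = $next_hdr;
            @part1 = ();
            @part2 = ();
            $part  = \@part1;
        }
        elsif ($_ eq "****************\n") {
            $part = \@part2;
        }
        else {
            push @$part, $_;
        }
    }

    # Flush the last record.
    process_rec($hdr, \@part1, \@part2);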
Re: breaking a text file into a data structure -- best way?
by repellent (Priest) on Apr 10, 2010 at 05:56 UTC
    Here's using the until-eof-FILEHANDLE technique:
    my @foo;

    my $next_a = <DATA>;
    scalar(<DATA>);

    until (eof(DATA)) {
        my $a = $next_a;

        my @b;
        while (my $line = <DATA>) {
            last if $line =~ /^[*]{16}$/;
            push @b, $line;
        }

        my $found_next;
        my @c;
        while (my $line = <DATA>) {
            last if $found_next = ($line =~ /^-{16}$/);
            push @c, $line;
        }

        $next_a = pop(@c) if $found_next;

        push @foo, {
            a => $a,
            b => join("" => @b),
            c => join("" => @c),
        };
    }

    use Data::Dumper;
    print Dumper \@foo;

    __END__
    title 1
    ----------------
    a few random
    lines
    of more
    random
    text
    ****************
    some more
    random
    text
    title 2
    ----------------
    cow jumped over the moon
    ****************
    corn on the cob
    lobster thermidor
Re: breaking a text file into a data structure -- best way?
by rubasov (Friar) on Apr 10, 2010 at 15:51 UTC
    To amend my stupidity yesterday, here's another approach by peeking into the next line:

    p.s.: found japhy's much better implementation for the peeking: Peek.pm
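
    The peeking idea might look something like the sketch below. It is not the code from this post and does not use Peek.pm; the helper name parse_with_peek is invented, and it assumes that a title is always the line directly before a line made only of dashes, with a line made only of asterisks separating the two bodies.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the lines with a one-line lookahead, so a title
# is recognised because the *next* line consists of dashes.
sub parse_with_peek {
    my @lines = @_;
    chomp @lines;

    my ( @records, $field );
    for my $i ( 0 .. $#lines ) {
        my $line = $lines[$i];
        my $next = $i < $#lines ? $lines[ $i + 1 ] : '';

        if ( $next =~ /^-+$/ ) {         # peek: this line is a title
            push @records, { a => $line, b => '', c => '' };
            $field = undef;
        }
        elsif ( $line =~ /^-+$/ ) {      # dashes: body b starts
            $field = 'b';
        }
        elsif ( $line =~ /^\*+$/ ) {     # asterisks: body c starts
            $field = 'c';
        }
        elsif ( defined $field ) {       # ordinary text for the current body
            $records[-1]{$field} .= "$line\n";
        }
    }
    return @records;
}
```

    Because the decision about a line is made while looking at its successor, nothing has to be popped back out of the previous record afterwards.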
