http://www.perlmonks.org?node_id=486645

Bentov has asked for the wisdom of the Perl Monks concerning the following question:

Greetings to the perl monks, I seek your wisdom. I have some text files that I need to convert to a .csv, but there is a small twist to it. The file is in the following format.
heading1:
text1
text2
text3
.
.
.
heading2:
text1
text2
text3
.
.
and so on. Oh, and yes, the colons after the headings are part of the data. I need to encode the file so that the text lines under heading1 are kept together, then the lines under heading2 are kept together, and so on; however, I need to keep the CRLF at the end of the text lines. The lines are not a fixed length. I was heading towards something that would parse the file by pattern matching for the headings, but wasn't sure how to pull out the data between the headings. My goal is to import this file into a table where the field names are the same as the header names, and the multiple lines are the data in the fields. I hope that makes sense, any and all help is greatly appreciated. Thanks, Bentov

Re: Textfile to csv with a small twist
by kvale (Monsignor) on Aug 25, 2005 at 17:57 UTC
    Parsing with a state variable ($category in this case) is one way to remember which heading the text falls under:
    use Data::Dumper;
    use strict;
    use warnings;

    my %parse_tree;
    my $category;
    while (my $line = <DATA>) {
        if ($line =~ /^(\w+:)$/) {
            $category = $1;
            $parse_tree{ $category} = [];
        }
        else {
            push @{$parse_tree{ $category}}, $line;
        }
    }
    print Dumper( \%parse_tree);
    __DATA__
    heading1:
    text1
    text2
    text3
    heading2:
    text4
    text5
    text6
    yields
    $VAR1 = {
        'heading1:' => [
            'text1
    ',
            'text2
    ',
            'text3
    '
        ],
        'heading2:' => [
            'text4
    ',
            'text5
    ',
            'text6
    '
        ]
    };
    In the dumped hash, note that the newlines are preserved.

    Update: altered the regex to capture the colon.

    -Mark

Re: Textfile to csv with a small twist
by jZed (Prior) on Aug 25, 2005 at 17:46 UTC
    Two things I don't understand: 1) how do you recognize a heading? A hard-coded list? Anything with a trailing colon? Something else? 2) What do you want the table structure to be? I understand the headings are columns, but what about the rows? Is it like this or something else:
         heading1 | heading2
         ---------+---------
         text1    | text1
         text2    | text2
    
Re: Textfile to csv with a small twist
by InfiniteSilence (Curate) on Aug 25, 2005 at 17:51 UTC
    There are probably a bunch of regexes you can use to do this with the /s flag, but I would just do it programmatically:
    #!/usr/bin/perl -w
    my $output = '';
    while(<DATA>){
        if(/\:$/){$output .= qq|\n$_|}
        else {chomp($_);$output .= $_ . q|,|};
    }
    print $output;
    1;
    __DATA__
    heading1:
    this
    is
    a
    tst
    heading2:
    this
    is
    another
    test

    The result is:

    heading1:
    this,is,a,tst,
    heading2:
    this,is,another,test,

    Celebrate Intellectual Diversity

Re: Textfile to csv with a small twist
by GrandFather (Saint) on Aug 26, 2005 at 03:23 UTC

    It seems that what you want to do is refactor the data from:

    heading1:
    h1 text1
    h1 text2
    h1 text3
    heading2:
    h2 text1
    h2 text2
    h2 text3
    ...

    to:

    heading1,heading2
    h1 text1,h2 text1
    h1 text2,h2 text2
    h1 text3,h2 text3
    ...

    in which case you need to read the data into an array of arrays (where each sub array contains all the data for a column, with the header first). You then need to write the data out (using one of the csv modules) one row at a time, with one field per major array element, by shifting the first element off each sub array.

    If this is what you are trying to achieve and you need more help with the implementation, ask again and you shall receive :).
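The array-of-arrays approach described above can be sketched roughly as follows. This is only an illustration under assumptions: read_columns and columns_to_rows are hypothetical names, the heading test reuses the /^(\w+:)$/ pattern from kvale's reply (so the colon stays in the header), short columns are padded with empty strings, and a plain join stands in for the csv module that should do the real writing.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "heading:" blocks into an array of columns,
# each column being [ heading, line1, line2, ... ].
sub read_columns {
    my ($text) = @_;
    my @columns;
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^(\w+:)$/) {
            push @columns, [$1];             # new column, header first
        }
        elsif (@columns) {
            push @{ $columns[-1] }, $line;   # data for the current column
        }
    }
    return @columns;
}

# Hypothetical helper: shift one element off every column per output row.
sub columns_to_rows {
    my @columns = map { [@$_] } @_;          # copy so the caller's data survives
    my @rows;
    while (grep { @$_ > 0 } @columns) {
        my @row;
        for my $col (@columns) {
            # Assumption: a column that has run out contributes an empty field.
            push @row, (@$col ? shift @$col : '');
        }
        push @rows, \@row;
    }
    return @rows;
}

my $sample = "heading1:\nh1 text1\nh1 text2\nheading2:\nh2 text1\nh2 text2\n";
my @rows   = columns_to_rows(read_columns($sample));

# A plain join is used only to show the shape of the rows.
print join(',', @$_), "\n" for @rows;
```

Because each column is consumed from the front, the first row that comes out is the header row, and every later row holds the n-th text line of each heading.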


    Perl is Huffman encoded by design.
      Thanks to all of you! I truly appreciate all of the help that I have been given, I was able to complete my task and I wouldn't have been able to w/o your help. Thanks Again, Bentov
Re: Textfile to csv with a small twist
by sapnac (Beadle) on Aug 25, 2005 at 17:53 UTC
    Hello!
    Do you know the number of columns involved? If you do, then one approach is:

    While not EOF
        read the file
        while not column count
            read the file and put the data in the other file (prepend/append comma depending on the detailed logic)
        endwhile
        line break
    End while Eof

    I had a similar situation and this is how I went about it. Hope it helps!
Re: Textfile to csv with a small twist
by pbeckingham (Parson) on Aug 25, 2005 at 18:22 UTC

    #! /usr/bin/perl
    use strict;
    use warnings;

    my %data;
    my @columns;
    my $current;
    my $line;

    while (<DATA>) {
      chomp;
      $line = $_;
      if ($line =~ /:$/) {
        # A heading: remember it as a column and start an empty list for it.
        push @columns, $line;
        $current = $line;
        $data{$current} = [];
      }
      else {
        push @{$data{$current}}, $line;
      }
    }

    # Transpose: one output row per text line, one field per heading.
    print join (",", @columns), "\n";
    my $count = @{$data{$columns[0]}};
    for my $i (0 .. $count - 1) {
      for my $c (0 .. $#columns) {
        print $data{$columns[$c]}[$i];
        print "," if $c < $#columns;
      }
      print "\n";
    }
    __DATA__
    heading1:
    text1
    text2
    text3
    heading2:
    text1
    text2
    text3
    Output is:
    heading1:,heading2:
    text1,text1
    text2,text2
    text3,text3



    pbeckingham - typist, perishable vertebrate.
      Except that the OP specified he/she wants the newlines as part of the data. So instead of trying to hand roll your CSV generator, use Text::CSV_XS or another CSV parsing module that's capable of recognizing and handling embedded newlines, embedded quotes, and other features hand-rolled CSV parsing usually miss.

        But it doesn't do any CSV parsing - it just reads lines. What exactly would you do with a "CSV parsing module that's capable of recognizing and handling embedded newlines, embedded quotes, and other features hand-rolled CSV parsing usually miss"? There are only text lines to read, and only CSV lines to produce. I was just illustrating a method of reading data and transposing it for output.



        pbeckingham - typist, perishable vertebrate.
Re: Textfile to csv with a small twist
by ChrisR (Hermit) on Aug 25, 2005 at 18:42 UTC
    If I understood your post correctly, here's one way:
    #!c:\perl\bin\perl -w
    use strict;

    my $currentheading;
    my %hash;
    my @headings;
    my @array;

    open(FILE,"c:\\test.txt") or die "Cannot open c:\\test.txt: $!";
    while(my $line = <FILE>) {
        chomp($line);
        if($line =~ /(.*?:)$/) {
            $currentheading = $1;
            push @headings, $currentheading;
        }
        else {
            push @{$hash{$currentheading}}, $line;
        }
    }

    # Transpose: $array[record][column]
    for my $x(0..$#headings) {
        my $record = 0;
        for my $y(0..$#{$hash{$headings[$x]}}) {
            $array[$record][$x] = $hash{$headings[$x]}[$y];
            $record++;
        }
    }

    my $header = join ",",@headings;
    print "$header\n";
    for my $x(0..$#array) {
        my $recordline = join ",", @{$array[$x]};
        print "$recordline\n";
    }
    This will handle missing fields in certain records. Perl will issue a warning however if a field is missing.

    Note: I removed the newlines to show the data in an easily readable format. To keep them just remove the line: chomp($line);

    Update: looks like pbeckingham beat me to a similar solution.

      You make the same mistake as pbeckingham - CSV seems like a simple join with commas, but that only works for very simple CSV. If there are embedded commas, quote marks, or newlines, the join will produce garbage. Use a CSV parsing module!
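To make the objection concrete, here is a minimal sketch of the quoting that a bare join skips. It is not Text::CSV_XS and not a replacement for it: csv_field and csv_line are hypothetical helpers written for illustration, assuming the common convention that a field containing a comma, quote mark, CR or LF is wrapped in double quotes and embedded quote marks are doubled.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field if it contains anything that a
# plain join(",") would turn into garbage.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;      # double any embedded quote marks
        return qq{"$value"};     # wrap the whole field in quotes
    }
    return $value;
}

# Hypothetical helper: build one record, terminated by CRLF.
sub csv_line {
    return join(',', map { csv_field($_) } @_) . "\r\n";
}

print csv_line('heading1:', 'heading2:');
# The first field keeps its embedded CRLF; the second has quotes and a comma.
print csv_line("text1\r\ntext2", 'say "hi", twice');
```

A CSV module does this and more (encodings, reading the data back in), which is why it is the safer choice once the data stops being sample data.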

        The data provided is clearly *sample* data, and the code provided is of the same nature. The OP is asking about how to approach this problem. You're complaining that complete, robust solutions are not being provided, and that's where I think you are missing the point.



        pbeckingham - typist, perishable vertebrate.
        Wow! I appreciate all of the input to my problem. InfiniteSilence's output is the closest to what I'm looking for (I haven't looked at the output from all of the examples yet); however, seeing the varied replies, I see I didn't explain myself clearly enough. I am basically looking for output like his/hers, except w/o the commas, and with the CRLFs still in there. I believe I can modify the code provided to suit my needs, but what do I know? My perl knowledge only fills a matchbook :( Bentov
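One possible reading of this clarification is: gather every text line under a heading into a single block and leave the line endings alone. The sketch below does only that. group_by_heading is a hypothetical name, the heading pattern is borrowed from kvale's reply, and treating the whole block as one field per heading is an assumption about what the import needs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the headings in file order plus a hash that
# maps each heading to all of its text lines, line endings left untouched.
sub group_by_heading {
    my ($text) = @_;
    my (@headings, %block, $current);
    # Split into lines without throwing away the trailing CRLF or LF.
    for my $line ($text =~ /([^\n]*\n|[^\n]+\z)/g) {
        if ($line =~ /^(\w+:)\r?\n?\z/) {
            $current = $1;
            push @headings, $current unless exists $block{$current};
            $block{$current} = '' unless defined $block{$current};
        }
        elsif (defined $current) {
            $block{$current} .= $line;   # keep the CRLF that ends the line
        }
    }
    return (\@headings, \%block);
}

my ($headings, $block) = group_by_heading(
    "heading1:\r\nthis\r\nis\r\na\r\ntst\r\nheading2:\r\nthis\r\nis\r\nanother\r\ntest\r\n"
);

# Each heading is followed by its block, with no commas added.
print "$_\r\n$block->{$_}" for @$headings;
```

If the blocks are later written to a .csv, each one contains CRLFs and so has to be quoted as a single field, which is the case a CSV module handles.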