PerlMonks  

text processing

by DAVERN (Initiate)
on Apr 22, 2014 at 17:10 UTC

DAVERN has asked for the wisdom of the Perl Monks concerning the following question:

I am a beginner in Perl and cannot find a way to skip some lines (headers, table name) and do something else with the rest of the lines (print them with some modifications). Could someone help me?

Data as follows

TABLE NAME
HEAD0 HEAD1 HEAD2
DATA00 DATA10 DATA20
DATA01 DATA11 DATA21
END

i need the next result

xxx=DATA00, xxx=DATA10, xxx=DATA20;

Replies are listed 'Best First'.
Re: text processing
by Limbic~Region (Chancellor) on Apr 22, 2014 at 17:23 UTC
    DAVERN,
    There are many ways to do this because one of Perl's mottos is there is more than one way to do it.

    The first method is probably the most common and easiest to think of. At the top of the loop, skip lines that you don't want

    while (<DATA>) {
        next if ! /^DATA/;
        # ...
    }

    Another common first step might be to throw away the first N lines

    <DATA> for 1 .. 3;
    while (<DATA>) {
        # ...
    }

    Sometimes it gets more complicated and you need to check a state variable against multiple lines. I won't give you an example of that abstract case but I will show you what some people do:

    my $found_start_of_data;
    while (<DATA>) {
        last if $found_start_of_data;
        # ... some complex code that sets the flag
    }
    while (<DATA>) {
        # ...
    }

    The above contains a micro-optimization which you should avoid unless you need it. Essentially, it avoids paying the cost of checking every line to see whether we have reached the good data, and starts a new loop containing only the processing we care about.
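
A runnable variant of the two-loop idea might look like the sketch below. It is only an illustration under assumptions: the sample text, the sub name rows_after_header, and the choice of the HEAD row as the marker for "start of good data" are not from the post. Here the first loop leaves as soon as it has consumed the header row, so the second loop begins at the first data row.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the data in the original question.
my $text = <<'END_TEXT';
TABLE NAME
HEAD0 HEAD1 HEAD2
DATA00 DATA10 DATA20
DATA01 DATA11 DATA21
END
END_TEXT

# Returns the data rows that follow the header row (assumed to start with HEAD).
sub rows_after_header {
    my ($input) = @_;
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

    # First loop: consume lines up to and including the header row.
    while (my $line = <$fh>) {
        last if $line =~ /^HEAD/;
    }

    # Second loop: only the processing we care about, with no start check.
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        last if $line =~ /^END/;
        next unless length $line;
        push @rows, $line;
    }
    close $fh;
    return @rows;
}

print "$_\n" for rows_after_header($text);
```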

    The final method I will share is where you extract or eliminate what you don't want.

    # Extract
    my ($want) = $data =~ m{(some_regular_expression)};

    # Eliminate
    $data =~ s{some_regular_expression}{};

    As you can see, there are many ways to do what you are looking to accomplish. If you don't understand something, please ask.
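
As a concrete illustration of the extract and eliminate approaches, here is a minimal sketch that works on the whole text held in one string. The sample text and both regular expressions are assumptions made for this example, not something the post specifies.

```perl
use strict;
use warnings;

# Hypothetical sample text, modelled on the data in the original question.
my $data = "TABLE NAME\nHEAD0 HEAD1 HEAD2\n"
         . "DATA00 DATA10 DATA20\nDATA01 DATA11 DATA21\nEND\n";

# Extract: capture everything between the header row and the END line.
# /m lets ^ match at each line start, /s lets . match newlines.
my ($want) = $data =~ m{^HEAD[^\n]*\n(.*?)^END}ms;

# Eliminate: delete every line that does not start with DATA,
# working on a copy so that $data itself is left untouched.
(my $kept = $data) =~ s{^(?!DATA)[^\n]*\n}{}mg;

print $want;
print $kept;
```

With this input both approaches leave the same two DATA rows.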

    Cheers - L~R

      Hi Limbic~Region, I did it in two separate programs. In the first one I delete the lines I do not use from the file and generate a new file; in the second program I process the rest of the text. I want to join them but cannot find a way.

      my $output = 'output.txt';
      open my $outfile, '>', $output or die "Can't write to $output: $!";
      my @array = read_file('file1.log');
      for (@array){
          next if ($_ =~ /^\TABLE NAME|HEAD0|END|^\s+$/);
          print $outfile $_ ;
      }

      Second file:

      open my $IN, '<', 'output.txt' or die $!;
      my @lines = <$IN>;
      close $IN;

      open my $OUT, '>', 'file2.txt' or die $!;
      for my $line(@lines){
          chomp $line;
          my @data = split /\s+/, $line;
          print {$OUT} "xxxxx", $data[0], "yyy", $data2,";","\n";
      }
      close $OUT;

      I have no idea how to do it all in a single program.

      BR

        Your focus appears to be all wrong. If you are looking for something specific in a file why not just select that thing?

        my @output = ();
        while(<DATA>){
            next unless (m/DATA/);
            my $line = $_;
            while($line=~m/(DATA\d+)/g){
                push @output,$1;
            }
        }
        print join qq|,|, map {qq~xxx=$_~} @output;
        print qq|;\n|;
        1;
        __END__
        TABLE NAME
        HEAD0 HEAD1 HEAD2
        DATA00 DATA10 DATA20
        DATA01 DATA11 DATA21
        END

        Produces...
        xxx=DATA00,xxx=DATA10,xxx=DATA20,xxx=DATA01,xxx=DATA11,xxx=DATA21;

        Celebrate Intellectual Diversity

Re: text processing
by davido (Cardinal) on Apr 22, 2014 at 19:34 UTC

    The problem seems simple enough, but isn't specified completely enough for a complete solution that doesn't involve a bit of lucky guessing. Is there another record that comes after END, for example? Is the number of fields the same for each row? Are there always two rows per record?

    At minimum, it does appear that you're dealing with fixed-width fields, and that you want to skip the first four lines. It's not clear to me what you want to have happen after "END" (continue on to a new record, or stop?). And will that next record have its own headers? Will it have the same format as the first record?

    For fixed-width fields, you might want to use unpack, for example: my @fields = unpack '(a7x)2a7', $_; This will have to come after whatever logic you use to disqualify some lines. That logic might look like this:

    while( <DATA> ) {
        next if $. < 5;
        chomp;
        next if ! length;
        last if /^END/;
        my @fields = unpack '(a7x)2a7', $_;
        # Do something with the fields.
    }
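
To see what the '(a7x)2a7' template does to a single row, here is a small sketch. The row is built with sprintf under the assumption of three 7-character columns separated by one space; the real column widths in the file are not known from the post.

```perl
use strict;
use warnings;

# Hypothetical fixed-width row: three 7-character columns, one space apart.
my $row = sprintf '%-7s %-7s %-7s', 'DATA00', 'DATA10', 'DATA20';

# '(a7x)2a7' reads 7 bytes and skips 1, twice, then reads the last 7 bytes.
# The 'a' format returns each field as-is, including padding spaces.
my @raw = unpack '(a7x)2a7', $row;

# The 'A' format reads the same bytes but strips trailing spaces.
my @trimmed = unpack '(A7x)2A7', $row;

print join('|', @raw), "\n";
print join('|', @trimmed), "\n";
```

If the padding is not wanted in the fields, the 'A' variant saves a separate trimming step.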

    This would change a bit if there is more than one record you're interested in. You might incorporate the flip-flop operator like this:

    my $record_start = 0;
    my @recs;
    while( <DATA> ) {
        chomp;
        if( /^TABLE NAME/ .. /^END/ ) {
            # We're in a new record...
            if( /^TABLE NAME/ ) {
                $record_start = $.;
                push @recs, [];
            }
            next unless $. > $record_start + 3; # Skip header.
            next if ! length;
            next if /^END/;
            my @fields = unpack '(a7x)2a7', $_;
            # Do something with fields, such as...
            push @{$recs[-1]}, [@fields];
        }
    }

    (Updated to demonstrate pushing records onto a "@recs" array.)
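
For readers who have not met the flip-flop operator before, the following self-contained sketch shows it collecting two records. It is a simplified variation, not the code above: the two-table input is invented for illustration, the sub name collect_records is hypothetical, and rows are split on whitespace instead of being unpacked.

```perl
use strict;
use warnings;

# Hypothetical input with two records; the second table is invented.
my $text = <<'END_TEXT';
TABLE ONE
HEAD0 HEAD1 HEAD2
DATA00 DATA10 DATA20
END
noise between records
TABLE TWO
HEAD0 HEAD1 HEAD2
DATA01 DATA11 DATA21
END
END_TEXT

# Returns one array ref per record, each holding array refs of fields.
sub collect_records {
    my ($input) = @_;
    my @recs;
    for my $line (split /\n/, $input) {
        # In scalar context, .. is true from a TABLE line through the next END line.
        next unless $line =~ /^TABLE/ .. $line =~ /^END/;
        if ($line =~ /^TABLE/) {
            push @recs, [];             # start a new record
            next;
        }
        next unless $line =~ /^DATA/;   # skip header and END lines
        push @{ $recs[-1] }, [ split ' ', $line ];
    }
    return @recs;
}

my @recs = collect_records($text);
for my $i (0 .. $#recs) {
    print "record ", $i + 1, ": ",
        join('; ', map { join(',', @$_) } @{ $recs[$i] }), "\n";
}
```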


    Dave

      Actually, David, in this case I do believe that there’s enough information here to point to a classic, awk-inspired solution. The “set of records of interest” is clearly bounded by an identifiable “start” and “end” record, and, within that space, the records which contain information of interest are readily identifiable. Thus, logic could be written, I think, based only on the example file presented in the original post. And this logic would basically be in keeping with the metaphor that the awk tool already employs. (Which means, of course, that a very short Perl program could also do the same.)

Re: text processing
by kcott (Archbishop) on Apr 23, 2014 at 13:10 UTC

    G'day DAVERN,

    Welcome to the monastery.

    You originally wrote:

    "i need the next result

    xxx=DATA00, xxx=DATA10, xxx=DATA20; "

    This is easy:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    while (<DATA>) {
        next unless /^DATA/;
        print 'xxx=', join(', xxx=' => split), ';';
    }

    __DATA__
    TABLE NAME
    HEAD0 HEAD1 HEAD2
    DATA00 DATA10 DATA20
    DATA01 DATA11 DATA21
    END

    Output:

    xxx=DATA00, xxx=DATA10, xxx=DATA20;
    xxx=DATA01, xxx=DATA11, xxx=DATA21;

    You then showed some unformatted code:

    "print {$OUT} "xxxxx", $data[0], "yyy", $data2,";","\n";"

    Guessing that's supposed to be:

    print {$OUT} "xxxxx", $data[0], "yyy", $data[2],";","\n";

    This is only slightly less easy. Just change the print line to:

    print 'xxxxx', join('yyy' => (split)[0,2]), ';';

    Output:

    xxxxxDATA00yyyDATA20;
    xxxxxDATA01yyyDATA21;

    Please give careful consideration to what you are attempting to achieve before posting. You'll find monks may be less inclined to help if you keep changing what you want.

    Your code for opening files seems absolutely fine. Change <DATA> to <$your_input_filehandle> and print ... to print {$your_output_filehandle} ... and that should do what you want.
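
Putting those two substitutions together, a single program might look roughly like the sketch below. The sub name convert is hypothetical, and in-memory filehandles are used only to keep the example self-contained; with real files you would open 'file1.log' for reading and 'file2.txt' for writing, as in the code posted earlier in the thread.

```perl
use strict;
use warnings;

# Reads lines from $in, skips everything that is not a data row,
# and writes each data row to $out in the requested format.
sub convert {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        next unless $line =~ /^DATA/;
        # split ' ' splits on whitespace and also drops the trailing newline.
        print {$out} 'xxx=', join(', xxx=', split ' ', $line), ";\n";
    }
    return;
}

# Demonstration with in-memory filehandles standing in for real files.
my $input = "TABLE NAME\nHEAD0 HEAD1 HEAD2\n"
          . "DATA00 DATA10 DATA20\nDATA01 DATA11 DATA21\nEND\n";
my $result = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$result or die "Cannot write output: $!";
convert($in, $out);
close $in;
close $out;

print $result;
```

Filtering and formatting happen in the same pass, so no intermediate file is needed.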

    -- Ken

Re: text processing (csv)
by Anonymous Monk on Apr 22, 2014 at 17:24 UTC
Re: text processing
by sundialsvc4 (Abbot) on Apr 22, 2014 at 18:22 UTC

    This sort of problem is, actually, extremely common. It is, in fact, the inspiration for the awk tool that was one of the original inspirations for Perl.

    In general, problems like this one can be solved “text line by text line,” and can be reduced, algorithmically speaking, to four cases, all of which can (somehow) be recognized by the contents of the line (and/or by “beginning of file” and/or “end of file”):

    1. A record which marks the start of some area of interest, such as, in this case, TABLE.
    2. A record which marks the end of an area (and the generation of output based on accumulated data), such as END.
    3. A record which marks data to be gathered in memory in anticipation of future output, such as DATAnn.
    4. A record whose presence is expected but otherwise uninteresting, such as HEADn.

    Your immediate requirement could actually be addressed entirely by awk, and nothing else, and perhaps for this very reason you might elect to do so. In any case, the man-page hyperlinked above should now be read carefully and thoroughly.
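
The four cases above can be sketched in Perl as one loop with a small dispatch on the line contents. This is only an illustration under assumptions: the sub name process_lines is hypothetical, the patterns are guessed from the sample data in the question, and the output format is the one the question asks for.

```perl
use strict;
use warnings;

# Takes a list of lines and returns the formatted output lines.
sub process_lines {
    my @lines    = @_;
    my $in_table = 0;
    my @rows;
    my @output;
    for my $line (@lines) {
        if ($line =~ /^TABLE/) {
            # Case 1: start of an area of interest.
            $in_table = 1;
            @rows     = ();
        }
        elsif ($line =~ /^END/) {
            # Case 2: end of the area, emit what was accumulated.
            push @output, map { 'xxx=' . join(', xxx=', @$_) . ';' } @rows;
            $in_table = 0;
        }
        elsif ($in_table && $line =~ /^DATA/) {
            # Case 3: data to gather in memory for later output.
            push @rows, [ split ' ', $line ];
        }
        # Case 4: anything else (such as HEADn) is expected and ignored.
    }
    return @output;
}

my @sample = (
    'TABLE NAME',
    'HEAD0 HEAD1 HEAD2',
    'DATA00 DATA10 DATA20',
    'DATA01 DATA11 DATA21',
    'END',
);
print "$_\n" for process_lines(@sample);
```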
