PerlMonks  

parse file per customized separator / block / metadata

by raiten (Acolyte)
on Mar 06, 2010 at 16:04 UTC (#827139)
raiten has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monkers,

I'm trying to solve a simple problem: grep several big files that use different separator patterns, applying the same matching rule to all of them, and output the whole matching block (not only the matching line). A sort of grepmail, but for any kind of text file.

To be more explicit, the input files could contain multiple blocks like:

case 1:

    header1=val1
    header2=val2
    data

case 2:

    header1: val1
    header2: val2
    data

case 3: like the previous ones or not, but with a separator line matching '^-=+$'

The matching rules need to be customized each time. The input files could number in the hundreds, each gigabytes in size, so performance needs to be acceptable :)

So far, apart from manual parsing, the only relevant module I have found is Parse::File::Metadata (1). Does anyone have hints about other modules, or other ways, to handle this?

thanks a lot.

Cheers

(1) http://search.cpan.org/~jkeenan/Parse-File-Metadata-0.04/lib/Parse/File/Metadata.pm

Re: parse file per customized separator / block / metadata
by almut (Canon) on Mar 06, 2010 at 16:43 UTC

    Not sure I understand entirely. Could you elaborate on the details, i.e. is there some constant record separator between the multi-line parts you want to extract? Where does the to-be-customized separator occur; is it just the thing between header and val — in which case you could possibly find a regex that handles all cases in one go while matching against the records.  A few more lines of sample input, and some sample search commands might help.
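To illustrate the "one regex for all cases" idea, here is a minimal sketch that accepts both the `name=value` and the `name: value` header styles from the question. The helper name `parse_header` is hypothetical, and the pattern assumes header names consist of word characters only; a data line that happens to start with a word followed by `:` or `=` would be misread as a header.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "name=value" or "name: value" header line.
# Returns (name, value) on success, or an empty list for non-header lines.
sub parse_header {
    my ($line) = @_;
    return $line =~ /^(\w+)\s*[=:]\s*(.*?)\s*$/ ? ($1, $2) : ();
}

for my $line ('header1=val1', 'header2: val2', 'header4= val4', 'data') {
    my ($name, $value) = parse_header($line);
    print defined $name ? "$name => $value\n" : "not a header: $line\n";
}
```

The character class `[=:]` is what lets a single pattern cover cases 1 and 2; anything else about the record layout would still need to be confirmed by the original poster.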

Re: parse file per customized separator / block / metadata
by Utilitarian (Vicar) on Mar 06, 2010 at 17:03 UTC
    Try using \Q and \E to escape the special characters in the split. eg.
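        #!/usr/bin/perl
        use strict;
        use warnings;

        my ($first_line, %records, $sep);
        my $data_start = tell(DATA);    # remember where the __DATA__ section begins
        chomp($first_line = readline(DATA));
        # Rewind to the first data line. A hard-coded byte offset only works for one
        # exact copy of the script; for an ordinary file, seek(FILEHANDLE,0,0)
        # returns to the start of the file.
        seek(DATA, $data_start, 0);
        print "Please enter the seperator example line follows,:\n\t\"$first_line\"\n";
        chomp($sep = <STDIN>);
        while (<DATA>) {
            my @record = split /\Q$sep\E/, $_;    # \Q...\E makes the separator literal
            $records{$record[0]} = $record[1];
        }
        for my $name (sort keys %records) {
            print "$name has the value $records{$name}\n";
        }
        __DATA__
        name :-$& value
        anim :-$& freagra
        nomme :-$& valeur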
    Works as follows:
        ~/tmp$ perl tmp.pl
        Please enter the seperator example line follows,:
                "name :-$& value"
        :-$&
        anim has the value freagra
        name has the value value
        nomme has the value valeur

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
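As a small illustration of why `\Q...\E` matters here, the sketch below splits the sample line on the separator `:-$&` with and without escaping. The helper name `split_literal` is hypothetical, and trimming the whitespace around the separator is an added assumption, not something the post above does.

```perl
use strict;
use warnings;

my $sep  = ':-$&';               # separator from the sample data (single quotes: no interpolation)
my $line = 'name :-$& value';

# Hypothetical helper: split on a literal separator string, also swallowing
# any whitespace directly around the separator.
sub split_literal {
    my ($separator, $text) = @_;
    return split /\s*\Q$separator\E\s*/, $text;
}

my @escaped = split_literal($sep, $line);
my @raw     = split /$sep/, $line;    # here "$" is treated as a regex anchor

print "escaped: ", scalar(@escaped), " field(s): ", join('|', @escaped), "\n";
print "raw:     ", scalar(@raw), " field(s)\n";
```

With escaping, the metacharacters in a user-supplied separator are matched literally; without it, the pattern means something else entirely and the line is not split where intended.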
Re: parse file per customized separator / block / metadata
by repellent (Priest) on Mar 06, 2010 at 21:51 UTC
    Given the lack of information about what you're trying to achieve, seems like Parse::File::Metadata is a good fit so far. Why are you looking for another alternative?

    Another module that comes to mind is File::Stream, which would help with grepping blocks of data with different separator patterns.

      Input file could be like

          header1=val1
          header1b=val1b
          data1
          ==================
          header2: val2
          header2b: val2b
          data2
          ===============
          header3: val3
          header3b: val3b
          data3

          header4= val4
          header4b= val4b
          data4

      From these 4 blocks of data, I want to extract the ones that match one or more regexps. I could grep them, but then I would need to reassemble the data block afterwards, so I'm looking for alternative solutions: a module/library or a tool. The separator can change within the same file.

      I quickly checked File::Stream (1) and it seems to be a possible option.

      As for why I'm looking for different solutions, that's a fairly common challenge :). Finding different views of the problem, different solutions, better performance, cleaner code, more portability, and so on ...

      (1) http://search.cpan.org/~smueller/File-Stream-2.20/lib/File/Stream.pm
      http://www.justskins.com/forums/file-stream-confusion-80665.html
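For comparison, here is a minimal sketch of the "manual parsing" route in plain Perl, with no CPAN module: read line by line, collect lines into a block, and keep the block only if it matches every pattern. The helper name `grep_blocks` and its arguments are hypothetical, and it assumes a block ends at a line matching the separator regex (here, a line of `=` characters or an empty line), so blank lines inside a block would split it. Only one block is held in memory at a time, which should matter for gigabyte-sized inputs.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line, group lines into blocks, and
# call $on_match with every block that matches all of the given patterns.
# Assumption: a block ends at a line matching $sep_re (or at end of file).
sub grep_blocks {
    my ($fh, $sep_re, $patterns, $on_match) = @_;
    my $block = '';
    my $flush = sub {
        return if $block eq '';
        my $keep = 1;
        for my $re (@$patterns) {
            if ($block !~ $re) { $keep = 0; last }
        }
        $on_match->($block) if $keep;
        $block = '';
    };
    while (my $line = <$fh>) {
        if ($line =~ $sep_re) { $flush->() }
        else                  { $block .= $line }
    }
    $flush->();    # the last block has no trailing separator
    return;
}

# Sample input modelled on the data shown above; the line breaks are assumed.
my $sample = join "\n",
    'header1=val1', 'header1b=val1b', 'data1', '==================',
    'header2: val2', 'header2b: val2b', 'data2', '===============',
    'header3: val3', 'header3b: val3b', 'data3', '',
    'header4= val4', 'header4b= val4b', 'data4', '';

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my @hits;
grep_blocks($fh, qr/^=*$/, [qr/^header\d\w*:/m, qr/data[23]/],
            sub { push @hits, $_[0] });
close $fh;
print "--- block ---\n$_" for @hits;
```

Passing the separator and the match patterns as arguments is one way to keep the rules customizable per run; for real files the in-memory handle would be replaced by a normal `open` on each input file.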

        It would really help if there were some = equal signs separating data3 and header4. Seems like File::Stream doesn't handle lookaheads that well. Nevertheless, here's an example that may help:
            use File::Stream;

            # A new block starts where a header name followed by "=" or ":" appears.
            my $lookahead_regex = qr/\w+[=:]/;
            my ($handler, $stream) = File::Stream->new(
                *DATA,
                separator => qr/\n=*\n$lookahead_regex/,
            );

            my $lookahead = "";
            while (my $block = <$stream>) {
                # The separator swallows the start of the next block's first header;
                # strip it here and prepend it to the following block.
                $block =~ s/($lookahead_regex)$//;
                $block = $lookahead . $block;
                $lookahead = $1;
                print $block;
                print "-" x 60, "\n";
            }
            __END__
            header1=val1
            header1b=val1b
            data1
            ==================
            header2: val2
            header2b: val2b
            data2
            ===============
            header3: val3
            header3b: val3b
            data3

            header4= val4
            header4b= val4b
            data4

        Output:

Node Type: perlquestion [id://827139]
Approved by McDarren