PerlMonks  

Re^2: parse file per customized separator / block / metadata

by raiten (Acolyte)
on Mar 07, 2010 at 12:12 UTC (#827220)


in reply to Re: parse file per customized separator / block / metadata
in thread parse file per customized separator / block / metadata

Input file could be like

    header1=val1
    header1b=val1b
    data1
    ==================
    header2: val2
    header2b: val2b
    data2
    ===============
    header3: val3
    header3b: val3b
    data3

    header4= val4
    header4b= val4b
    data4

From these 4 blocks of data, I want to extract the ones that match one or more regexps. I could grep for them, but then I would have to reassemble the data blocks afterwards, so I'm looking for alternative solutions: a module, a library, or a tool. The separator can change within the same file.
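To make the goal concrete, here is a minimal core-Perl sketch of "split into blocks, then keep the blocks that match". It is only an illustration under assumptions: the input fits in memory, a block ends at a line of equal signs or at a blank line that is followed by a "name=" or "name:" header, and the helper names split_blocks and matching_blocks are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blocks. A block ends at a line of
# '=' signs or at a blank line, but only when a "name=" or "name:" header
# follows. The header test is a zero-width lookahead, so it stays in place.
sub split_blocks {
    my ($text) = @_;
    $text =~ s/\n+\z//;    # drop trailing newlines
    return grep { length } split /\n(?:=+\n|\n)(?=\w+[=:])/, $text;
}

# Hypothetical helper: keep the blocks matching at least one pattern.
sub matching_blocks {
    my ($blocks, @patterns) = @_;
    return grep {
        my $block = $_;
        scalar grep { $block =~ $_ } @patterns;
    } @$blocks;
}

# Sample input; the blank line before header4 is an assumption.
my $text = join "\n",
    'header1=val1', 'header1b=val1b', 'data1',
    '==================',
    'header2: val2', 'header2b: val2b', 'data2',
    '===============',
    'header3: val3', 'header3b: val3b', 'data3',
    '',
    'header4= val4', 'header4b= val4b', 'data4';

my @blocks = split_blocks($text);
my @hits   = matching_blocks(\@blocks, qr/^header2:/m, qr/val4b/);
print "$_\n", '-' x 40, "\n" for @hits;    # prints blocks 2 and 4
```

Because each block is kept whole, nothing has to be reassembled after the match.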

I had a quick look at File::Stream (1), and it seems to be a possible option.

As for why I'm looking for different solutions: that's a common kind of challenge :). It means finding different views of the problem, different solutions, better performance, cleaner code, more portable code, and so on.

(1) http://search.cpan.org/~smueller/File-Stream-2.20/lib/File/Stream.pm
http://www.justskins.com/forums/file-stream-confusion-80665.html


Re^3: parse file per customized separator / block / metadata
by repellent (Priest) on Mar 07, 2010 at 20:34 UTC
    It would really help if there were some = equal signs separating data3 and header4. Seems like File::Stream doesn't handle lookaheads that well. Nevertheless, here's an example that may help:
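        use File::Stream;

        # Start of a header line, e.g. "header2:" or "header4="
        my $lookahead_regex = qr/\w+[=:]/;

        # A block ends at a newline, an optional run of '=', another newline,
        # and the start of the next header (which the separator consumes).
        my ($handler, $stream) = File::Stream->new(
            *DATA,
            separator => qr/\n=*\n$lookahead_regex/,
        );

        my $lookahead = "";
        while (my $block = <$stream>) {
            # Strip the next header's start from the end of this block,
            # prepend what the previous block consumed, and carry over.
            $block =~ s/($lookahead_regex)$//;
            $block = $lookahead . $block;
            $lookahead = $1;

            print $block;
            print "-" x 60, "\n";
        }

        __END__
        header1=val1
        header1b=val1b
        data1
        ==================
        header2: val2
        header2b: val2b
        data2
        ===============
        header3: val3
        header3b: val3b
        data3

        header4= val4
        header4b= val4b
        data4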

    Output:

      Thanks a lot for this code and sorry for the delayed feedback.

      I ran some tests today, and the code covers most of my needs. The only thing that fails is matching blocks on /^[=]+$/ (note that this regexp is not accepted for block matching). Both my $lookahead_regex = qr/\w+[=:]/; and my $lookahead_regex = qr/[=:][=:][=:]+/; fail.

      I can't manage to match the block separator as both 'headerX:' (works) AND '=========[=]+' (doesn't work for now).

      Any advice? I'll keep working on it over the next few days.

      Thanks a lot

        Which test case(s) failed? Please post them.

      Sorry, it's working great. The data file needed to be dos2unix-ed. Many thanks for the code, nearly a perfect shot :-)
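
      For anyone hitting the same thing: with DOS line endings every line ends in "\r\n", so a separator like qr/\n=*\n.../ never matches, because the "\r" sits between the run of equal signs and the "\n". A minimal sketch of doing the dos2unix step inside Perl, assuming the input fits in a string (normalize_eol is a hypothetical helper name, and the sample data is made up):

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: convert DOS (CRLF) line endings to Unix (LF).
      sub normalize_eol {
          my ($text) = @_;
          $text =~ s/\r\n/\n/g;
          return $text;
      }

      # Separator line of '=' (possibly empty) followed by a header.
      my $separator = qr/\n=*\n(?=\w+[=:])/;
      my $dos_text  = "header1=val1\r\ndata1\r\n=====\r\nheader2: val2\r\ndata2\r\n";

      # Without normalization the separator never matches: one big chunk.
      my @raw   = split $separator, $dos_text;
      # After normalization the text splits into two blocks.
      my @fixed = split $separator, normalize_eol($dos_text);

      printf "before: %d block(s), after: %d block(s)\n", scalar @raw, scalar @fixed;
      ```

      Another option is to write \r?\n wherever the separator regexp has \n, so the same pattern accepts both kinds of line endings.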

      I still need to find out whether anything can be optimized to handle multiple big files, or whether I should move to multithreading.
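
      For big files, one possible direction is to read line by line and keep only the current block in memory. This is a rough core-Perl sketch, not a File::Stream feature: it assumes a block ends at a line of equal signs, or at a blank line followed by a header line, and each_block and its callback are hypothetical names. It only addresses memory use, not multithreading.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: read $fh line by line and call $callback with each
      # complete block, so only one block is held in memory at a time.
      sub each_block {
          my ($fh, $callback) = @_;
          my $block   = '';
          my $pending = '';    # blank lines that may or may not end the block

          while (my $line = <$fh>) {
              $line =~ s/\r?\n\z/\n/;              # tolerate DOS line endings
              if ($line =~ /^=+$/) {               # explicit separator line
                  $callback->($block) if length $block;
                  ($block, $pending) = ('', '');
              }
              elsif ($line eq "\n") {              # blank line: decide on next line
                  $pending .= $line;
              }
              else {
                  if (length $pending) {
                      if ($line =~ /^\w+[=:]/) {   # header follows: block ended
                          $callback->($block) if length $block;
                          $block = '';
                      }
                      else {                       # blank lines were part of the data
                          $block .= $pending;
                      }
                      $pending = '';
                  }
                  $block .= $line;
              }
          }
          $callback->($block) if length $block;    # last block in the file
      }

      # Usage with an in-memory filehandle standing in for a real file.
      my $data = "header1=val1\ndata1\n=====\nheader2: val2\ndata2\n\nheader3= val3\ndata3\n";
      open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
      my @wanted;
      each_block($fh, sub { push @wanted, $_[0] if $_[0] =~ /^header[23]\b/m });
      close $fh;
      print scalar(@wanted), " matching block(s)\n";
      ```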
