Paragraph grep: request for testing, comments and feedbacks

by siberia-man (Beadle)
on Oct 04, 2017 at 18:30 UTC ( #1200667=perlmeditation )

Hello Monks, I came here for your critique, feedback and proposals for improvement. I have developed a simple script for grepping paragraphs (blocks of text lines delimited by a specific separator; blank lines, by default).

The common use case is parsing Java log entries that span multiple lines:
    paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME
Another use case is filtering sections from ini files matching particular strings:
    paragrep -Pp '^\[' PATTERN FILENAME
Next I am going to improve pattern searching and add support for the -a/--and and -o/--or options to control how multiple patterns are combined. With this message I ask you to test the script and point out possible weaknesses in performance and efficiency.

The original and actual code is hosted on github -- https://github.com/ildar-shaimordanov/perl-utils
Here is the latest version of the script as of this posting:
#!/usr/bin/env perl

=head1 NAME

paragrep - grep-like filter for searching matches in paragraphs

=head1 SYNOPSIS

    paragrep --help
    paragrep OPTIONS

=head1 DESCRIPTION

paragrep assumes the input consists of paragraphs and prints the 
paragraphs matching a pattern. A paragraph is a block of text delimited 
by empty or blank lines. 

=head1 OPTIONS

=head2 Generic Program Information

=over 4

=item B<-h>, B<--help>

Print this help message and exit.

=item B<--version>

Print the program version and exit.

=item B<--debug>

Print debug information to STDERR.

=back

=head2 Paragraph Matching Control

=over 4

=item B<-p> I<PATTERN>, B<--break-of-paragraph=>I<PATTERN>

Use I<PATTERN> as the pattern to identify the break of paragraphs. By 
default, this value is C<^\s*$>. The break of paragraphs is considered as 
a separator and excluded from the output.

=item B<-P>, B<--begin-of-paragraph>

If this option is specified on the command line, the pattern given with 
B<-p> identifies the first line of a paragraph instead, and that line is 
kept as part of the paragraph.

=back

=head2 Matching Control

=over 4

=item B<-e> I<PATTERN>, B<--regexp=>I<PATTERN>

Use I<PATTERN> as the pattern. This can be used to specify multiple search 
patterns, or to protect a pattern beginning with a hyphen (I<->). 

This option can be specified multiple times or omitted for brevity. 

=item B<-i>, B<--ignore-case>

Ignore case distinctions in both the I<PATTERN> and the input files. 

=item B<-v>, B<--invert-match>

Invert the sense of matching, to select non-matching paragraphs.

=item B<-w>, B<--word-regexp>

Select only those paragraphs containing matches that form whole words. The 
test is that the matching substring must either be at the beginning of a 
line of the paragraph, or preceded by a non-word constituent character. 
Similarly, it must be either at the end of a line of the paragraph or 
followed by a non-word constituent character. Word-constituent characters 
are letters, digits, and the underscore. 

=back

=head1 EXAMPLES

The following example demonstrates the customized paragraph definition for 
parsing log files. Usually, applications producing log files write one log 
entry per line. Sometimes applications (especially those written in Java) 
produce multiline log entries. Each log entry begins with the timestamp in 
the generalized form C<date time>, which can be covered by the pattern 
C<\d+/\d+/\d+ \d+:\d+:\d+> regardless of which date format has been used 
to output dates:

    paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME

=head1 SEE ALSO

grep(1)

perlre(1)

=head1 COPYRIGHT

Copyright 2017 Ildar Shaimordanov E<lt>F<ildar.shaimordanov@gmail.com>E<gt>

This program is free software; you can redistribute it and/or modify it 
under the same terms as Perl itself.

=cut

# =========================================================================

use strict;
use warnings;

no warnings "utf8";
use open qw( :std :utf8 );

use Pod::Usage;
use Getopt::Long qw( :config no_ignore_case bundling auto_version );

our $VERSION = "0.2";

my $debug = 0;
my $verbose = 0;

my $break_of_para = '^\\s*$';
my $begin_of_para = 0;

my $ignore_case = 0;
my $invert_match = 0;
my $word_regexp = 0;

my @patterns = ();
my $match_pattern;

my @globs = ();
my @files = ();

# =========================================================================

pod2usage unless GetOptions(
	"h|help" => sub {
		pod2usage({
			-verbose => 2, 
			-noperldoc => 0, 
		});
	}, 

	"debug" => \$debug, 

	"p|break-of-paragraph=s" => \$break_of_para, 
	"P|begin-of-paragraph" => \$begin_of_para, 

	"e|regexp=s" => \@patterns, 

	"i|ignore-case" => \$ignore_case, 
	"v|invert-match" => \$invert_match, 
	"w|word-regexp" => \$word_regexp, 

	"<>" => sub {
		push @globs, $_[0];
	}, 
);

# =========================================================================

sub validate_re {
	my ( $v, $k, $ignore_case, $word_regexp ) = ( shift, shift || "<anon>", shift, shift );
	$v = "\\b($v)\\b" if $word_regexp;
	my $re = eval { $ignore_case ? qr/$v/im : qr/$v/m };
	die "Bad regexp: $k = $v\n" if $@;
	$re;
}

# If no patterns, assume the first item of the list is the pattern
push @patterns, shift @globs if ! @patterns && @globs;

# Validate all the patterns before combining into the single one
pod2usage unless @patterns;
validate_re $_, "pattern", $ignore_case for @patterns;

# Combine all patterns into the single pattern
$match_pattern = validate_re join("|", @patterns), "", $ignore_case, $word_regexp;

# Expand filename patterns
@files = map { glob } @globs;

# If the list of files is empty, assume reading from STDIN
push @files, "-" unless @files;

# Validate and setup the pattern identifying paragraphs
$break_of_para = validate_re $break_of_para, "break-of-paragraph";

# =========================================================================

warn <<DATA if $debug;
PARAGRAPH MATCHING CONTROL
    break-of-paragraph = $break_of_para
    begin-of-paragraph = $begin_of_para

MATCHING CONTROL
    match-pattern = $match_pattern
    invert-match  = $invert_match

FILES
    @files
DATA

# =========================================================================

my $para;

sub print_para {
	print $para if defined $para && ( $para =~ m/$match_pattern/ ^ $invert_match );
	$para = "";
}

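sub grep_file {
	my $file = shift;

	# Use a lexical filehandle and three-argument open so that file names
	# containing characters such as "<", ">" or "|" are never interpreted
	# as open modes or commands.
	my $fh;
	if ( $file eq "-" ) {
		$fh = \*STDIN;
	} else {
		if ( -d $file ) {
			warn "Not a file: $file\n";
			return;
		}
		open $fh, "<", $file or do {
			warn "Unable to read file: $file: $!\n";
			return;
		};
	}

	while ( <$fh> ) {
		if ( m/$break_of_para/ ) {
			print_para;
			next unless $begin_of_para;
		}
		$para .= $_;
	}

	# Flush the last paragraph; test the length so that a paragraph
	# consisting of "0" is not dropped.
	print_para if defined $para && length $para;

	close $fh unless $file eq "-";
}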

# =========================================================================

grep_file $_ foreach ( @files );

# =========================================================================

# EOF
Thank you

Replies are listed 'Best First'.
Re: Paragraph grep: request for testing, comments and feedbacks
by hippo (Canon) on Oct 05, 2017 at 10:39 UTC
    The original and actual code is hosted on github (It's not permitted to post external links but you can search for ildar-shaimordanov/perl-utils)

    Actually, posting external links is fine in general. What is frowned upon is refusing to post code here and instead saying "You can see my code at http://www.geocities.com/..." because over time the externally linked code may degrade or vanish and the resultant thread is then rather moot. Since you've posted your script here as it stands I fail to see how anyone could also object to linking to your github repo, especially when the README there links back to the monastery.

    Your script looks in pretty good shape to me from a cursory inspection. I'll be pleased to test it out when I have some time.

      A couple of minutes ago I was able to update my post, adding the link to the GitHub repository ahead of the source code. I think it is better if someone looking for similar functionality finds the link to the current version next to the initial code published here.
Re: Paragraph grep
by Anonymous Monk on Oct 04, 2017 at 18:50 UTC
    $/="" and -00
    $ cat input.txt
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean quis elit tempus, hendrerit sem a, maximus urna. Aenean vitae est at risus fringilla egestas vitae in lacus.

    In a metus vel elit varius rhoncus. Morbi at sem euismod, tincidunt nunc quis, maximus quam. Sed maximus nibh vel suscipit ullamcorper. Mauris sed ex ut nulla accumsan feugiat. Donec sit amet sapien laoreet mauris sodales scelerisque. Aliquam varius diam sit amet mollis iaculis. Quisque vel neque auctor, feugiat velit eleifend, ultrices nunc. Vivamus condimentum metus quis nunc tincidunt lobortis. Fusce a dolor sed tellus condimentum vulputate.

    Proin ac tortor ut metus mattis gravida. Ut quis orci ornare, aliquet dolor id, commodo justo.
    $ perl -ln00e '/sed/i and print' input.txt
    In a metus vel elit varius rhoncus. Morbi at sem euismod, tincidunt nunc quis, maximus quam. Sed maximus nibh vel suscipit ullamcorper. Mauris sed ex ut nulla accumsan feugiat. Donec sit amet sapien laoreet mauris sodales scelerisque. Aliquam varius diam sit amet mollis iaculis. Quisque vel neque auctor, feugiat velit eleifend, ultrices nunc. Vivamus condimentum metus quis nunc tincidunt lobortis. Fusce a dolor sed tellus condimentum vulputate.
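
For readers who have not met paragraph mode, here is a minimal sketch of what the one-liner above does inside a script. The sub name grep_paragraphs and the sample text are made up for illustration; only the $/ = "" behaviour comes from Perl itself. Note that paragraph mode only recognises truly empty lines as separators, so a line holding just spaces does not end a paragraph.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the paragraphs of $text that match $re.
# grep_paragraphs is a hypothetical helper, not part of paragrep.
sub grep_paragraphs {
    my ( $text, $re ) = @_;

    # Read from an in-memory filehandle to keep the sketch self-contained.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    # Paragraph mode: each read returns one block of text terminated
    # by one or more empty lines.
    local $/ = "";

    my @found;
    while ( my $para = <$fh> ) {
        chomp $para;    # in paragraph mode this strips all trailing newlines
        push @found, $para if $para =~ $re;
    }
    close $fh;
    return @found;
}

my $text = "alpha one\nalpha two\n\n\nbeta sed\nbeta two\n\ngamma\n";
print "$_\n--\n" for grep_paragraphs( $text, qr/sed/i );
```

With the sample text only the middle paragraph (the two "beta" lines) is printed.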
      Thanks for your comment. I know these options. But they don't solve the task of parsing log files. Most probably, I haven't been very specific and some explanations are required. A log file could be:
      2017-09-04 22:02:14.123 INFO: Some log message having param1=value1
      2017-09-04 22:02:14.349 DEBUG: Multiline log entry
      Some extended logging: debug {
          param1 value1
          param2 value2
      }
      2017-09-04 22:02:14.658 INFO: Another log message param2=value2
      If we need all entries containing some specific string (let's say value1), it is difficult to parse the file with -00. That's why I (re)invented the wheel. :)
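
The idea behind -P can be sketched in a few lines: a line matching the begin pattern opens a new entry, and every other line is appended to the current one. This is a simplified illustration, not the script's actual code; the sub name split_entries is hypothetical, the line breaks inside the DEBUG entry are assumed, and the pattern uses dashes to fit this sample log rather than the slashes shown in the original post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Group lines into entries. A line matching $begin starts a new entry;
# any other line is a continuation of the previous entry.
# split_entries is a hypothetical helper used only for this sketch.
sub split_entries {
    my ( $begin, @lines ) = @_;
    my @entries;
    for my $line (@lines) {
        if ( $line =~ $begin || !@entries ) {
            push @entries, $line;
        } else {
            $entries[-1] .= $line;
        }
    }
    return @entries;
}

# Sample log lines (layout of the multiline entry is an assumption).
my @log = (
    "2017-09-04 22:02:14.123 INFO: Some log message having param1=value1\n",
    "2017-09-04 22:02:14.349 DEBUG: Multiline log entry\n",
    "Some extended logging: debug {\n",
    "    param1 value1\n",
    "    param2 value2\n",
    "}\n",
    "2017-09-04 22:02:14.658 INFO: Another log message param2=value2\n",
);

# Timestamp pattern adapted to the dashed dates in this sample.
my $begin = qr/^\d+-\d+-\d+ \d+:\d+:\d+/;

# Print every whole entry that mentions value1.
print grep { /value1/ } split_entries( $begin, @log );
```

This prints the first INFO entry and the complete DEBUG entry, and skips the last INFO entry, which only mentions value2.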
Re: Paragraph grep: request for testing, comments and feedbacks
by siberia-man (Beadle) on Nov 27, 2017 at 00:09 UTC
    Continuing this thread, I am happy to say that I have improved and extended the script. The new options --file=FILE, --or and --and ship with the new version. According to the script's documentation they work as follows:

    -f FILE, --file=FILE

    Obtain patterns from FILE, one per line.

    -A, --and, -O, --or

    These options specify whether multiple search patterns given with the -e option should be logically ANDed together or logically ORed together. If neither is specified, the patterns are ORed. These options can be used to simplify commands that search for matches to multiple patterns. More than one of them can be specified, but only the last one takes effect.

    The following example shows how the combining option simplifies usage. The resulting output will consist of the paragraphs matching both PATTERN1 and PATTERN2.
    cat FILENAME | paragrep -e PATTERN1 -e PATTERN2 -A
    cat FILENAME | paragrep -e PATTERN1 | paragrep -e PATTERN2
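
The AND/OR semantics described above can be sketched as follows. This is only an illustration of the logic, not the code of the new version (which is not shown in this thread); the sub name para_matches and the 'and'/'or' mode strings are assumptions made for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return true if $para satisfies the patterns under the given mode:
# 'and' requires every pattern to match, anything else requires at least one.
# para_matches is a hypothetical helper used only for this sketch.
sub para_matches {
    my ( $mode, $para, @res ) = @_;

    # Count how many of the patterns match the paragraph.
    my $hits = grep { $para =~ $_ } @res;

    return $mode eq 'and' ? $hits == @res : $hits > 0;
}

my @res = ( qr/PATTERN1/, qr/PATTERN2/ );
for my $para ( "PATTERN1 only\n", "PATTERN1 and PATTERN2\n" ) {
    print $para if para_matches( 'and', $para, @res );
}
```

Only the second sample paragraph is printed, which is the same result the two-stage pipeline above would give.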
    Meditations are welcome, Monks :)
