PerlMonks  

Help with a regular expression for file name parsing

by bontchev (Sexton)
on Dec 07, 2011 at 06:51 UTC (#942167)
bontchev has asked for the wisdom of the Perl Monks concerning the following question:

Hello illuminated ones,

I'm trying to parse a text file that may contain lines like

@include filename

There might be stuff before the "@include" and after the "filename"; that's not a problem. The problem is how to fetch the file name. I cannot rely on it being a single word: it could contain spaces, in which case it would be surrounded by single or double quotes, or it could contain escape sequences. Here are a few examples:

#some "random stuff" @include "some file" did you parse that?
#more 'random' stuff @include 'another file' you sure?
#and more random stuff @include yet\ another\ file positive?

Could you suggest some kind of clever regular expression I could use to fetch the file names ("some file", "another file" and "yet another file" in the above examples)? Thanks in advance.

Re: Help with a regular expression for file name parsing
by BrowserUk (Pope) on Dec 07, 2011 at 07:11 UTC

    This works with the samples supplied:

    print $data;;
    #some "random stuff" @include "some file" did you parse that?
    #more 'random' stuff @include 'another file' you sure?
    #and more random stuff @include yet\ another\ file positive?

    print for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;;
    "some file"
    'another file'
    yet\ another\ file

    Spreading that out a bit:

    m[
        \@include \s        ## the introducer followed by a space
        (                   ## capture
            '[^']+'         ## A single quoted string with no embedded single quotes
            |               ## or
            "[^"]+"         ## a double quoted string with no embedded double quotes
            |               ## or
            .+? (?<!\\)     ## a min length string that ends in a space that isn't escaped
        )
        \s
    ]gx;;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Your regular expression works, but the code is rather a muddle. Here's a version that he can use to test with:

      $data = join '', <DATA>;
      print "$_\n" for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;
      __DATA__
      #some "random stuff" @include "some file" did you parse that?
      #more 'random' stuff @include 'another file' you sure?
      #and more random stuff @include yet\ another\ file positive?
        When tested with this version, the output is just 1

      I am sorry, but I can't make sense of your answer. :-( If the part marked as "code" is supposed to be a script that works - well, it doesn't; it just produces a bunch of errors.

      But let's concentrate just on the regular expression, because this is what I asked for. Sadly, that doesn't work, either. :-(

      Let's start with something easy:

      my $data = "\@include test";
      if ($data =~ /\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s+/g) {
          print "File name: \"$1\"\n";
      }

      This doesn't output anything at all, meaning that the parsing fails.

      If we set

      my $data = "\@include \'test test\'";

      this outputs

      File name: "'test"

      which is totally wrong. It should output

      File name: "test test"

      If we try

      my $data = "\@include \"test test\"";

      this produces the similarly wrong

      File name: ""test"

      And finally, if we try

      my $data = "\@include test\\ test";

      it also produces no output, meaning that the matching fails.

      Any better suggestions?
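
A note on why these four tests behave this way: each test string ends right after the file name, while the pattern demands whitespace (`\s+`) after the capture. The quoted alternatives therefore cannot succeed, and the lazy `.+?` branch settles for the text up to the first unescaped space, if there is one. A minimal sketch, assuming the file name may also be the last thing in the string, replaces the trailing `\s+` with a lookahead that accepts either whitespace or end of string. The variable names and the loop are illustrative only.

```perl
use strict;
use warnings;

# Same three alternatives as in the tests above; only the tail differs:
# (?=\s|\z) accepts whitespace OR end of string without consuming it.
my $re = qr/\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))(?=\s|\z)/;

my @tests = (
    "\@include test",
    "\@include 'test test'",
    "\@include \"test test\"",
    "\@include test\\ test",
);

for my $data (@tests) {
    if ( $data =~ $re ) {
        print "File name: <$1>\n";
    }
    else {
        print "No match: <$data>\n";
    }
}
```

With this change all four strings match, but the capture still includes the surrounding quotes and the escaping backslashes; removing them is a separate step.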

        Any better suggestions?

        Learn to copy and paste better :) because the regex you're using isn't the same one BrowserUk posted.

        His regex works, despite him posting the code in the context of his REPL (Read Eval Print Loop), see RFC: IPerl - Interactive Perl ( read-eval-print loop ), Re^6: RFC: IPerl - Interactive Perl ( read-eval-print loop ) (x)

        I checked

        #!/usr/bin/perl --
        #~ 2011-12-07-04:10:56PDT by Anonymous Monk
        #~ perltidy -csc -otr -opr -ce -nibc -i=4
        use strict;
        use warnings;
        use autodie;    # dies if open/close... fail
        Main( @ARGV );
        exit( 0 );

        sub Main {
            if ( @_ == 2 ) {
                NotDemoMeaningfulName(@_);
            } else {
                Demo();
                print '#' x 33 ,"\n", Usage();
            }
        } ## end sub Main

        sub NotDemoMeaningfulName {
            my ( $inputFile, $outputFile ) = @_;
            open my ($inFh), '<', $inputFile;
            open my ($outFh), '>', $outputFile;
            while( defined( my $data = <$inFh>) ){
                print $outFh "$_\n"
                  for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;
                # /\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s+/g
            }
            close $inFh;
            close $outFh;
        } ## end sub NotDemoMeaningfulName

        sub Usage {
            <<"__USAGE__";
        $0
        $0 dataFile newDataFile
        __USAGE__
        } ## end sub Usage

        sub Demo {
            my ( $Input, $WantedOutput ) = DemoData();
            NotDemoMeaningfulName( \$Input, \my $Output );
            require Test::More;
            Test::More::is( $Output, $WantedOutput, ' NotDemoMeaningfulName Works Aas Designed' );
            Test::More::done_testing();
            print "\n$Output\n";
        } ## end sub Demo

        sub DemoData {
        #~ http://perlmonks...
            my $One = <<'__One__';
        @include test
        #some "random stuff" @include "some file" did you parse that?
        #more 'random' stuff @include 'another file' you sure?
        #and more random stuff @include yet\ another\ file positive?
        __One__
        #~ http://perlmonks...
            my $Two = <<'__Two__';
        test
        "some file"
        'another file'
        yet\ another\ file
        __Two__
            return $One, $Two;
        } ## end sub DemoData
        __END__
        $ perl pm.re.942167.pl
        ok 1 - NotDemoMeaningfulName Works Aas Designed
        1..1

        test
        "some file"
        'another file'
        yet\ another\ file

        #################################
        pm.re.942167.pl
        pm.re.942167.pl dataFile newDataFile

        Any better suggestions?

        For you, no. At least none that would be considered polite.



Re: Help with a regular expression for file name parsing
by Anonymous Monk on Dec 07, 2011 at 07:19 UTC

      and tested, though I had forgotten to escape a \ in [^\\s]

      #!/usr/bin/perl --
      #~ 2011-12-07-04:10:56PDT by Anonymous Monk
      #~ perltidy -csc -otr -opr -ce -nibc -i=4
      use strict;
      use warnings;
      use autodie;    # dies if open/close... fail
      Main( @ARGV );
      exit( 0 );

      sub Main {
          if ( @_ == 2 ) {
              NotDemoMeaningfulName(@_);
          } else {
              Demo();
              print '#' x 33 ,"\n", Usage();
          }
      } ## end sub Main

      sub NotDemoMeaningfulName {
          my ( $inputFile, $outputFile ) = @_;
          open my ($inFh), '<', $inputFile;
          open my ($outFh), '>', $outputFile;
          while( defined( my $data = <$inFh>) ){
              print $outFh "$_\n" for $data =~ m~
                  \@include
                  \s+
                  (
                      (?: '[^']*' )
                      |
                      (?: "[^"]*" )
                      |
                      (?:
                          (?:\\.)
                          |
                          [^\\\s]
                      )+
                  )
              ~xg;
              #~ for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;
              # /\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s+/g
          }
          close $inFh;
          close $outFh;
      } ## end sub NotDemoMeaningfulName

      sub Usage {
          <<"__USAGE__";
      $0
      $0 dataFile newDataFile
      __USAGE__
      } ## end sub Usage

      sub Demo {
          my ( $Input, $WantedOutput ) = DemoData();
          NotDemoMeaningfulName( \$Input, \my $Output );
          require Test::More;
          Test::More::is( $Output, $WantedOutput, ' NotDemoMeaningfulName Works Aas Designed' );
          Test::More::done_testing();
          print "\n$Output\n";
      } ## end sub Demo

      sub DemoData {
      #~ http://perlmonks...
          my $One = <<'__One__';
      @include test
      #some "random stuff" @include "some file" did you parse that?
      #more 'random' stuff @include 'another file' you sure?
      #and more random stuff @include yet\ another\ file positive?
      __One__
      #~ http://perlmonks...
          my $Two = <<'__Two__';
      test
      "some file"
      'another file'
      yet\ another\ file
      __Two__
          return $One, $Two;
      } ## end sub DemoData
      __END__
      $ perl pm.re.942167.pl
      ok 1 - NotDemoMeaningfulName Works Aas Designed
      1..1

      test
      "some file"
      'another file'
      yet\ another\ file

      #################################
      pm.re.942167.pl
      pm.re.942167.pl dataFile newDataFile

        and tested

        Not tested enough, I'm afraid. Your code doesn't handle properly even such trivial cases as

        @include file
Re: Help with a regular expression for file name parsing
by TJPride (Pilgrim) on Dec 07, 2011 at 14:12 UTC
    There are really two parts to this. The first is to match the three patterns; the second to eliminate the unwanted wrapper or backslash characters. I tried to figure out a regex that would do both at once, but it's either impossible or my knowledge of regex isn't up to the task. So I cheated.

    use strict;
    use warnings;

    my $data = join '', <DATA>;
    my $file;

    while ($data =~ m/\@include (".*?"|'.*?'|(?:[^\s\\]|\\ )+)/g) {
        $file = $1;
        $file =~ s/["'\\]+//g;
        print "$file\n";
    }

    __DATA__
    #some "random stuff" @include "some file" did you parse that?
    #more 'random' stuff @include 'another file' you sure?
    #and more random stuff @include yet\ another\ file positive?

    CAVEAT: Assumes that ", ', and \ will never appear within filenames themselves. If they can, this gets much more complex.
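
Matching and unwrapping can be combined to a degree. The following is only a sketch, assuming Perl 5.10 or later: a branch-reset group `(?|...)` lets every alternative capture into the same group, so the quotes are matched but left out of the capture. Backslashes still need a second pass, and this sketch drops them in all three forms, which is consistent with the caveat above but is an assumption. The helper name `include_names` is hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # branch reset and say need Perl 5.10+

# Return the file names of every include directive found in the text.
sub include_names {
    my ($text) = @_;
    my @names;
    while (
        $text =~ m{
            \@include \s+
            (?|                          # branch reset: one shared capture group
                "([^"]*)"                # double-quoted, quotes not captured
              | '([^']*)'                # single-quoted, quotes not captured
              | ((?:\\.|[^\\\s])+)       # bare word, backslash escapes allowed
            )
        }gx
      )
    {
        my $name = $1;
        $name =~ s/\\(.)/$1/g;           # drop the escaping backslash
        push @names, $name;
    }
    return @names;
}

my $data = join '',
  qq{#some "random stuff" \@include "some file" did you parse that?\n},
  qq{#more 'random' stuff \@include 'another file' you sure?\n},
  qq{#and more random stuff \@include yet\\ another\\ file positive?\n};

say for include_names($data);
```

On the three sample lines this prints the names without quotes or backslashes. Quoted names that themselves contain quotes or backslashes are outside what this sketch handles.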

      Thanks, you've been the most helpful one so far. Sadly, the above solution also doesn't solve the problem properly. However, I managed to combine it with another of the regular expressions that was proposed, plus some code for better resolving the escape sequences in the string, plus a better way of removing the quotes (only from the ends of the string - not from everywhere).

      Here is what I managed to come up with:

      use strict;
      use warnings;
      
      while (my $data = <DATA>)
      {
      	if ($data =~ /\@include/i)
      	{
      		$data =~ m/\@include\s+('^'+'|"^"+"|.+?(?<!\\))\s/gi;
      		my $fname = $1;
      		$fname =~ s/\\(rnt'"\\ )/"qq|\\$1|"/gee;
      		$fname =~ s/^"(.*)"$/$1/s or
      		$fname =~ s/^'(.*)'$/$1/s;
      		print "File name: <$fname>\n";
      	}
      }
      
      __DATA__
      #some "random stuff" @include 	"some file" did you parse that?
      #more 'random' stuff @include 'another file' you sure?
      #and more random stuff @include yet\ another\ file positive?
      #@Include file
      #	@include		"\"another one\""	hmmm...
      # some stuff

      The "if" is there because, as I've mentioned above, I have to do some other processing of the lines, too. This code mostly works although, as you say, it doesn't properly handle file names containing escaped quotes.

      Perhaps I should give up the idea of parsing this in some clever way and just process the part after the "@include" character-by-character?
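
Before falling back to a character-by-character scanner, it may be worth trying quoted alternatives that step over backslash-escaped characters, so that `\"` does not end a double-quoted name. This is a sketch under stated assumptions, not a full parser: a backslash is taken to escape any single following character, sequences such as `\n` and `\t` are not translated, and the sub name `parse_include` is made up for illustration.

```perl
use strict;
use warnings;

# Extract the file name following the include directive, or return undef.
# Assumption: inside quotes, a backslash escapes the next character,
# so an escaped quote does not end the name.
sub parse_include {
    my ($line) = @_;
    return unless $line =~ m{
        \@include \s+
        (?:
            " ( (?: [^"\\] | \\. )* ) "     # group 1: double-quoted
          | ' ( (?: [^'\\] | \\. )* ) '     # group 2: single-quoted
          |   ( (?: [^\\\s] | \\. )+ )      # group 3: bare word
        )
    }xi;
    my $name = defined $1 ? $1 : defined $2 ? $2 : $3;
    $name =~ s/\\(.)/$1/g;    # turn backslash-x into x for any character x
    return $name;
}

for my $line (
    q{#some "random stuff" @include "some file" did you parse that?},
    q{#and more random stuff @include yet\ another\ file positive?},
    q{#@Include file},
    q{# @include "\"another one\"" hmmm...},
    q{# some stuff},
  )
{
    my $name = parse_include($line);
    print defined $name ? "File name: <$name>\n" : "No include: $line\n";
}
```

The design choice here is to keep the three forms in separate capture groups, which makes it easy to treat escapes differently per form later if that turns out to be needed.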

        Sigh, the site mangled the code I posted. :-( I guess I've used the wrong tag. Let's try again:

        use strict;
        use warnings;

        while (my $data = <DATA>)
        {
            if ($data =~ /\@include/i)
            {
                $data =~ m/\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s/gi;
                my $fname = $1;
                $fname =~ s/\\([rnt'"\\ ])/"qq|\\$1|"/gee;
                $fname =~ s/^"(.*)"$/$1/s or
                $fname =~ s/^'(.*)'$/$1/s;
                print "File name: <$fname>\n";
            }
        }

        __DATA__
        #some "random stuff" @include "some file" did you parse that?
        #more 'random' stuff @include 'another file' you sure?
        #and more random stuff @include yet\ another\ file positive?
        #@Include file
        # @include "\"another one\"" hmmm...
        # some stuff
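
The double-eval substitution (`/gee`) above works because the capture is limited to a small character class, but it becomes fragile if that class is ever widened. One possible alternative, sketched here on the assumption that only the escapes listed in that character class need translating, is a plain lookup hash. The names `%unescape` and `unescape` are illustrative.

```perl
use strict;
use warnings;

# Map the character after a backslash to what it stands for.
# Assumption: only these escapes matter; anything else is left untouched.
my %unescape = (
    r    => "\r",
    n    => "\n",
    t    => "\t",
    q{'} => q{'},
    q{"} => q{"},
    "\\" => "\\",
    ' '  => ' ',
);

sub unescape {
    my ($s) = @_;
    $s =~ s/\\([rnt'"\\ ])/$unescape{$1}/g;
    return $s;
}

print unescape('yet\ another\ file'), "\n";    # yet another file
print unescape('\"another one\"'), "\n";       # "another one"
```

No string is evaluated as code, and adding or removing an escape is a one-line change to the hash plus the character class.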
