
Regexp and reading a file n-lines at time

by epimenidecretese (Acolyte)
on Feb 01, 2010 at 10:43 UTC (#820704=perlquestion)

epimenidecretese has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a cookbook in .txt format, and I'd like to tag the title, the ingredients and the procedure of each recipe in XML.

The text looks like this:

1.TITLE OF FIRST RECIPE
abstract
Ingredient 1.
Ingredient 2.
Ingredient n...
Procedure...
2.TITLE OF SECOND RECIPE
...

1- I tried putting the whole file into one string and then applying the regex to it, but if the pattern matches, the whole text gets printed and not just my line.

2- If I read the file line by line instead, I can't search for something like /[0-9]\. .*\n\n/, because I can't catch more than one \n.

3- Then I tried to create an array this way:

my @array=split (//,$linea);
but then I don't know how to handle it (I'm a Perl beginner and this is my first post).

Here is my latest attempt. All the \n are doubled, so I've replaced them with a single \n.

#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;

my $text_file="cookbook.txt";
my $output="output.txt";

open(my $INPUT,"<",$text_file ) || die "Error: $!\n";
open(my $OUT,">",$output) || die "Error: $!\n";

my $linea = do { local $/; <$INPUT> };
$linea=~ s/[\r\n]/\n/g;
$linea=~ s/\n\n/\n/g;

if ($linea=~ //) {
    print $OUT $linea;
}

close $OUT;
close $INPUT;
exit;

So I just want to catch (and print only what I've matched!) some text that spans more than one \n.

Thanks in advance, and sorry for my English.
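A minimal sketch of one way to print only the matched part of a slurped string: put parentheses around the part you want and print $1 rather than the whole variable. The sample text and the helper names split_recipes and title_of are assumptions made for illustration; the real cookbook layout may differ.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the slurped cookbook.
my $text = join "\n",
    "1.TITLE OF FIRST RECIPE", "abstract", "Ingredient 1.", "Procedure...",
    "2.TITLE OF SECOND RECIPE", "second abstract", "";

# Split the slurped text into one chunk per recipe. The lookahead keeps the
# numbered title line at the start of each chunk instead of consuming it.
sub split_recipes {
    my ($whole) = @_;
    return grep { length } split /^(?=\d+\.)/m, $whole;
}

# Capture only the title: the parentheses put the matched part into $1,
# so that part can be returned instead of the whole string.
sub title_of {
    my ($chunk) = @_;
    return $chunk =~ /^\d+\.(.*)$/m ? $1 : undef;
}

for my $chunk ( split_recipes($text) ) {
    print "Title: ", title_of($chunk), "\n";
}
```

Printing $linea after a successful match prints the whole string because a match does not modify the variable it was applied to; only the capture variables hold the matched pieces.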

Replies are listed 'Best First'.
Re: Regexp and reading a file n-lines at time
by ahmad (Hermit) on Feb 01, 2010 at 12:58 UTC

    This would work; you'll get a hash of arrays:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $HASH;
    my $Title;

    while ( <DATA> ) {
        chomp;
        next if ! $_;
        if ( /^\d+\.(.*)/sg ) {
            $Title = $1;
            $HASH->{$Title} = [];
            next;
        }
        if ( exists $HASH->{$Title} ) {
            push @{$HASH->{$Title}} , $_;
        }
    }

    print Dumper($HASH);

    __DATA__
    1.TITLE OF FIRST RECIPE
    abstract
    Recipe 1.
    Recipe 2.
    Recipe ...
    Procedure...
    2.TITLE OF SECOND RECIPE
    ...
      if ( /^\d+\.(.*)/sg ) {

      Your pattern is anchored at the beginning of the string, so the /g option is superfluous. The string in $_ contains only one line, so the use of the /s option is also superfluous.
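To illustrate the point about the two flags, here is a small sketch comparing the same pattern with and without them. The variable names and sample strings are made up for the comparison.

```perl
use strict;
use warnings;

my $line = "1.TITLE OF FIRST RECIPE";

# On a single chomped line the capture is the same with or without /s and /g.
my ($plain) = $line =~ /^\d+\.(.*)/;
my ($flags) = $line =~ /^\d+\.(.*)/sg;
print "same capture\n" if $plain eq $flags;

# /s only matters when the string has embedded newlines: it lets . match "\n".
my $two = "1.TITLE\nabstract";
my ($no_s)   = $two =~ /^\d+\.(.*)/;     # stops at the newline
my ($with_s) = $two =~ /^\d+\.(.*)/s;    # runs past the newline

print "without /s: [$no_s]\n";
print "with /s:    [$with_s]\n";
```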

      Yes, ahmad's solution works perfectly.

      Now I just have to understand it completely and tag the output.

      Could you give me a hint about where to look in the documentation to understand the code inside the while block?

      Thank you very much.

      I got interesting results with the hash of arrays, but I don't know how to access the individual titles and then each element of the arrays inside them.

      I think I'm working with anonymous hashes and arrays, right? In that case, how do I access an anonymous hash?

       $HASH->{$Title} = []; Is this just here to initialize an empty array?

      I'm looking for something like:

      my $abstract=shift @{$HASH->{$Title}->{???}}
      # here I need to access each single title in order to get the first element of each (anonymous) hash
      print "<abstract>"."$abstract"."</abstract>\n";
      # here I should print all the elements between the first and the last one of the array in each single hash
      my $procedure=pop @{$HASH->{$Title}}
      # same as before
      print "<procedure>"."$procedure"."</procedure>\n";
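On the questions above: `[]` creates a reference to a new, empty anonymous array, so `$HASH->{$Title} = []` does initialize the list for that title. The perlreftut and perldsc manual pages cover this notation. What follows is a sketch of the dereferencing involved, with hypothetical data in the shape the parsing loop builds and a helper name (recipe_parts) invented for the example.

```perl
use strict;
use warnings;

# Hypothetical data: a hash reference whose values are array references.
my $HASH = {
    'TITLE OF FIRST RECIPE' =>
        [ 'abstract', 'Ingredient 1.', 'Ingredient 2.', 'Procedure...' ],
};

# Split one recipe's lines into abstract, ingredients and procedure.
# Works on a copy, so the array stored in the hash is left unchanged.
sub recipe_parts {
    my ($lines) = @_;               # $lines is an array reference
    my @copy      = @$lines;        # dereference the whole array
    my $abstract  = shift @copy;    # first element
    my $procedure = pop @copy;      # last element
    return ( $abstract, \@copy, $procedure );
}

for my $title ( sort keys %$HASH ) {    # %$HASH dereferences the hash
    my ( $abstract, $ingredients, $procedure ) =
        recipe_parts( $HASH->{$title} );
    print "$title\n";
    print "  first line: $HASH->{$title}[0]\n";    # one element by index
    print "  abstract:   $abstract\n";
    print "  ingredient: $_\n" for @$ingredients;
    print "  procedure:  $procedure\n";
}
```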

      One of Crete's own prophets has said it: 'Cretans are always liars, evil brutes, lazy gluttons'.
      He has surely told the truth.

        I finally got it!

        Here is my solution and it works perfectly:

        foreach $Title (keys %{$HASH}) {
            print $OUT "<recipe>".$Title."</recipe>\n";
            my $abstract=shift @{$HASH->{$Title}};
            print $OUT "<abstract>".$abstract."</abstract>\n";
            my $procedure=pop @{$HASH->{$Title}};
            foreach my $ingredient (@{$HASH->{$Title}}) {
                print $OUT "<ingredient>".$ingredient."</ingredient>\n";
            }
            print $OUT "<procedure>".$procedure."</procedure>\n";
        }
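One thing to watch with a loop over keys: Perl returns hash keys in no guaranteed order, so the recipes may come out in a different order from the cookbook. A possible sketch, assuming it is acceptable to keep a second array of titles in file order; @order, parse_lines and xml_escape are names invented here, and the escaping is an extra precaution that nobody in the thread asked for.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Build the hash of arrays and also remember the titles in file order.
sub parse_lines {
    my @lines = @_;
    my ( %recipes, @order, $title );
    for my $line (@lines) {
        next if $line eq '';
        if ( $line =~ /^\d+\.(.*)/ ) {
            $title = $1;
            $recipes{$title} = [];
            push @order, $title;
        }
        elsif ( defined $title ) {
            push @{ $recipes{$title} }, $line;
        }
    }
    return ( \%recipes, \@order );
}

my ( $recipes, $order ) = parse_lines(
    '1.FIRST',  'abstract one', 'salt',       'Stir.',
    '2.SECOND', 'abstract two', 'rum & cola', 'Mix.',
);

for my $title (@$order) {    # file order, not hash order
    my @body      = @{ $recipes->{$title} };
    my $abstract  = shift @body;
    my $procedure = pop @body;
    print "<recipe>", xml_escape($title), "</recipe>\n";
    print "<abstract>", xml_escape($abstract), "</abstract>\n";
    print "<ingredient>", xml_escape($_), "</ingredient>\n" for @body;
    print "<procedure>", xml_escape($procedure), "</procedure>\n";
}
```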

        Thank you very much, guys! I think I'm going to love this place.


Re: Regexp and reading a file n-lines at time
by BioLion (Curate) on Feb 01, 2010 at 12:26 UTC

    If I follow you correctly, you want to read in the raw text, XML-tag the components and print it back out?

    You have a fairly loose 'format', but if it looks like the example you posted, maybe there are a few workarounds:

    use strict;
    use warnings;

    ## buffer for holding each Recipe
    my @buffer;

    ## read data
    while(<DATA>){
        if (m/^\d+\.[^a-z]+$/){
            ## begins with a number and has no lower-case letters
            ## (\U...\E does not restrict case inside a pattern, so a-z is excluded explicitly)
            ## must be a title, so process what we already have
            process_buffer(\@buffer);
            ## reset buffer
            @buffer = ($_,);
        }
        else {
            push @buffer, $_;
        }
    }
    ## don't miss the last one!
    process_buffer(\@buffer);

    ### SUBS ###
    sub process_buffer{
        ## collect buffer
        my @buffer = @{$_[0]};
        ## need at least two lines...
        return 0 unless (scalar(@buffer) > 1);
        ## process it ...
    }

    __DATA__
    1.TITLE OF FIRST RECIPE
    abstract
    Recipe 1.
    Recipe 2.
    Recipe ...
    Procedure...

    I'll leave it up to you how to further break down the sections, but hopefully this is a start.
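As one possible way to fill in the processing step, here is a hypothetical body for process_buffer. It assumes the first line of the buffer is the title, the second is the abstract, the last is the procedure and everything in between is an ingredient. That layout is an assumption, not something the reply specifies, and the tag names simply follow those used elsewhere in the thread.

```perl
use strict;
use warnings;

# Hypothetical process_buffer: returns the tagged text for one recipe,
# or an empty string when the buffer holds fewer than two non-blank lines.
sub process_buffer {
    my @buffer = grep { /\S/ } @{ $_[0] };    # copy, dropping blank lines
    chomp @buffer;
    return '' unless @buffer > 1;

    my $title = shift @buffer;
    $title =~ s/^\d+\.\s*//;                  # strip the leading "1."
    my $abstract  = shift @buffer;
    my $procedure = @buffer ? pop @buffer : '';

    my $xml = "<recipe>$title</recipe>\n<abstract>$abstract</abstract>\n";
    $xml .= "<ingredient>$_</ingredient>\n" for @buffer;
    $xml .= "<procedure>$procedure</procedure>\n";
    return $xml;
}

print process_buffer(
    [ "1.TITLE OF FIRST RECIPE\n", "\n", "abstract\n",
      "Ingredient 1.\n", "Procedure...\n" ]
);
```

Returning the string rather than printing inside the sub keeps the choice of output filehandle with the caller.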

    Are you familiar with the various XML-handling modules on CPAN? (I am no expert, but XML::Twig seems popular here, and XML::Quick looks useful for you.)

    If you want more help on regexes, check out the perldoc tutorial perlretut.

    Hope this helps, keep us posted on your progress!

    Just a something something...

      I'm sorry, I made a big mistake in describing the input data.

      I had written Recipe 1. instead of Ingredient 1.

      Anyway, I've fixed it and I'm working on your tips. I think I will spend some time understanding your code. Thank you very much.

        I guess the main thing to understand is using the buffer to break the 'recipes' up so they can be processed individually. Super Search should find you other examples of this idiom in use. The buffer will contain the lines individually, but they can be joined up or whatever you like!

        There are also many Parse modules on CPAN, but I think your situation is probably specific and simple enough that you don't need to get tangled up in those! Anyway, good luck!

Re: Regexp and reading a file n-lines at time
by graff (Chancellor) on Feb 02, 2010 at 02:03 UTC
    If you are confident that all the blank lines are really blank (no spaces or tabs), and if the organization of each recipe really has the same sequence of elements every time, you can read in "paragraph mode" (see the description of the INPUT_RECORD_SEPARATOR variable $/ in perlvar):
    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "";  # empty string sets "paragraph mode": reads up-to/including blanks

    my @parts;
    my $xml_format = "<recipe>\n<title>%s</title>\n".
                     " <abstract>\n%s\n </abstract>\n".
                     " <ingredients>\n%s\n </ingredients>\n".
                     " <procedure>\n%s\n </procedure>\n</recipe>\n";

    print "<cookbook>\n";
    while (<DATA>) {
        s/\n+$//;  # trim off trailing line-breaks
        if ( /^\s*\d/ ) {  # record begins with a number: start of new recipe
            if ( @parts ) {  # print previous recipe if there was one
                printf( $xml_format, @parts );
                @parts = ();
            }
            push @parts, $_;
        }
        elsif ( @parts == 4 ) {  # we have title, abstract, ingredients and some procedure
            $parts[3] .= "\n\n$_";  # so just append this paragraph to procedure
        }
        else {  # this is either the abstract, ingredients or start of procedure
            push @parts, $_;
        }
    }
    printf( $xml_format, @parts ) if ( @parts );
    print "</cookbook>\n";

    __DATA__
    1.TITLE OF FIRST RECIPE

    abstract

    Ingredient 1.
    Ingredient 2.
    Ingredient n...

    Procedure...

    2.TITLE OF SECOND RECIPE

    second abstract

    rum
    cola
    ice

    Just mix it all in a glass, drink it and be happy.
    (updated opening sentence to make better sense; also removed an unnecessary array from the script)
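For anyone unfamiliar with paragraph mode, here is a small sketch that isolates just that behaviour. It reads from an in-memory filehandle so it runs without a cookbook file; the paragraphs helper and the sample string are assumptions for illustration. Note that a line holding only spaces or tabs does not count as blank here, which is why the caveat in the reply matters.

```perl
use strict;
use warnings;

# Return the paragraphs of a string, using paragraph mode on an
# in-memory filehandle.
sub paragraphs {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Cannot open in-memory handle: $!";
    local $/ = "";              # paragraph mode, restored when the sub returns
    my @paras = <$fh>;
    close $fh;
    s/\n+\z// for @paras;       # trim trailing line breaks from each record
    return @paras;
}

# Runs of several blank lines are treated as a single separator.
my @p = paragraphs("1.TITLE\n\nabstract\n\nrum\ncola\n\n\n\nMix it.\n");
print scalar(@p), " paragraphs\n";
print "[$_]\n" for @p;
```

Using `local $/` inside a sub limits the change to that call, whereas assigning to `$/` at file level affects every later read in the program.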
