PerlMonks  

Re: Insert newline

by Not_a_Number (Parson)
on Sep 14, 2011 at 18:50 UTC ( #925978=note )


in reply to Insert newline

This seems to work with your sample data (some additional test cases added):

use strict;
use warnings;
use 5.010;

my $pat1 = qr |(\d-\d\w{2}\.\w+)|;    # eg 3-123.7 2-3cd.e
my $pat2 = qr |([A-Z]\d{2}\.\d+)|;    # eg A12.34 Z56.78

my $title;

while ( <DATA> ) {
    chomp;
    if ( /(.+)(\(.+\))$/ ) {    # Better regex for 'title' lines??
        $title = "$1$2;";
    } else {
        next unless /$pat1|$pat2/;
        my @items = grep length, split /$pat1|$pat2/;
        say $title, splice @items, 0, 2 while @items;
    }
}

__DATA__
Titel Text (A12-3)
3-123.7 Just another (small) text 3-123.8 Some more text
A12.34 Another item B56.78 Yet another item
Another Titel Text (B23-9)
Some trash here
12-22a **This is trash***
1-22a.b Just another text 2-3cd.e Some more text
W12.34 Another item Z56.78 Yet another item
Z56.78 And another!! Z56.7a And another!!!
Some trash

Update: Cat walked on keyboard as I was posting. Please advise if you detect paw marks.


Re^2: Insert newline
by Anonymous Monk on Sep 15, 2011 at 14:48 UTC
    Thank you guys for your great help! I tested your advice on the "big file". More importantly, I seem to understand some of your code :-)
    The patterns
    /^(\d\-\d\w{2}(?:\.\w+)?|[A-Z]\d{2}(?:\.\d+)?)/
    must have the second part
    (?:\.\w+)? resp. (?:\.\d+)?)
    since some item IDs are just M32 or 6-317.
    It seems that I cannot use grep length ... and splice ... with the "?" part, since the item text gets cut into pieces. Perhaps I am missing something (I am a Perl novice); I only learned the grep length construction just now from reading your code.
    There are some "trash lines" with parenthesis so that the title line can be distinguished as it has parenthesis part at the end only.
    So far, thewebsi's code seems to work best with the file. There are still some trash lines in the output, but comparatively few, and they can be filtered out by data content.
    Thank you all again - for the code and for the class hour!
    VE

      The patterns (...) must have the second part

      (?:\.\w+)? resp. (?:\.\d+)?)

      That's easy, just add the 'second part':

      my $pat1 = qr '(\d-\d\w{2}(?:\.\w+)?)';
      my $pat2 = qr '([A-Z]\d{2}(?:\.\d+)?)';

      Concerning grep length, its purpose is simply to filter out empty (and undef) items from the list created by splitting the line on the /$pat1|$pat2/ regex.
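To make the effect of grep length visible, here is a minimal sketch that shows the raw list returned by split next to the filtered one. The sample line is made up for illustration, and the filter is written as `defined && length` so that it also runs without warnings on Perl versions before 5.12; it is meant to keep the same fields as the bare `grep length` used above.

```perl
use strict;
use warnings;

# Same patterns as in the post above.
my $pat1 = qr '(\d-\d\w{2}(?:\.\w+)?)';
my $pat2 = qr '([A-Z]\d{2}(?:\.\d+)?)';

# Made-up sample line with one ID of each kind.
my $line = '3-123.7 Some text A12.34 Another item';

# Because both patterns contain a capture group, split also returns the
# separators. The group that did not take part in a match comes back as
# undef, and a match at the very start of the line yields a leading ''.
my @raw = split /$pat1|$pat2/, $line;

# Keep only defined, non-empty fields.
my @items = grep { defined && length } @raw;

print scalar(@raw), " raw fields, ", scalar(@items), " kept\n";
print "[$_]\n" for @items;
```

For this sample line, split returns seven fields (one empty string and two undefs among them) and four survive the filter: ID, text, ID, text, which is the pairing that `splice @items, 0, 2` relies on.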

      hth, dave

      Update: Concerning the title line, you say that it 'can be distinguished as it has parenthesis part at the end only'. My tentative regex provides for this. But you also say 'There are some "trash lines" with parenthesis'. Well, what if these "trash lines" actually end with a parenthesised item, eg:

      Rabbit rabbit rabbit (rabbit!)

      What I meant by 'Better regex' was something to replace the second .+ by something that matches the ID code(?) of your titles. The only two examples you give are B23-9 and A12-3, so perhaps /[A-Z]\d{2}-\d/ would work. Otherwise, adjust accordingly.
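A small sketch of that suggestion, assuming every title ID really has the shape of the two examples (one capital letter, two digits, a hyphen, one digit). The name `is_title` is hypothetical, and the pattern would need adjusting if the real file contains other ID shapes.

```perl
use strict;
use warnings;

# Hypothetical stricter title pattern: some text, then an ID such as
# (A12-3) in parentheses at the very end of the line.
my $title_re = qr/^(.+)(\([A-Z]\d{2}-\d\))$/;

# True if the line looks like a title line under the assumption above.
sub is_title {
    my ($line) = @_;
    return $line =~ $title_re ? 1 : 0;
}

for my $line ( 'Titel Text (A12-3)', 'Rabbit rabbit rabbit (rabbit!)' ) {
    if ( $line =~ $title_re ) {
        print "title: $1$2;\n";
    }
    else {
        print "not a title: $line\n";
    }
}
```

With this pattern the rabbit line is no longer taken for a title, while the two sample titles still are.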

        hi Dave,
        thank you for your answer.
        I have just tried your advice again, on another computer and without the "big file", and the patterns with the "second part" work now (I added two lines with text without a dot to DATA below). Perhaps I made a typo a couple of hours earlier. My only change was to add \b to the patterns (see below), since otherwise they also matched 12-22a **This is trash***.
        I'll try this tomorrow again with the "big file" and report here. Thank you very much again.
        VE
        use strict;
        use warnings;
        use 5.010;

        my $pat1 = qr '(\b\d-\d\w{2}(?:\.\w+)?)';
        my $pat2 = qr '(\b[A-Z]\d{2}(?:\.\d+)?)';

        my $title;

        while ( <DATA> ) {
            chomp;
            if ( /(.+)(\(.+\))$/ ) {    # Better regex for 'title' lines??
                $title = "$1$2;";
            } else {
                next unless /$pat1|$pat2/;
                my @items = grep length, split /$pat1|$pat2/;
                say $title, splice @items, 0, 2 while @items;
            }
        }

        __DATA__
        Titel Text (A12-3)
        3-123.7 Just another (small) text 3-123.8 Some more text
        1-234 Text without dot 2-345 More text without dot
        A35 Another text without dot
        A12.34 Another item B56.78 Yet another item
        Another Titel Text (B23-9)
        Some trash here
        12-22a **This is trash***
        1-22a.b Just another text 2-3cd.e Some more text
        W12.34 Another item Z56.78 Yet another item
        Z56.78 And another!! Z56.7a And another!!!
        Some trash
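The effect of the added \b can be checked in isolation. The sketch below compares the first pattern with and without the word boundary on the trash line and on one of the new dot-less items; `first_id` is a hypothetical helper introduced only for this comparison.

```perl
use strict;
use warnings;
use 5.010;

my $loose   = qr '(\d-\d\w{2}(?:\.\w+)?)';      # pattern without \b
my $bounded = qr '(\b\d-\d\w{2}(?:\.\w+)?)';    # pattern with \b, as posted above

# Return the captured ID, or undef if the pattern does not match.
sub first_id {
    my ( $re, $text ) = @_;
    my ($id) = $text =~ $re;
    return $id;
}

for my $text ( '12-22a **This is trash***', '1-234 Text without dot' ) {
    say $text;
    say '  without \b: ', first_id( $loose,   $text ) // 'no match';
    say '  with \b:    ', first_id( $bounded, $text ) // 'no match';
}
```

Without \b the trash line matches from its second character (2-22a), because the optional dot part no longer has to be present. With \b there is no word boundary between the 1 and the 2, so the line is skipped, while 1-234 matches either way.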
        Still testing (now that I have finally found my password ...).

        Your pattern works fine; yesterday's paw marks were mine, not your cat's.

        I noticed, however, that some lines are not matched, and the cause seems to be UTF-8 characters in the text after the item ID. I do not yet understand how this could matter in this script, but if I delete these characters in the DATA snippet, the lines are matched.

        Since I am really a novice in Perl, I just made a workaround: I loaded the file into MS Word, replaced the characters and saved it as a .txt file again. It ran OK then.
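As a possible alternative to the round trip through MS Word, the following sketch lists the lines that contain non-ASCII characters together with their code points, so that they can be inspected before deciding what to replace. It assumes the input is UTF-8 encoded; the helper name `non_ascii_report` and the sample data are made up, and nothing in this thread establishes which characters actually caused the mismatches in the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one report string per input line that
# contains characters outside the ASCII range, listing their code points.
sub non_ascii_report {
    my ($fh) = @_;
    my @report;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @odd = $line =~ /([^\x00-\x7F])/g;
        next unless @odd;
        my $codes = join ' ', map { sprintf( 'U+%04X', ord $_ ) } @odd;
        push @report, sprintf( 'line %d: %s', $., $codes );
    }
    return @report;
}

# Assumed sample input as raw UTF-8 bytes: on the second line a
# no-break space (U+00A0) follows the ID instead of a normal space.
my $bytes = "1-234 plain text\n2-345\xC2\xA0text after a no-break space\n";

# Read the bytes through a UTF-8 decoding layer (in-memory filehandle).
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my @report = non_ascii_report($fh);
close $fh;

print "$_\n" for @report;
```

For a real file, the in-memory handle would be replaced by an open on the file name with the same `<:encoding(UTF-8)` layer.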

        I am still testing because this is a really big file with many thousands of table IDs and, correspondingly, a lot of item IDs.

        Many thanks! VE
