
Re^3: Insert newline

by Not_a_Number (Prior)
on Sep 15, 2011 at 18:07 UTC (#926204)

in reply to Re^2: Insert newline
in thread Insert newline

The patterns (...) must have the second part

(?:\.\w+)? resp. (?:\.\d+)?)

That's easy, just add the 'second part':

my $pat1 = qr '(\d-\d\w{2}(?:\.\w+)?)';
my $pat2 = qr '([A-Z]\d{2}(?:\.\d+)?)';

Concerning grep length, its purpose is simply to filter out empty (and undef) items from the list created by splitting the line on the /$pat1|$pat2/ regex.
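
To make that concrete, here is a minimal sketch of what split returns before and after the filter. The patterns are the ones above; the sample line in $line is made up for illustration and is not taken from your data.

```perl
use strict;
use warnings;

# Same patterns as above; the sample line is an assumption for illustration.
my $pat1 = qr '(\d-\d\w{2}(?:\.\w+)?)';
my $pat2 = qr '([A-Z]\d{2}(?:\.\d+)?)';
my $line = '3-123.7 First text A12.34 Second text';

# Captures in the split pattern are returned along with the fields.
# The group of the alternative that did not match comes back as undef,
# and a match at the very start of the line yields a leading empty string.
my @raw = split /$pat1|$pat2/, $line;

# grep length keeps only defined, non-empty entries.
my @items = grep length, @raw;

printf "raw: %d fields, kept: %d\n", scalar @raw, scalar @items;
print "[$_]\n" for @items;
```

What is left after the filter should be alternating ID and text entries, which is why the loop can take them off two at a time with splice.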

hth, dave

Update: Concerning the title line, you say that it 'can be distinguished as it has parenthesis part at the end only'. My tentative regex provides for this. But you also say 'There are some "trash lines" with parenthesis'. Well, what if one of these "trash lines" actually ends with a parenthesised item, e.g.:

Rabbit rabbit rabbit (rabbit!)

What I meant by 'Better regex' was something to replace the second .+ by something that matches the ID code(?) of your titles. The only two examples you give are B23-9 and A12-3, so perhaps /[A-Z]\d{2}-\d/ would work. Otherwise, adjust accordingly.
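
As a rough sketch of that idea (the name $title_re is hypothetical, and the ID shape is inferred from your two examples only), a stricter title test could look like this:

```perl
use strict;
use warnings;

# Hypothetical title pattern: free text, then an ID such as (A12-3) in
# parentheses at the end of the line. Adjust the ID part to your real data.
my $title_re = qr/^(.+)(\([A-Z]\d{2}-\d\))$/;

for my $line ( 'Titel Text (A12-3)', 'Rabbit rabbit rabbit (rabbit!)' ) {
    if ( $line =~ $title_re ) {
        print "title: $1$2;\n";
    }
    else {
        print "not a title: $line\n";
    }
}
```

With this, the rabbit line is no longer taken for a title, because (rabbit!) does not look like an ID code.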

Replies are listed 'Best First'.
Re^4: Insert newline
by Anonymous Monk on Sep 15, 2011 at 19:37 UTC
    hi Dave,
    thank you for your answer.
    I have just tried your advice again, on another computer and without the "big file", and the patterns with the "second part" work now (I added two lines with text without a dot to the DATA below). Perhaps I made a typo a couple of hours earlier. The only change I made was to add \b to the patterns (see below), since otherwise they matched 12-22a **This is trash*** too.
    I'll try this tomorrow again with the "big file" and report here. Thank you very much again.
    use strict;
    use warnings;
    use 5.010;

    my $pat1 = qr '(\b\d-\d\w{2}(?:\.\w+)?)';
    my $pat2 = qr '(\b[A-Z]\d{2}(?:\.\d+)?)';

    my $title;
    while ( <DATA> ) {
        chomp;
        if ( /(.+)(\(.+\))$/ ) { # Better regex for 'title' lines??
            $title = "$1$2;";
        }
        else {
            next unless /$pat1|$pat2/;
            my @items = grep length, split /$pat1|$pat2/;
            say $title, splice @items, 0, 2 while @items;
        }
    }

    __DATA__
    Titel Text (A12-3)
    3-123.7 Just another (small) text 3-123.8 Some more text
    1-234 Text without dot 2-345 More text without dot
    A35 Another text without dot
    A12.34 Another item B56.78 Yet another item
    Another Titel Text (B23-9)
    Some trash here
    12-22a **This is trash***
    1-22a.b Just another text 2-3cd.e Some more text
    W12.34 Another item Z56.78 Yet another item Z56.78 And another!! Z56.7a And another!!!
    Some trash
Re^4: Insert newline
by vagabonding electron (Chaplain) on Sep 16, 2011 at 13:22 UTC
    Still testing (now after finding my password at last ...).

    Your pattern works fine, it was my paw marks yesterday, not the ones of your cat.

    I noticed, however, that some lines are not matched, and the cause seems to be UTF-8 characters in the text after the item ID. I do not yet understand how this could matter in this script, but if I delete these characters in the DATA snippet, the lines are matched.

    Since I am really a novice in Perl I just made a workaround: I loaded the file into MS Word, replaced the characters and saved it as a text file again. It ran OK then.

    I am still testing because this is a really big file with many thousands of table IDs and, correspondingly, a lot of item IDs.

    Many thanks! VE
