Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

wiki regex reprocessing replacement

by LanX (Archbishop)
on Feb 15, 2020 at 14:39 UTC ( #11112988=perlquestion: print w/replies, xml ) Need Help??

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Task

I need a regex to transform wiki markup surrounding words to html, * to <b> etc.

my problem is that */_ could be combined at word boundaries, see the following example

DB<66> $_=$wiki; tf();tf();tf() ; print "'$wiki' \n=>\n'$_'" '_*one /two/*_ _*three /four/*_ _*five /six/*_' => '<u><b>one <i>two</i></b></u> <u><b>three /four/</b></u> <u><b>five <i +>six</i></b></u>' DB<67>
'_*one /two/*_ _*three /four/*_ _*five /six/*_'
=>
'one two three /four/ five six'

as you can see I have to run the tf() transformation thrice

DB<40> %h = ( '*'=>'b', '/' => 'i' , '_' => 'u' ) DB<59> sub tf { s{ $pre ([_*/]) (.*?) \2 $post}{$1<$h{$2}>$3</$h{$2} +>$4}xg } DB<62> $pre = qr/(^|\s|>)/ DB<63> $post = qr/($|\s|<)/ DB<65> $wiki='_*one /two/*_ _*three /four/*_ _*five /six/*_'

Question

Is there a way to make it a one-run transformation?

Trouble is that /g continues after the inserted replacement, here underline

I was experimenting with lookaround-assertions and \G and couldn't get it done.

Approaches

The only ways I can (theoretically) think of so far are

  • to loop over /g in scalar context while (s///g) { ... } and to manipulate pos
  • or to manipulate pos in an embedded Perl code (?{...})
  • to call tf() recursively in the /e evaled replacement part
NB: It's a more theoretical question because running tf() three times doesn't pose problems.

UPDATE:

I just noticed a bug, since four wasn't expanded.

&tf has to be better written with a lookbehind which doesn't consume the next whitespace

DB<90> sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h +{$2}>}xg }

I'll update an SSCCE soon.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re: wiki regex reprocessing replacement
by tybalt89 (Parson) on Feb 15, 2020 at 18:08 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11112988 use warnings; my $wiki = '_/one *two*/ th/ree_ null _/four *five*/ six_ null _/se_ven *eig +ht*/ nine_'; my $expected = '<u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></ +i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u>'; my %h = ( '*' => 'b' , '/' => 'i' , '_' => 'u' , ); my $html = $wiki =~ s{ (?:^|\s) \K ([*_/]+) | ([*_/]*) (?=$|\s) } { $1 ? $1 =~ s|.|<$h{$&}>|gr : $2 =~ s|.|</$h{$&}>|gr }gexr; print $html eq $expected ? "passed" : "FAILED", "\n\n"; print $wiki, "\n\n", $expected, "\n\n", $html, "\n";

    Outputs;

    passed _/one *two*/ th/ree_ null _/four *five*/ six_ null _/se_ven *eight*/ +nine_ <u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si +x</u> null <u><i>se_ven <b>eight</b></i> nine</u> <u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si +x</u> null <u><i>se_ven <b>eight</b></i> nine</u>
      ah yes \K not \G I keep confusing them.

      And I thought that $1 and $2 are read-only ... ah I see you use the /r flag.

      anyway, markup should be paired.

      my $wiki = '_one*'; my $expected = $wiki; $html = $wiki =~ s{ (?:^|\s) \K ([*_/]+) | ([*_/]*) (?=$|\s) } { $1 ? $1 =~ s|.|<$h{$&}>|gr : $2 =~ s|.|</$h{$&}>|gr }gexr; print $html eq $expected ? "passed" : "FAILED", "\n\n"; print $wiki, "\n\n", $expected, "\n\n", $html, "\n";

      FAILED _one* _one* <u>one</b>

      I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore

      Funny enough, the monastery fails too :)

      FAILED
      
      _one*
      
      _one*
      
      one  
      

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      ) yes I know, wasn't explicitely tested

        Let it fail, it should be obvious in the preview.

        hehehe

Re: wiki regex reprocessing replacement
by AnomalousMonk (Bishop) on Feb 16, 2020 at 07:36 UTC

    Here's my take. One thing I don't understand is the inclusion of  > < characters in the pre- and post-markup tag delimiters (update: e.g.,  my $pre  = qr/(^|\s|>)/; here), probably because I'm not familiar with wikisyntax. Can you link me to a discussion of the role of these characters? I prepared two versions, one using  (?(DEFINE) ...) and one based purely on  qr// interpolation. Maybe one is faster, but I haven't done any Benchmark-ing (nor am I likely to).


    Give a man a fish:  <%-{-{-{-<

      Wow, thanks :)

      And the test suite ++

      > One thing I don't understand is the inclusion of > < characters in the pre- and post-markup tag delimiters

      Because the repetitive solution with tf() needs to ignore previous runs.

      */_word_/* -> <b>/_word_/</b> -> <b><i>_word_</i></b> -> etc.  

      The recursive solution with rec() doesn't really need it, one of the reasons why I prefer this approach.

      > probably because I'm not familiar with wikisyntax.

      No you are not wrong, there was information missing.

      In this particular case the syntax is also meant to coexist with more verbose html tags.

      There are cases where one doesn't want to have a whitespace in between neighboring tags.

      Just compare Re^3: Good Intentions: Wikisyntax for the Monastery and the complaint about 'ARGV'<br> not expanding.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        You're welcome.

        Here are some test cases I've added since posting. I'd be very interested to hear your comments, especially as regards the "questionable" ones.

        '--- tests added 16feb20 after pm#11113014 post ---', '"failing" (i.e., no transformation) tests', [ '' => '', ], [ '*' => '*', ], [ '*_/' => '*_/', ], [ ' * _ / ' => ' * _ / ', ], [ '*fail/' => '*fail/', ], [ ' * fail / ' => ' * fail / ', ], 'possibly questionable transformations', [ '__' => '<u></u>', ], [ ' __ ' => ' <u></u> ', ], [ '__ __' => '<u></u> <u></u>', ], [ ' __ __ ' => ' <u></u> <u></u> ', ], [ '____' => '<u></u><u></u>', '???' ], [ ' ____ ' => ' <u></u><u></u> ', '???' ], [ '______' => '<u></u><u></u><u></u>', '???' ], [ ' ______ ' => ' <u></u><u></u><u></u> ', '???' ], [ '________' => '<u></u><u></u><u></u><u></u>', '???' ], [ ' ________ ' => ' <u></u><u></u><u></u><u></u> ', '???' ], [ '__ __ __ __' => '<u></u> <u></u> <u></u> <u></u>', ], [ ' __ __ __ __ ' => ' <u></u> <u></u> <u></u> <u></u> ', ],
        In this particular case the syntax is also meant to coexist with more verbose html tags.
        There are cases where one doesn't want to have a whitespace in between neighboring tags.
        Can you supply some test cases for variations, especially WRT intermixtures with standard HTML?


        Give a man a fish:  <%-{-{-{-<

Re: wiki regex reprocessing replacement
by LanX (Archbishop) on Feb 15, 2020 at 15:19 UTC
    > I'll update an SSCCE soon.

    here we go

    (update: please see Re: wiki regex reprocessing replacement (UPDATED^2) for better testcases including wrong markup)

    use strict; use warnings; use Data::Dump qw/pp dd/; use Test::More; my $wiki = '_/one *two*/ three_ null _/four *five*/ six_ null _/seven *eight +*/ nine_'; my $expected = '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i +> six</u> null <u><i>seven <b>eight</b></i> nine</u>'; my $pre = qr/(^|\s|>)/; my $post = qr/($|\s|<)/; my %h = ( '*' => 'b' , '/' => 'i' , '_' => 'u' , ); sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg +}; $_=$wiki; my $DBG = 1; diag "IN <= '$wiki'\n\n" if $DBG; for my $i (1..3) { tf(); diag "$i: '$_'\n\n" if $DBG; } is($_,$expected," repeated replace works"); done_testing;

    # IN <= '_/one *two*/ three_ null _/four *five*/ six_ null _/seven *e +ight*/ nine_' # # 1: '<u>/one *two*/ three</u> null <u>/four *five*/ six</u> null <u> +/seven *eight*/ nine</u>' # # 2: '<u><i>one *two*</i> three</u> null <u><i>four *five*</i> six</u> + null <u><i>seven *eight*</i> nine</u>' # # 3: '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></ +i> six</u> null <u><i>seven <b>eight</b></i> nine</u>' # ok 1 - repeated replace works 1..1
    # IN <= '_/one *two*/ three_ null _/four *five*/ six_  null _/seven *eight*/ nine_'
    # 
    # 1: '/one *two*/ three null /four *five*/ six  null /seven *eight*/ nine'
    # 
    # 2: 'one *two* three null four *five* six  null seven *eight* nine'
    # 
    # 3: 'one two three null four five six  null seven eight nine'
    # 
    ok 1 -  repeated replace works
    1..1
    

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Here's a non-recursive way which I think fits your criteria:

      use strict; use warnings; use Test::More; my $wiki = '_/one *two*/ three_ null _/four *five*/ six_ null _/seven *eight +*/ nine_'; my $expected = '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i +> six</u> null <u><i>seven <b>eight</b></i> nine</u>'; my %h = ( '*' => 'b' , '/' => 'i' , '_' => 'u' , ); my $DBG = 1; sub flip { my $s = shift; my $z = $h{$s}; $h{$s} = $z =~ /\// ? substr ($z, 1, 1) : "/$z"; return "<$z>"; } sub tf { diag "Pre: '$_'\n\n" if $DBG; s{([_*/])}{flip($1)}eg }; $_ = $wiki; diag "IN <= '$wiki'\n\n" if $DBG; tf(); is ($_, $expected, " repeated replace works"); done_testing;
        Many thanks, :)

        ... but ...

        The testsuite should have also included markup which must not be replaced

        My fault sorry, I thought it's obvious by the $pre and $post regex.

        The markup must come in pairs and be embraced by special word boundaries.

        (whitespace or other markup or tag-brackets or ... depending on pre/post)

        Hence a _ inside a word is forbidden, which makes sense for joined_identifiers .

        I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: wiki regex reprocessing replacement (UPDATED^2)
by LanX (Archbishop) on Feb 15, 2020 at 16:55 UTC
    UPDATES:

    Expanded test case with wrong markup to be ignored

    • markup inside word
    • wrong pairs like *fail_

    > to call tf() recursively in the /e evaled replacement part

    seems to work well, see sub rec

    use strict; use warnings; use Data::Dump qw/pp dd/; use Test::More; my $wiki = '_/one *two*/ th/ree_ null _/f*ur *five*/ six_ null _/se_ven *eig +ht*/ nine_ *fail_'; my $expected = '<u><i>one <b>two</b></i> th/ree</u> null <u><i>f*ur <b>five</b></ +i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u> *fail_'; my $pre = qr/(^|\s|>)/; my $post = qr/($|\s|<)/; my %h = ( '*' => 'b' , '/' => 'i' , '_' => 'u' , ); sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg +}; $_=$wiki; my $DBG = 0; diag "IN <= '$wiki'\n\n" if $DBG; for my $i (1..3) { tf(); diag "$i: '$_'\n\n" if $DBG; } is($_,$expected,"repeated replace"); my $rec_level=0; sub rec { my ($txt) = @_; my $DBG = 1; diag ++$rec_level ."< $txt" if $DBG; $txt =~ s{ $pre ([_*/]) (.*?) \2 (?=$post) }{ my $tag = $h{$2}; "$1<$tag>" . rec($3). "</$tag>" }xge; diag $rec_level-- . "> $txt\n" if $DBG; return $txt; } my $got_rec = rec($wiki); is($got_rec,$expected,"recursive replace"); done_testing;

    ok 1 - repeated replace
    # 1< _/one *two*/ th/ree_ null _/f*ur *five*/ six_  null _/se_ven *eight*/ nine_ *fail_
    # 2< /one *two*/ th/ree
    # 3< one *two*
    # 4< two
    # 4> two
    # 3> one two
    # 2> one two th/ree
    # 2< /f*ur *five*/ six
    # 3< f*ur *five*
    # 4< five
    # 4> five
    # 3> f*ur five
    # 2> f*ur five six
    # 2< /se_ven *eight*/ nine
    # 3< se_ven *eight*
    # 4< eight
    # 4> eight
    # 3> se_ven eight
    # 2> se_ven eight nine
    # 1> one two th/ree null f*ur five six  null se_ven eight nine *fail_
    ok 2 - recursive replace
    1..2
    

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11112988]
Approved by marto
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (7)
As of 2020-03-28 18:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    To "Disagree to disagree" means to:









    Results (167 votes). Check out past polls.

    Notices?