PerlMonks  

Need a regex to replace incomplete html entities

by Chris Daniel (Novice)
on Nov 20, 2016 at 10:30 UTC (#1176192=perlquestion)

Chris Daniel has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I just need some help with a regex. I have an XML file, but unfortunately some of its tags contain incomplete HTML entities. I am looking for a regex that would remove the incomplete HTML entities.

For example: <Remarks>1 SW PLT SLAC 6 PCS &#3</Remarks>

In the above example, &#3 is an incomplete form of the HTML entity &#38;, so I need to remove such incomplete data.

Expected output:

<Remarks>1 SW PLT SLAC 6 PCS </Remarks>
I have tried using the sed command, but it removes the text whenever the pattern matches, complete entities included.

I basically want to replace the strings &, &#, &#3 and &#38 with blank, but it should not replace &#38;

Please help me.

Replies are listed 'Best First'.
Re: Need a regex to replace incomplete html entities
by haukex (Chancellor) on Nov 20, 2016 at 18:10 UTC
      > are they related?

      That's obviously the same person.

      IIRC it's official policy here to block multiple accounts.

      update

      ... or at least strike voting rights (?)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        If you're so certain that the two accounts are run by the same person, maybe you care to share the additional deep knowledge you have?

        Another, very trivial explanation could be that two different persons got the same homework tasks, or that this task has become a common hiring question, or that some common "data provider" (say, Wikipedia, or Amazon or Wordpress) produces such broken HTML since the release of a new version and that they are trying to solve the same problem.

        Hi LanX,

        IIRC it's official policy here to block multiple accounts.

        Not sure if it's up-to-date, but Site Rules Governing User Accounts appears to document the policy (multiple accounts are ok but only one can vote).

        Regards,
        -- Hauke D

Re: Need a regex to replace incomplete html entities
by tybalt89 (Parson) on Nov 20, 2016 at 15:49 UTC
    echo 'one & two &# three &#3 four &#38 five &#38; six' | perl -pe 's/&(#(3(8(;\K)?)?)?)?//g'

    hehehe

      Oops. Misread the output.

      Nice!

      And, for windows, change all the single quotes to doubles.

      C:\>echo "one & two &# three &#3 four &#38 five &#38; six" | perl -pe "s/&(#(3(8(;\K)?)?)?)?//;"
      "one two &# three &#3 four &#38 five &#38; six"
Re: Need a regex to replace incomplete html entities
by Athanasius (Bishop) on Nov 20, 2016 at 10:56 UTC

    Hello Chris Daniel, and welcome to the Monastery!

    Please note that your requirement as stated is self-contradictory. I assume you don’t want to remove the entity &#38. In which case, you can use a negative lookahead assertion to ensure that the entity to be removed is not immediately followed by another digit:

    20:54 >perl -wE "my $x = '<Remarks>1 SW PLT SLAC 6 PCS &#3</Remarks>'; $x =~ s/&#\d(?!\d)//g; say $x;"
    <Remarks>1 SW PLT SLAC 6 PCS </Remarks>
    20:54 >perl -wE "my $x = '<Remarks>1 SW PLT SLAC 6 PCS &#38</Remarks>'; $x =~ s/&#\d(?!\d)//g; say $x;"
    <Remarks>1 SW PLT SLAC 6 PCS &#38</Remarks>
    20:54 >

    See “Lookaround Assertions” in perlre#Extended-Patterns.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Need a regex to replace incomplete html entities
by Laurent_R (Canon) on Nov 20, 2016 at 10:52 UTC
    I basically want to replace the string &,&#,&#3,&#38 to blank, but it should not replace &#38;
    This seems to be contradictory: you want to replace &#38, but don't want to do it.

    Please explain.

    Update: added code tags because some characters were dropped from &#38 rendering the post difficult to understand.

      What I am looking for: if & is not followed by #38; then replace the & with blank.
      If a line contains & or &# or &#3 or &#38, it should be replaced with blank, but &#38; should not be affected.
      Note: File is 200+ MB so thinking to apply sed command.
        If I understand you correctly, the important difference is the semi-colon: you want to replace &#38, but not if it is followed by a semi-colon (i.e. you don't want to replace &#38;). The poor formatting in your post made it difficult to understand that.

        The easy solution is to use a negative look-ahead, as already suggested in other posts, but I doubt that sed supports look-ahead assertions (it may depend on the version).

        Besides, even for a 200 MB file, this should not be a problem in Perl. Last time I compared the performance of Perl and sed, I did not find a really significant performance difference between them, but, again, this may depend on the implementation of the sed version you're using.

Re: Need a regex to replace incomplete html entities
by LanX (Archbishop) on Nov 20, 2016 at 11:02 UTC
    I can't test but I think you need (?<!pattern) - a negative look-behind assertion, to exclude a following semi colon, while matching the start.

    See Using Look-ahead and Look-behind

    NB: this won't be xml aware!

    update

    Look ahead not look behind, Sorry I confused perspectives. ;)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Hello LanX,

      I missed the semicolon because of the question’s formatting, but I don’t think you need a look-behind to handle it; a simple look-ahead seems to be enough:

      21:08 >perl -wE "my $x = '<Remarks>1 SW PLT SLAC 6 PCS &#3;</Remarks>'; $x =~ s/&#\d(?!\d);?//g; say $x;"
      <Remarks>1 SW PLT SLAC 6 PCS </Remarks>
      21:08 >perl -wE "my $x = '<Remarks>1 SW PLT SLAC 6 PCS &#38;</Remarks>'; $x =~ s/&#\d(?!\d);?//g; say $x;"
      <Remarks>1 SW PLT SLAC 6 PCS &#38;</Remarks>
      21:09 >perl -wE "my $x = '<Remarks>1 SW PLT SLAC 6 PCS &#3</Remarks>'; $x =~ s/&#\d(?!\d);?//g; say $x;"
      <Remarks>1 SW PLT SLAC 6 PCS </Remarks>
      21:09 >

      Or am I still missing something?

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        No, I already corrected it in the meantime.

        Only work and no play makes LanX a dull boy 

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        This looks perfect. But I need to remove the entities without opening the file. Using the unix sed command would be great... Any other suggestion to replace the string without opening the file would be great...
Re: Need a regex to replace incomplete html entities
by LanX (Archbishop) on Nov 21, 2016 at 12:46 UTC
    Another approach:

    > I basically want to replace the string &,&#,&#3,&#38 to blank, but it should not replace &#38;

    This meets your "requirements" and is IMHO easier to understand and more intuitive than tybalt89's solution.

    DB<19> p $test
    &,&#,&#3,&#38,&#38;&,&#,&#3,&#38,&#38;
    DB<20> p $test =~ s/(&#\d+;)|&#?\d*/$1/gr
    ,,,,&#38;,,,,&#38;

    The trick is to first match correct entities and leave them unchanged by replacing them with themselves.

    Incorrect entities are then found by backtracking and replaced with an empty $1.

    Please handle this with care, I'm not sure if your requirements didn't miss edge cases.

    NB: yes, it will also replace &38 without #

    DB<22> p ',&38,' =~ s/(&#\d+;)|&#?\d*/$1/gr
    ,,

    otherwise you can add other or-conditions to exclude this case.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    update

    replaced s/(&#?\d*;)|... with s/(&#\d+;)|
