PerlMonks  

parsing question

by Washie101 (Novice)
on May 28, 2003 at 07:53 UTC ( #261248=perlquestion )

Washie101 has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I have a parsing teaser that I need to get fixed.

Here goes.
Here are two lines of code. In the first one I need to remove
the comment at the end; the second one needs to be ignored.


1. code <xyzfdgfghgf> ;strip me
2. code <<HTML>;nbsp dont strip me</HTML>>


Basically, I want to search for and trim (0+ spaces);(0+ chars) AFTER the last >


I need to have it in the format
if($line=~pattern)
{
$line =~new Trimmed comment pattern ;
}


Can anyone help?

J

Replies are listed 'Best First'.
Re: parsing question
by kilinrax (Deacon) on May 28, 2003 at 08:50 UTC
    Unfortunately your question isn't terribly clear, so I'm not entirely sure what you're looking for.
    However, one thing I would suggest: if you want to match after the last occurrence of something, it may be easier to apply a regex to a reversed string, e.g.:
    my $reverse = reverse $line;
    $reverse =~ s| \w* ; \s* > |>|x;
    $line = reverse $reverse;

      or maybe a greedy regex?

      $line =~ s/^(.*>) ;.*/\1;/;

      He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

      Chady | http://chady.net/

        In a word, no.

        Reversing the string and matching against that is much faster.
        Have a look at these benchmarks:

        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        my $string = "<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me";

        sub reversed {
            my $reverse = reverse(shift);
            $reverse =~ s| \w* ; \s* > |>|x;
            return scalar reverse $reverse;
        }

        sub greedy {
            my $line = shift;
            $line =~ s|^ (.*>) \s* ; \w* |$1|x;
            return $line;
        }

        print "Reversed: ", reversed($string), "\n";
        print "Greedy: ", greedy($string), "\n";

        timethese( -10, {
            reversed => sub { reversed( $string ) },
            greedy   => sub { greedy( $string ) },
        } );

        Output:

        Reversed: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
        Greedy: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
        Benchmark: running greedy, reversed, each for at least 10 CPU seconds...
            greedy: 10 wallclock secs ( 9.98 usr + 0.02 sys = 10.00 CPU) @ 78480.80/s (n=784808)
          reversed: 11 wallclock secs (10.46 usr + 0.00 sys = 10.46 CPU) @ 167660.04/s (n=1753724)

        As you can see, it's over twice the speed. On longer strings, the difference would be even greater.

        Also, your regex is wrong. Read through perldoc:perlre (specifically, the section marked 'Warning on \1 vs $1') to discover why.
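For readers who do not have perlre to hand, here is a minimal sketch of the point about `\1` versus `$1`. It is not code from the thread: the variable names are made up for illustration, and the pattern uses `.*` after the semicolon (an assumption) so that a comment containing spaces is removed as well.

```perl
use strict;
use warnings;

my $line = 'code <xyzfdgfghgf> ;strip me';

# On the replacement side of s///, refer to the capture as $1.
# \1 is meant for backreferences inside the pattern itself.
# Assumption: .* (not \w*) so the whole trailing comment goes.
(my $trimmed = $line) =~ s/^(.*>)\s*;.*/$1/;
print "$trimmed\n";

# ${1} keeps the capture separate from any digits that follow it,
# which the \1 form has no way to express.
(my $padded = '42') =~ s/(\d+)/${1}000/;
print "$padded\n";
```

The greedy `.*` inside the capture is what anchors the match to the last `>`; the replacement then puts back only the captured part.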

Re: parsing question
by Zaxo (Archbishop) on May 28, 2003 at 12:54 UTC

    Concentrating on

    basically i want to search and trim (0+ spaces);(0+chars) AFTER the last >
    as the actual requirement:
    substr( $line, rindex( $line, '>')) =~ s/\s*;\w*//; # typo corrected, s/:/;/ in the regex
    That has a certain amount of magic in it that I should explain. The substr function can be used as an lvalue, meaning that the string of its first argument is modifiable through it. The rindex function finds the last '>' in $line, making substr deal with only the portion of $line from that position onward. Effectively, the substitution is restricted to the part of $line that you specified.

    After Compline,
    Zaxo
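The lvalue substr approach described above could be applied to the two sample lines from the question roughly as follows. This is a sketch under stated assumptions rather than Zaxo's exact code: `strip_trailing_comment` is a hypothetical helper name, the guard for lines without any `>` is an addition, and `.*` replaces `\w*` so that a comment containing spaces ("strip me") is removed completely.

```perl
use strict;
use warnings;

# Strip "(0+ spaces);(0+ chars)" that follows the last '>' in a line.
# strip_trailing_comment is a hypothetical name, not from the thread.
sub strip_trailing_comment {
    my ($line) = @_;
    my $pos = rindex($line, '>');
    return $line if $pos < 0;    # no '>' at all: leave the line alone

    # substr() as an lvalue: the substitution edits $line in place,
    # but only sees the text from the last '>' onward.
    # Assumption: .* instead of \w* so comments with spaces go too.
    substr($line, $pos) =~ s/\s*;.*//s;
    return $line;
}

print strip_trailing_comment('code <xyzfdgfghgf> ;strip me'), "\n";
print strip_trailing_comment('code <<HTML>;nbsp dont strip me</HTML>>'), "\n";
```

Because nothing after the last `>` can contain another `>`, the `.*` cannot eat into the protected part of the line, and the second sample line comes back unchanged.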

Re: parsing question
by TomDLux (Vicar) on May 28, 2003 at 12:58 UTC

    What is generating this data?

    • The first has a single HTML tag, with no closing tag, while the second has opening and closing tags.
    • The first has a non-standard tag, while the second has valid HTML tags.
    • The second has the chunk enclosed in angle brackets. Is that what makes it acceptable?
    • Does it matter that your tags are only acceptable HTML 4? XHTML requires lower case tags.
    • Since HTML documents are not line-oriented, breaks can occur anywhere, or many components can be on one line. Is that relevant to your document?
Re: parsing question
by Wonko the sane (Deacon) on May 28, 2003 at 13:49 UTC
    I like kilinrax's use of reverse; I have never seen that trick before.
    Without knowing it, I would have suggested a capturing regex,
    sort of a modification of the greedy suggestion.

    It benchmarks the fastest of the three.

    #!/usr/local/bin/perl
    use strict;
    use Benchmark;

    my $string = "<<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf> ;strip_me";

    sub reversed {
        my $reverse = reverse(shift);
        $reverse =~ s| \w* ; \s* > |>|x;
        return scalar reverse $reverse;
    }

    sub greedy {
        my $line = shift;
        $line =~ s|^ (.*>) \s* ; \w* |$1|x;
        return $line;
    }

    sub capture {
        my $line = shift;
        return $line =~ /^(.+>)/;
    }

    print "Reversed: ", reversed($string), "\n";
    print "Greedy: ", greedy($string), "\n";
    print "Capture: ", capture($string), "\n";

    timethese( -10, {
        reversed => sub { reversed( $string ) },
        greedy   => sub { greedy( $string ) },
        capture  => sub { capture( $string ) },
    } );
    Output:
    :!./test.pl
    Reversed: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Greedy: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Capture: <<HTML>;nbsp dont_strip_me</HTML>> <xyzfdgfghgf>
    Benchmark: running capture, greedy, reversed, each for at least 10 CPU seconds...
       capture: 10 wallclock secs (10.40 usr + 0.01 sys = 10.41 CPU) @ 53160.52/s (n=553401)
        greedy: 10 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 21887.07/s (n=230252)
      reversed: 11 wallclock secs (10.54 usr + 0.01 sys = 10.55 CPU) @ 36366.92/s (n=383671)

    Wonko

Node Type: perlquestion [id://261248]
Approved by graff