Removing characters

by TStanley (Canon)
on Jan 09, 2001 at 01:26 UTC (#50567=perlquestion)

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

Earlier I was redirecting the output of a man page
to a file (easily done). When I opened the file, I noticed
two problems:
  1. There were ^H characters embedded in the file.
  2. Some of the words in the file had doubled characters. For example,
    the word SYNOPSIS appears as SSYYNNOOPPSSIISS.
I got rid of the ^H characters easily enough, but I was still stuck with
the doubled characters. I decided to write a small script that could handle
this problem.

#!/usr/bin/perl -w
use strict;

my $DOC    = "file.in";
my $OUTDOC = "file.out";

# Open the input for reading and the output for appending.
open(INFILE, '<', $DOC) or die "Could Not Open $DOC: $!";
open(OUTFILE, '>>', $OUTDOC) or die "Could Not Open $OUTDOC: $!";

while (<INFILE>) {
    if (/aa|bb|cc|dd|ee|ff|gg|hh|ii|jj|kk|ll|mm|nn|oo|pp|qq|rr|ss|tt|uu|vv|ww|xx|yy|zz/i) {
        # ????????????
        print OUTFILE $_;
    }
    else {
        print OUTFILE $_;
    }
}

close INFILE;
close OUTFILE;

Basically, I would like the script to read in each
line of the file, remove the extra characters from each word, and write
everything to the new file. If there are no doubled characters in the
line, it should just write the line to the output file unchanged.

TStanley
There can be only one!

Replies are listed 'Best First'.
Re: Removing characters
by Dominus (Parson) on Jan 09, 2001 at 02:10 UTC
    What is really going on here is that the output from man is designed to be printed on a paper printer. When the output includes something like x^Hx it's because man wants the printer to back up and overprint the x a second time, to render the x in boldface.

    It's a big mistake to remove the ^H characters first and then to try to strip the doubled letters, because by removing the ^Hes, you've thrown away the information about which double letters should be stripped. What you want is to take the output from man and simply do this:

    $text =~ s/.\cH//g;
    That's all. nroff also uses _^Hx to indicate an underlined x character, and this will fix that also.
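As a minimal sketch of the substitution above, the snippet below applies it to a hand-built string that imitates nroff's overstrike output. The helper name `strip_overstrikes` and the sample string are made up for illustration and are not part of the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_overstrikes is a hypothetical helper name, not from the thread.
# It deletes every "character followed by a backspace" pair, so only
# the character printed last at each position survives.
sub strip_overstrikes {
    my ($text) = @_;
    $text =~ s/.\cH//g;
    return $text;
}

# Hand-built sample: "NAME" in bold (N^HN A^HA ...) followed by an
# underlined "ls" (_^Hl _^Hs). \cH is the backspace character.
my $sample = "N\cHNA\cHAM\cHME\cHE _\cHl_\cHs\n";

print strip_overstrikes($sample);
```

Run against this sample, the bold and the underline markup both disappear and only the plain text `NAME ls` is left, which is why the backspaces have to be present when the substitution runs.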

    The other right solution was the one that mr.nick suggested: Use the col -b command to filter out the reverse motions before saving the output.

Re: Removing characters
by eg (Friar) on Jan 09, 2001 at 01:36 UTC

    If you're redirecting output from man, you don't want to remove the ^H first; it's valuable information. Something like:

    % man ls | perl -pe 's/.\cH//g;' > foo

    will remove both the ^H and the duplicate character at the same time.

      If you are really trying to get plain-text output from man, you should use the standard method of doing so:

      $ man ls | col -b > man-ls.txt
Re: Removing characters
by j.a.p.h. (Pilgrim) on Jan 09, 2001 at 01:35 UTC
    The only problem with that is: what if the word is supposed to have two of the same letter next to each other? In this silly language of ours, that's rather common. However, there are some letters which probably won't be paired: Y, Q and X, for example. But with all the rest, you may want to think of another way to do it.

    Note: I'm too tired and hungry to help with any actual code, unfortunately.

Re: Removing characters
by chipmunk (Parson) on Jan 09, 2001 at 01:56 UTC
    Okay, I see some big disadvantages to this approach.

    First, the regex is very displeasing. A good way to remove duplicate letters would be:

    if (s/([a-z])\1/$1/ig) {

    Take advantage of capturing parens and back-references!

    More serious, however, is that your code will match on such text as "this too shall pass". How do you determine which duplicates you want to remove and which you don't? Any solution which removes the backspaces in one step and the duplicates in a second step is doomed to failure.
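To make the objection concrete, here is a small sketch that wraps the back-reference substitution in a function and runs it on one man-page style heading and on the sentence quoted above. The name `dedupe_naive` is invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dedupe_naive is a hypothetical name for the "collapse every doubled
# letter" approach. It cannot tell overstruck bold text from ordinary
# English words that happen to contain double letters.
sub dedupe_naive {
    my ($line) = @_;
    $line =~ s/([a-z])\1/$1/ig;
    return $line;
}

print dedupe_naive("SSYYNNOOPPSSIISS"), "\n";      # the heading is repaired
print dedupe_naive("this too shall pass"), "\n";   # ordinary text is damaged
```

The heading comes out as `SYNOPSIS`, but the sentence becomes `this to shal pas`: once the backspaces are gone, nothing in the text says which pairs were overstrikes.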

    Here's a better approach:

    #!perl -p
    s/.\cH//g;
    However, that won't work properly if there are multiple backspaces in a row.

    A more generic and robust solution would be to use col. (Shameless plug. :)

      "ignore backspaces" while s/[^\cH]\cH//g;

      will work for runs of backspaces, BTW.

              - tye (but my friends call me "Tye")
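The difference between the single-pass substitution and the repeated one can be sketched as follows. The function names and the sample string are assumptions made for this example; the string imitates "ab", two backspaces, then "cd" printed over the top.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one pass of the simple substitution.
sub strip_single_pass {
    my ($text) = @_;
    $text =~ s/.\cH//g;
    return $text;
}

# Hypothetical helper: repeat the substitution until nothing changes,
# never letting a backspace count as the "overprinted" character.
sub strip_repeated {
    my ($text) = @_;
    1 while $text =~ s/[^\cH]\cH//g;
    return $text;
}

# Make backspaces visible as the two characters ^H when printing.
sub show {
    my ($text) = @_;
    $text =~ s/\cH/^H/g;
    return $text;
}

my $sample = "ab\cH\cHcd";

print "single pass: ", show(strip_single_pass($sample)), "\n";
print "repeated:    ", show(strip_repeated($sample)), "\n";
```

On this sample the single pass leaves a stray backspace behind (`a^Hcd`), while the repeated form peels the run off one pair at a time and ends with `cd`.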
Re: Removing characters
by chromatic (Archbishop) on Jan 09, 2001 at 01:33 UTC
    Identifying words with valid double characters makes this trickier. I'd start with something like:
    while (<INFILE>) { s/([A-Za-z])\1/$1/g; }
    That's not an ideal solution, but it's a brute-force 80 percenter.
Re: Removing characters
by cat2014 (Monk) on Jan 09, 2001 at 01:36 UTC
    This is not a perl solution, but if you just don't want the double letters to appear, and it's a perldoc which you're looking at, just do:

    perldoc -t Some::Module

    the -t tells it to output in plain text. Unfortunately, the man on my system doesn't have such an option, so this might not really help you.

    Try reading this thread: removing duplicate letters

(tye)Re: Removing characters
by tye (Sage) on Jan 09, 2001 at 03:29 UTC

    Just for fun, I'll assume that I got the output from "man" and someone else has removed the backspaces for me so I want to remove the duplicates... And based on my experience, I may end up with tripled or quadrupled letters as well.

    You could do pretty well by finding "words" that contain only doubled letters and modifying those. But a "good" way to do this hasn't popped into my brain yet...

    my $inWord= "[-\\w'(),]";
    my $notWord= "[^-\\w'(),]";
    s#($notWord)(($inWord)\3(?:$inWord)*($inWord)\4)($notWord)#
        my( $pre, $word, $post )= ( $1, $2, $5 );
        my $len= length($word);
        for( $word =~ /(.)(\1*)/g ) {
            $len= length($2) if length($2) < $len;
        }
        $word =~ s/(.)\1{$len}/$1/g if 0 < $len;
        $pre . $word . $post;
    #ge
    Like I said, that doesn't seem like a great way to do it (untested as well). :-}

            - tye (but my friends call me "Tye")
      Here's another way of doing it, inspired by your code:
      my $word = q{-\w'(),};
      s{
          (^|[^$word])
          ( (?:([$word])\3)+ )
          (?=[^$word]|$)
      }{
          my @x = ($1, $2);
          $x[1] =~ s/(.)./$1/g;
          join '', @x;
      }xige;
      That matches a complete "word" that consists entirely of doubled characters. The "word" is in $2; $1 holds the preceding character. In the replacement, since I know that $2 contains only doubled characters, I just delete every other character. (Tested.)
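For anyone who wants to try the whole-word approach, the sketch below wraps the same substitution in a function so it can be called on individual lines. The wrapper name `undouble_words` and the sample lines are assumptions added for illustration; the regex itself is the one from the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# undouble_words is a hypothetical wrapper name. It only rewrites
# "words" made up entirely of doubled characters, so ordinary words
# such as "too" or "pass" are left alone.
sub undouble_words {
    my ($line) = @_;
    my $word = q{-\w'(),};          # characters allowed inside a "word"
    $line =~ s{
        (^|[^$word])                # $1: start of line or a non-word char
        ( (?:([$word])\3)+ )        # $2: one or more doubled characters
        (?=[^$word]|$)              # must be followed by a non-word char or end
    }{
        my @x = ($1, $2);
        $x[1] =~ s/(.)./$1/g;       # keep every other character of the word
        join '', @x;
    }xige;
    return $line;
}

print undouble_words("SSYYNNOOPPSSIISS"), "\n";
print undouble_words("SSEEEE AALLSSOO"), "\n";
print undouble_words("this too shall pass"), "\n";
```

This is still a heuristic: a genuine word consisting only of doubled letters would be shortened too, which is why filtering with the backspaces still in place remains the more reliable route.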
Re: Removing characters
by lemming (Priest) on Jan 09, 2001 at 01:36 UTC
    Well, I think the col -b solution was the best solution, but one style of regex for your purpose could be:
    s/([a-zA-Z])\1/$1/g; #happy now
    However this will clobber lines like this one. Also we are not considering non-alpha at all.
    Update: And eg has the perl solution that will work. (Could have sworn that's what I said in the CB)
    YAU: I'm assuming the downvote came from my use of \1 instead of $1. (Lots of vi editing always does that to me.) If not, please /msg me and enlighten me.

    For those that are wondering about the \1 & $1: Inside a match use the \1 backreference. Outside use the $1 notation. This doesn't wind up being very important unless you've got more than nine backreferences. \10 is shorthand for \010 which is octal. So if you have 10 or more matches \10 will be the tenth match, otherwise it's octal 10.

Re: Removing characters
by TStanley (Canon) on Jan 09, 2001 at 01:54 UTC
    Update:
    My thanks go out to you all for your solutions. eg had the solution that worked.

    TStanley
Re: Removing characters
by mp3car-2001 (Scribe) on Jan 09, 2001 at 04:04 UTC
    Ok, I'm not sure if any of these are really the best answer. Try using man2html (it's on my linux boxen anyway) to convert them to HTML, and if need be convert the HTML to plain text later. I'm not sure how to make man2html work, but that's what its manpage is for :)

    Joe
