Re: Removing characters
by Dominus (Parson) on Jan 09, 2001 at 02:10 UTC
|
What is really going on here is that the output from man
is designed to be printed on a paper printer. When the
output includes something like x^Hx it's because
man wants the printer to back up and overprint the x
a second time, to render the x in boldface.
It's a big mistake to remove the ^H characters first and
then try to strip the doubled letters, because by removing the ^H characters
you've thrown away the information about which doubled letters
should be stripped. What you want is to take the output
from man and simply do this:
$text =~ s/.\cH//g;
That's all. nroff also uses _^Hx to
indicate an underlined x character, and this
will fix that also.
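To make the difference concrete, here is a small sketch. The sample text is made up (it is not real man output), and the bold and underline helpers are hypothetical names that only fake the overstrike sequences described above. It compares the one-step substitution with the strip-backspaces-then-dedupe approach:

```perl
use strict;
use warnings;

# Hypothetical helpers that fake nroff-style overstrikes.
sub bold      { join '', map { "$_\cH$_" } split //, $_[0] }   # x^Hx
sub underline { join '', map { "_\cH$_" } split //, $_[0] }    # _^Hx

# Made-up sample: a bold word, a plain word with a genuine double letter,
# and an underlined word.
my $raw = bold('NAME') . ' too ' . underline('ls');

# One step: delete each character that is followed by a backspace,
# along with the backspace itself.
(my $one_step = $raw) =~ s/.\cH//g;

# Two steps: throw the backspaces away first, then collapse doubled letters.
(my $two_step = $raw) =~ s/\cH//g;
$two_step =~ s/([a-z])\1/$1/ig;

print "one step:  $one_step\n";   # NAME too ls
print "two steps: $two_step\n";   # NAME to _l_s
```

The two-step version damages the plain word "too" and leaves the underscores behind, because once the backspaces are gone nothing marks which characters were overstrikes.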
The other right solution was the one that mr.nick
suggested: Use the col -b command to filter out the
reverse motions before saving the output.
Re: Removing characters
by eg (Friar) on Jan 09, 2001 at 01:36 UTC
|
If you're redirecting output from man, you don't want to remove the ^H first; it's valuable information. Something like:
% man ls | perl -pe 's/.\cH//g;' > foo
will remove both the ^H and the duplicate character at the same time.
|
If you are really trying to get plain-text output from man, you should use the standard method of doing so:
$ man ls | col -b > man-ls.txt
Re: Removing characters
by j.a.p.h. (Pilgrim) on Jan 09, 2001 at 01:35 UTC
|
The only problem with that is: what if the word is supposed to have two of the same letter next to each other? In this silly language of ours, that's rather common. However, there are some letters which probably won't be paired: Y, Q and X, for example. With all the rest, you may want to think of another way to do it.
Note: I'm too tired and hungry to help with any actual code, unfortunately.
Re: Removing characters
by chipmunk (Parson) on Jan 09, 2001 at 01:56 UTC
|
Okay, I see some big disadvantages to this approach.
First, the regex is very displeasing. A better way to remove duplicate letters would be:
s/([a-z])\1/$1/ig;
Take advantage of capturing parens and back-references!
More serious, however, is that your code will match text such as "this too shall pass". How do you determine which duplicates you want to remove and which you don't?
Any solution which removes the backspaces in one step and the duplicates in a second step is doomed to failure.
Here's a better approach:
#!perl -p
s/.\cH//g;
However, that won't work properly if there are multiple backspaces in a row.
A more generic and robust solution would be to use col. (Shameless plug. :)
|
"ignore backspaces" while s/[^\cH]\cH//g;
will work for runs of backspaces, BTW.
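Here is a rough sketch of why the loop matters. The input string is a made-up example of a run of backspaces (I'm not claiming man emits exactly this), and a plain 1 stands in for the joke string as the loop body:

```perl
use strict;
use warnings;

# Hypothetical input: "ab", two backspaces in a row, then "ab" again.
my $raw = "ab\cH\cHab";

# Single pass: the "b" and the first backspace are removed together,
# which leaves the second backspace stranded after the "a".
(my $single = $raw) =~ s/.\cH//g;

# Looping version: [^\cH] never treats a backspace as the erased character,
# so each pass peels one more layer off the run until nothing matches.
my $looped = $raw;
1 while $looped =~ s/[^\cH]\cH//g;

# Count the backspaces that survived the single pass.
my $left = () = $single =~ /\cH/g;

print "single pass leaves $left backspace(s)\n";   # 1
print "looped result: $looped\n";                  # ab
```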
-
tye
(but my friends call me "Tye")
Re: Removing characters
by chromatic (Archbishop) on Jan 09, 2001 at 01:33 UTC
|
Identifying words with valid double characters makes this trickier. I'd start with something like:
while (<INFILE>) {
    s/([A-Za-z])\1/$1/g;   # collapse any doubled letter to a single one
    print;                 # emit the modified line
}
That's not an ideal solution, but it's a brute-force 80 percenter.
Re: Removing characters
by cat2014 (Monk) on Jan 09, 2001 at 01:36 UTC
|
This is not a perl solution, but if you just don't want the
double letters to appear, and it's a perldoc which you're looking
at, just do:
perldoc -t Some::Module
The -t tells it to output plain text. Unfortunately, the
man on my system doesn't have such an option, so this might
not really help you.
Try reading this thread: removing duplicate letters
(tye)Re: Removing characters
by tye (Sage) on Jan 09, 2001 at 03:29 UTC
|
Just for fun, I'll assume that I got the output from "man" and someone else has removed the backspaces for me so I want to remove the duplicates... And based on my experience, I may end up with tripled or quadrupled letters as well.
You could do pretty well by finding "words" that contain only doubled letters and modifying those. But a "good" way to do this hasn't popped into my brain yet...
my $inWord= "[-\\w'(),]";
my $notWord= "[^-\\w'(),]";
s#($notWord)(($inWord)\3(?:$inWord)*($inWord)\4)($notWord)#
my( $pre, $word, $post )= ( $1, $2, $5 );
my $len= length($word);
for( $word =~ /(.)(\1*)/g ) {
$len= length($2) if length($2) < $len;
}
$word =~ s/(.)\1{$len}/$1/g if 0 < $len;
$pre . $word . $post;
#ge
Like I said, that doesn't seem like a great way to do it (untested as well). :-}
-
tye
(but my friends call me "Tye")
|
Here's another way of doing it, inspired by your code:
my $word = q{-\w'(),};
s{
(^|[^$word])
(
(?:([$word])\3)+
)
(?=[^$word]|$)
}{
my @x = ($1, $2);
$x[1] =~ s/(.)./$1/g;
join '', @x;
}xige;
That matches a complete "word" that consists entirely of doubled characters. The "word" is in $2; $1 holds the preceding character. In the replacement, since I know that $2 contains only doubled characters, I just delete every other character. (Tested.)
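For anyone who wants to try it, this is a sketch of the same substitution wrapped in a sub. The name undouble_words and the sample strings are mine, purely for illustration:

```perl
use strict;
use warnings;

my $word = q{-\w'(),};

# Hypothetical wrapper around the substitution above.
sub undouble_words {
    my ($line) = @_;
    $line =~ s{
        (^|[^$word])            # start of line or a non-word character
        (
            (?:([$word])\3)+    # one or more doubled word characters
        )
        (?=[^$word]|$)          # word must end here
    }{
        my @x = ($1, $2);
        $x[1] =~ s/(.)./$1/g;   # keep every other character
        join '', @x;
    }xige;
    return $line;
}

print undouble_words("NNAAMMEE too"), "\n";          # NAME too
print undouble_words("this too shall pass"), "\n";   # unchanged
```

Only words made up entirely of doubled characters are touched, so "too" and "shall" survive. A genuine word that happens to consist only of doubled characters would still be collapsed.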
Re: Removing characters
by lemming (Priest) on Jan 09, 2001 at 01:36 UTC
|
Well, I think the col -b solution was the best solution, but one
style of regex for your purpose could be:
s/([a-zA-Z])\1/$1/g; #happy now
However this will clobber lines like this one. Also
we are not considering non-alpha at all.
Update: And eg has the Perl solution that
will work. (Could have sworn that's what I said in the CB.)
YAU: I'm assuming the downvote came from
my use of \1 instead of $1. (Lots of vi editing always
does that to me.) If not, please /msg me and enlighten me.
For those that are wondering about \1 and $1: Inside
the pattern, use the \1 backreference. In the replacement, use the $1
notation. This doesn't wind up being very important unless
you've got more than nine capture groups. \10 is ambiguous: it is also
shorthand for \010, which is octal. So if the pattern has 10 or more groups,
\10 refers to the tenth group; otherwise it's octal 10.
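A quick sketch of that \10 behaviour, with throwaway test strings of my own. The last case uses the \g{10} form, which assumes Perl 5.10 or later:

```perl
use strict;
use warnings;

# Only one capture group, so \10 cannot be a backreference: it is read as
# octal 010, which happens to be the backspace character.
my $octal = "a\cH" =~ /(a)\10/ ? 1 : 0;

# Ten capture groups, so \10 now refers back to the tenth one ("j").
my $backref = "abcdefghijj" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/ ? 1 : 0;

# \g{10} is the unambiguous spelling of "group 10" (Perl 5.10+ assumed).
my $explicit = "abcdefghijj" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\g{10}/ ? 1 : 0;

print "octal: $octal, backref: $backref, explicit: $explicit\n";
```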
Re: Removing characters
by TStanley (Canon) on Jan 09, 2001 at 01:54 UTC
|
Update:
My thanks go out to you all for your solutions. eg had the solution that worked.
TStanley
Re: Removing characters
by mp3car-2001 (Scribe) on Jan 09, 2001 at 04:04 UTC
|
OK, I'm not sure if any of these is really the best answer. Try using man2html (it's on my Linux boxen anyway) to convert them to HTML, and if need be convert the HTML to plain text later. I'm not sure how to make man2html work, but that's what its manpage is for :)
Joe