http://www.perlmonks.org?node_id=458063

Elijah has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have a nifty way of using a Perl regexp to filter man page control characters such as ^H (CTRL-H), and to turn output such as _O_P_T_I_O_N into OPTION?

Mind you, the option above with underscores also has control characters in it, but I could not post them here without screwing up the entire post. I mention this because a simple s/\_//g; will not do.

Replies are listed 'Best First'.
Re: Filter man page control characters?
by Thelonius (Priest) on May 18, 2005 at 04:42 UTC
    s/.\010//g;
        \010 is the octal escape for ^H (backspace). This regex deletes each backspace together with the character just before it. With man, an underlined M is _^HM and a bold M is M^HM. On impact printers these sequences work literally: the print head backs up and overstrikes. The programs more and less know how to interpret them (more or less) for CRT terminals. If you just cat them to a CRT terminal, you'll still see the letter, because the letter is written last and replaces the underscore. The underline sequence could be written as M^H_ and it would still work on a line printer, but it would screw up a CRT terminal, so it is purposely not written that way.

        So the regex s/.\010//g is just telling perl to do what the backspace does: delete the character before the backspace.
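
A minimal sketch of that substitution applied to the two sequences described above. The sample strings are built by hand for illustration (they are not taken from a real man page), and strip_overstrike is a hypothetical helper name:

```perl
use strict;
use warnings;

# Remove each backspace (\010) together with the character before it.
sub strip_overstrike {
    my ($text) = @_;
    $text =~ s/.\010//g;
    return $text;
}

# Hand-built samples: underline is _^HX, bold is X^HX.
my $underlined = join '', map { "_\010$_" }  split //, 'OPTION';
my $bold       = join '', map { "$_\010$_" } split //, 'NAME';

print strip_overstrike($underlined), "\n";   # OPTION
print strip_overstrike($bold), "\n";         # NAME
```

Because the dot does not match a newline, a backspace at the very start of a line is left alone.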

Re: Filter man page control characters?
by mattk (Pilgrim) on May 18, 2005 at 06:09 UTC
    Incidentally, here is a handy bash alias to view your man pages in vim, complete with highlighting:
    alias man="man -P \"col -b | vim -R -c 'set ft=man nomod nolist' -\""
Re: Filter man page control characters?
by Zaxo (Archbishop) on May 18, 2005 at 04:46 UTC

    I think this requires a parser, not a simple regex. Have you tried running groff with -Tascii? You could call that from perl with piped open and get the asciified lines by reading the filehandle.
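
A rough sketch of the piped-open approach, assuming a Unix-like system where the list form of open is available. read_clean is a hypothetical helper, and the groff arguments in the usage comment are placeholders. As a precaution it also applies the s/.\010//g substitution from the reply above, in case the formatter still emits overstrikes:

```perl
use strict;
use warnings;

# Run a command and return its output lines with overstrikes removed.
# The command is passed as a list, so no shell is involved.
sub read_clean {
    my @cmd = @_;
    open my $fh, '-|', @cmd or die "cannot run $cmd[0]: $!";
    my @lines = <$fh>;
    close $fh or warn "$cmd[0] exited with status $?";
    s/.\010//g for @lines;    # drop each backspace and the character before it
    return @lines;
}

# Hypothetical usage; the page name is a placeholder:
# my @lines = read_clean('groff', '-man', '-Tascii', 'some_page.1');
```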

    After Compline,
    Zaxo

Re: Filter man page control characters?
by thcsoft (Monk) on May 18, 2005 at 04:46 UTC
    i don't know what you're intending, but i played a little, and this is what came out:
    cp /usr/doc/man/man1/aaxine.1.gz .
    gunzip aaxine.1.gz
    perl -e 'open FH, "aaxine.1" or die $!; my @xine = <FH>; close FH; s/\cH//g for @xine; print @xine'
    the regex s/\cH//g filters CTRL-H at least...

    language is a virus from outer space.
Re: Filter man page control characters?
by TheStudent (Scribe) on May 18, 2005 at 04:50 UTC
    While not a solution, I believe the control characters you are referring to are backspaces that allow _O_P_T_I_O_N to appear as OPTION.

    Update: Are your man pages already processed in some way? The source for man pages on my system has formatting codes rather than control characters.

    REs are definitely not my strong point, but if you only need to deal with the control-H's (backspaces; \cH and [\b] match the same character), how about reading the man file into an array (e.g., @manlines) and then using something like:

    map { s/\cH|[\b]//g } @manlines;
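
For comparison, a small sketch of what each kind of substitution leaves behind. The sample line is hand-built for illustration: removing only the backspaces keeps the underscores, while removing each backspace together with the character before it yields the plain word.

```perl
use strict;
use warnings;

my $line = "_\010O_\010P_\010T";      # underlined "OPT", built by hand

(my $only_bs = $line) =~ s/\cH//g;    # removes just the backspaces
(my $pairs   = $line) =~ s/.\cH//g;   # removes each backspace and the character before it

print "$only_bs\n";   # _O_P_T
print "$pairs\n";     # OPT
```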
Re: Filter man page control characters?
by runrig (Abbot) on May 18, 2005 at 16:51 UTC
    No need for a regexp (though fine as an exercise). Just use "col -b". E.g.
    man some_thing | col -b >some.file