http://www.perlmonks.org?node_id=11126825

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a large text file to process. When I view it with the 'more' filter I see a lot of these tags:

<91> <92> <93> <94> <97>
I can see from the context they are of a grammatical nature.

Is there a nice way to convert these tags into punctuation marks etc?

Replies are listed 'Best First'.
Re: convert tags to punctuation
by Fletch (Bishop) on Jan 13, 2021 at 04:08 UTC

    Those sound like the hex values of the old Microsoft "Smart" quotes (Windows-1252 code points), where the software tries to helpfully replace correct ASCII punctuation with characters that wind up being unreadable on other platforms. There's ancient code from 2003 that should still work, or you can work out what the correct corresponding character would be and fix just the things you're seeing (e.g. s/\x91/`/g I'm guessing) and run some variation of perl -i.bak -pE 's/...//' input.txt over your file(s).

    Edit: Just to be clear, it sounded like whatever you're viewing the files with is displaying the hex values of the non-ASCII characters in angle brackets. If you've literally got the four characters "< 9 2 >" in the file, then you'd want Polyglot's advice instead.
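A minimal sketch of the "fix just the things you're seeing" route, assuming the file really is Windows-1252 and that plain ASCII stand-ins are acceptable. The byte meanings are the standard Windows-1252 ones; the helper name fix_smart_punctuation and the choice of "--" for the dash are assumptions made for illustration.

```perl
use strict;
use warnings;

# Windows-1252 bytes mapped to plain ASCII stand-ins.
# Only the five bytes mentioned in the question are covered.
my %cp1252_to_ascii = (
    "\x91" => q{'},     # left single quotation mark
    "\x92" => q{'},     # right single quotation mark (apostrophe)
    "\x93" => q{"},     # left double quotation mark
    "\x94" => q{"},     # right double quotation mark
    "\x97" => q{--},    # em dash (ASCII stand-in is an assumption)
);

# Hypothetical helper: replace the known bytes in one pass.
sub fix_smart_punctuation {
    my ($text) = @_;
    $text =~ s/([\x91-\x94\x97])/$cp1252_to_ascii{$1}/g;
    return $text;
}

my $sample = "I\x92m not lying \x97 she said \x93hi\x94";
print fix_smart_punctuation($sample), "\n";
```

If real Unicode output is preferred over ASCII stand-ins, decoding the bytes with the core Encode module (decode('cp1252', $bytes)) and writing the result out as UTF-8 is another option.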

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: convert tags to punctuation
by Polyglot (Chaplain) on Jan 13, 2021 at 01:33 UTC

    If you know which tag corresponds to which punctuation mark, it should be a cinch to convert each one via a substitution. Something like this should work:

    my $line = 'Text with unusual punctuation<91><91><91> I<92>m not going to lie<93> this is odd text<94>';
    $line =~ s/<91>/./g;
    $line =~ s/<92>/'/g;
    $line =~ s/<93>/,/g;
    $line =~ s/<94>/!/g;
    # etc.

    Or, if processing the entire file, instead of line by line, you could try it this way:

    my $source = 'my_filename.txt';
    my $target = 'new_filename.txt';    # THIS FILE WILL BE OVERWRITTEN

    # Read the whole source file into memory, one element per line.
    open my $in, '<', $source or die "Can't open $source. $!\n";
    my @array = <$in>;
    close $in;

    # Replace each tag on every line.
    for (@array) {
        s/<91>/./g;
        s/<92>/'/g;
        s/<93>/,/g;
        s/<94>/!/g;
    }

    open my $out, '>', $target or die "Can't open $target. $!\n";
    print {$out} @array;
    close $out;

    Blessings,

    ~Polyglot~

      As a variation on Polyglot's solution, you can define the tags in a hash. The advantage is that it is more easily expanded if more tags are needed. I have chosen to specify the characters by name (charnames) because I find single punctuation marks, embedded in quotes, hard to read.
      use strict;
      use warnings;

      my %tags = (
          91 => "\N{FULL STOP}",           # '.'
          92 => "\N{APOSTROPHE}",          # '''
          93 => "\N{COMMA}",               # ','
          94 => "\N{EXCLAMATION MARK}",    # '!'
      );
      my $line = 'Text with unusual punctuation<91><91><91>'
               . 'I<92>m not going to lie<93> this is odd text<94>';
      $line =~ s/<(\d\d)>/$tags{$1}/ge;
      print $line, "\n";
      Bill
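One thing to watch with the hash approach: a tag that is missing from the hash (say <97>) makes the lookup return undef, so the tag is silently deleted and a warning is printed. A hedged variation that leaves unknown tags untouched might look like this. The tag-to-punctuation mapping is the same illustrative one used in this thread, and replace_known_tags is a hypothetical helper name.

```perl
use strict;
use warnings;

# Illustrative mapping only, taken from the examples in this thread.
my %tags = (
    91 => q{.},
    92 => q{'},
    93 => q{,},
    94 => q{!},
);

# Hypothetical helper: replace tags found in %tags and
# leave any other <NN> tag exactly as it was.
sub replace_known_tags {
    my ($line) = @_;
    $line =~ s/<(\d\d)>/exists $tags{$1} ? $tags{$1} : "<$1>"/ge;
    return $line;
}

print replace_known_tags('I<92>m fine<94> See page <97>'), "\n";
```

Tags that survive the run are then easy to spot in the output and can be added to the hash later.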

        Bill -- I think your code is more maintainable. The document I am messing with is about 600,000 lines long. Is there a way to speed this up? And is there a way to get a complete list of the <ab> tags?
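For the list of tags, one possible sketch is to read the file a line at a time and tally every tag that appears. It assumes each tag is exactly two hex digits in angle brackets, which is a guess based on the examples in the question. count_tags is a hypothetical helper name, and the demo reads from an in-memory string rather than the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: tally every <NN> tag (two hex digits assumed)
# found in the lines read from a filehandle.
sub count_tags {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        $seen{$1}++ while $line =~ /<([0-9A-Fa-f]{2})>/g;
    }
    return \%seen;
}

# Demo on an in-memory filehandle. For a real file, open it with
# open my $fh, '<', 'my_filename.txt' or die ... instead.
my $text = "one<91>two<92>\nthree<92>four<97>\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my $counts = count_tags($fh);
close $fh;

# Print each distinct tag with the number of times it occurs.
printf "<%s> %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Reading line by line means only one line is held in memory at a time, and the hash-based substitution already handles all tags in a single regex pass per line, so 600,000 lines should not need anything more elaborate unless measurement shows otherwise.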

Re: convert tags to punctuation
by hippo (Bishop) on Jan 13, 2021 at 09:16 UTC