regex match unicode characters in ascii string

by 3dbc (Monk)
on Jan 27, 2017 at 18:17 UTC ( [id://1180486]=perlquestion )

3dbc has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monastery,

I think this post will be helpful to anyone looking to strip extended ascii / unicode characters out of their pesky strings to expose the correct data for manipulation elsewhere in their code.

I would appreciate a little help with a regex that matches the non-printable / extended ascii characters scattered throughout my dirty little strings. I have these nasty little characters (https://unicodelookup.com/#▼/1) in my strings and need to parse them out, but I would like the solution to match any non-alphanumeric character, not just ▼, if possible.

E.g. something like this:

FYI, this is the $string below, when you put it in code blocks it shows up wrong

my $string ="Group: Group Name▼▼Role: Role Name"

my $string ="Group: Group Name▼▼Role: Role Name";
$string =~ m/(.*)Group|Role\:\s+(.*)\s+|[^!-~\s]+(.*)$/;
print "\n\nTHIS IS THE GROUP/ROLE NAME!!!!:" . $2 . "!!!!\n\n";
gives me an incorrect result of:


THIS IS THE GROUP/ROLE NAME!!!!:Group Name▼▼Role:!!!!



The unicode characters show up in the strings as the upside-down triangle (▼).

Thanks,
- 3dbc

Replies are listed 'Best First'.
Re: regex match unicode characters in ascii string
by haukex (Archbishop) on Jan 27, 2017 at 19:54 UTC

    Hi 3dbc,

    If I'm understanding you correctly, you want to remove anything that's not printable ASCII?

    # Keep only printable ASCII plus CR, LF, TAB
    $string =~ tr/\x09\x0A\x0D\x20-\x7E//cd;
    # Keep only alphanumeric plus space
    $string =~ tr/A-Za-z0-9 //cd;

    Hope this helps,
    -- Hauke D
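
As a rough sketch of how the two transliterations above behave on the sample string from the question: this assumes the script is saved as UTF-8 and uses `use utf8`, so that each ▼ is a single character. The variable names `$original`, `$printable` and `$alnum` are illustrative only.

```perl
use strict;
use warnings;
use utf8;    # assumption: source file is UTF-8, so each ▼ is one character (U+25BC)

my $original = "Group: Group Name▼▼Role: Role Name";

# Keep only printable ASCII plus TAB, LF, CR; delete everything else
( my $printable = $original ) =~ tr/\x09\x0A\x0D\x20-\x7E//cd;

# Keep only ASCII letters, digits and space; note this also drops the colons
( my $alnum = $original ) =~ tr/A-Za-z0-9 //cd;

print "$printable\n";    # Group: Group NameRole: Role Name
print "$alnum\n";        # Group Group NameRole Role Name
```

In both results "Name" and "Role" end up fused together, because the triangles are deleted rather than replaced.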

      Thank you! I thought about this, but if I don't replace the extended ascii character with a space, how will I differentiate the group name from the role identifier?
      - 3dbc

        Hi 3dbc,

        replace the extended ascii character with a space
        $string =~ tr/\x09\x0A\x0D\x20-\x7E/ /c;

        See the documentation of tr/SEARCHLIST/REPLACEMENTLIST/cdsr under "Quote-Like Operators" in perlop.

        Hope this helps,
        -- Hauke D
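
To illustrate the `/c` form above, here is a small sketch comparing it with `/cs`, which additionally squeezes each run of replaced characters into a single space. It assumes a UTF-8 source file with `use utf8`; the names `$spaced` and `$squeezed` are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # assumption: source file is UTF-8, so each ▼ is a single character

my $string = "Group: Group Name▼▼Role: Role Name";

# /c complements the search list: every character outside it becomes a space
( my $spaced = $string ) =~ tr/\x09\x0A\x0D\x20-\x7E/ /c;

# adding /s squeezes each run of replaced characters into one space
( my $squeezed = $string ) =~ tr/\x09\x0A\x0D\x20-\x7E/ /cs;

print "[$spaced]\n";      # [Group: Group Name  Role: Role Name]
print "[$squeezed]\n";    # [Group: Group Name Role: Role Name]
```

If the string holds undecoded UTF-8 bytes instead of characters, each ▼ is three bytes, so the plain `/c` form would produce six spaces here; `/cs` still yields one.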

Re: regex match unicode characters in ascii string
by kcott (Archbishop) on Jan 28, 2017 at 03:02 UTC

    G'day 3dbc,

    I see you have a solution to your posted question. This is just a "for-future-reference" comment on:

    "FYI, this is the $string below, when you put it in code blocks it shows up wrong"

    Using '<code>' tags is much preferred: you don't have to mess around changing '&' to '&amp;', '<' to '&lt;', and so on; and we can get a verbatim copy of your code or data via the '[download]' link. However, as you've experienced, there can be problems when using non-ASCII characters; in these cases, use '<pre>' instead of '<code>' tags.

    This uses '<code>' tags:

    Group Name&#9660;&#9660;Role

    This uses '<pre>' tags:

    Group Name▼▼Role
    

    — Ken

Re: regex match unicode characters in ascii string
by poj (Abbot) on Jan 27, 2017 at 20:00 UTC

    Alternatively just select what you want rather than deleting what you don't.

    my $string = "Group: Group Name▼▼Role: Role Name";
    while ( $string =~ /(Group|Role)\:\s+([\x00-\x7f]*)/g ){ print "$1 = $2\n"; };
      Thanks for posting, but I used that regex and it returned:

      Group:= Group Name▼▼Role: Role Name

      I'm trying to get all the group and role names into group / role keys within a DBI $hashref which I already have. I want to add these key/value pairs with values that lack the Group: / Role: identifier and any of the extended ascii characters. It's a matter of cleaning up the values and organizing the data so I can work on it elsewhere.
      - 3dbc

        OK, try this:

        #!perl
        use strict;
        use HTML::Entities;
        use Data::Dump 'pp';

        my $string = "Group: Group Name&#9660;&#9660;Role: Role Name";
        $string = decode_entities($string);
        my @f=();
        while ( $string =~ /(Group|Role)\:\s+([\x00-\x7f]*)/g ){
            push @f,$2;
        };
        pp \@f;
        poj
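
One possible way to get from the matches to the group / role keys described earlier in the thread, as a sketch only: `%row` is a hypothetical stand-in for the DBI hashref, and the `raw` key is invented for the example. It assumes the string already contains the real ▼ character (UTF-8 source with `use utf8`), not the `&#9660;` entity text.

```perl
use strict;
use warnings;
use utf8;    # assumption: source file is UTF-8

# Hypothetical row; in the thread the value comes from a DBI hashref
my %row = ( raw => "Group: Group Name▼▼Role: Role Name" );
my $raw = $row{raw};

# Capture each label and the run of printable ASCII that follows it;
# the capture stops at the first character outside \x20-\x7E
while ( $raw =~ /\b(Group|Role):\s+([\x20-\x7E]*)/g ) {
    my ( $label, $value ) = ( lc $1, $2 );
    $value =~ s/\s+\z//;    # trim trailing whitespace
    $row{$label} = $value;
}

print "$_ = $row{$_}\n" for qw(group role);
# group = Group Name
# role = Role Name
```

If the data holds the literal entity text instead, those characters are all ASCII and the capture would run straight through them, so the string would need decoding first, for example with `decode_entities` from HTML::Entities.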
