PerlMonks  

How to remove wide chars (ex: 年 or &#x6216) from a text file?

by Anonymous Monk
on Mar 05, 2013 at 14:49 UTC ( [id://1021837]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I have a question for you... How can I remove the below from a text file?

具三年或以上相關&#x5DE5;作經驗 - 主要負責公司內之&#x8336;水、清潔、雜務&#x7B49;工作 - 必須有責任感、整&#x6F54;、待人有禮 - 需長時間工作

Thanks in Advance!

Replies are listed 'Best First'.
Re: How to remove wide chars (ex: 年 or &#x6216) from a text file?
by davido (Cardinal) on Mar 05, 2013 at 20:52 UTC

    Do you want to remove any code point that is outside the range of normal ASCII text? Do you want to remove any HTML entity that translates to a code point beyond the ASCII range? Both of those could leave big holes in your text.

    Or are you asking how to map code points that fall outside of the ASCII range into code points that fall within? If so, what mapping would you prefer? And how would you gracefully downgrade characters that don't have any obvious ASCII approximation?

    As you begin to consider and answer (for yourself and for us) those questions, a solution will emerge.


    Dave
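
To make the two "delete" readings above concrete, here is a minimal sketch, not a definitive answer. It assumes the text has already been decoded into Perl characters; the helper names `strip_non_ascii_chars` and `strip_non_ascii_entities` are hypothetical and do not come from the thread. The first deletes literal characters beyond ASCII, the second deletes only those numeric entities (hex or decimal) whose code point lies beyond ASCII.

```perl
use strict;
use warnings;
use utf8;    # the sample string below contains literal UTF-8 characters

# Hypothetical helper: delete every literal character above U+007F.
sub strip_non_ascii_chars {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7F]//g;
    return $text;
}

# Hypothetical helper: delete numeric entities (hex or decimal) whose
# code point lies beyond ASCII; entities such as &#x41; are left alone.
sub strip_non_ascii_entities {
    my ($text) = @_;
    $text =~ s{(&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));)}{
        my $cp = defined $2 ? hex($2) : $3;
        $cp > 0x7F ? '' : $1;
    }ge;
    return $text;
}

# Small demonstration on a made-up sample in the style of the question.
my $sample = "具三年 &#x5DE5;作 &#x41; - ok";
binmode STDOUT, ':encoding(UTF-8)';
print strip_non_ascii_chars($sample),    "\n";
print strip_non_ascii_entities($sample), "\n";
```

Either helper leaves the kind of holes mentioned above, so it may be worth applying both and then inspecting what remains before deciding whether deletion is really what is wanted.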

Re: How to remove wide chars (ex: 年 or &#x6216) from a text file?
by graff (Chancellor) on Mar 06, 2013 at 02:54 UTC
    I agree with davido - your question is too vague. What I see in the OP are hexadecimal entity references for Chinese characters. Maybe you have entity references to other non-ASCII characters as well? If you just want to delete all such entity references:
    s/\&#x\w+;//g;
    (Note that the number of hex digits per character may vary.) In the case of the OP data sample, that regex leaves 7 lines that are either empty or contain just hyphens and/or spaces.
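
As a hedged follow-up to the regex above, the sketch below removes the entity references and then drops any line that is left with nothing but hyphens and whitespace. Two things here are assumptions rather than part of the reply: the helper name `clean_line` is hypothetical, and the pattern is narrowed from `\w+` to hex digits only, since `\w` would also match letters such as `g` and the underscore. The sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip hex entity references from one line.
# Returns the cleaned line, or undef when only hyphens and
# whitespace (or nothing at all) remain.
sub clean_line {
    my ($line) = @_;
    $line =~ s/&#[xX][0-9A-Fa-f]+;//g;
    return $line =~ /[^-\s]/ ? $line : undef;
}

# Invented sample lines; in practice these would come from the file.
my @sample = (
    "&#x5177;&#x4E09;&#x5E74; warehouse assistant\n",
    " - &#x5FC5;&#x9808;\n",
    "&#x9700;&#x9577;\n",
);

for my $line (@sample) {
    my $cleaned = clean_line($line);
    print $cleaned if defined $cleaned;
}
```

To run this over a real file, the loop over `@sample` could be replaced with a `while` loop over a filehandle; whether dropping the leftover lines is desirable depends on what the output is for.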
Re: How to remove wide chars (ex: 年 or &#x6216) from a text file?
by Anonymous Monk on Mar 05, 2013 at 15:28 UTC
