How do I delete a line if it contains Private Use Area characters?

$ cat textfile.txt | less
10 翴 30 <U+E4D1> ten-thirty in ... three ... two ... one .
- 10 翴 45だи<U+E145>砆秂 <U+E18E> it 's a slam-dunk .
<U+E707> 10 翴 <U+E6C4>ㄓ ? so you will be home by 10:00 ?
10 翴牧よ<U+E6BC>ㄓ<U+E5EC> bogey at 10 o'clock .
- 10 翴牧よ<U+E6BC>い盠 - ten o'clock , lieutenant , 10 o'clock !
10 翴牧よ<U+E6BC>绰玭 i see it , 8 o'clock , heading south !
10 翴筁<U+E5EC> it 's past 10:00 .
<U+E80B>ぱ 10 翴非<U+E1A0>筁ㄓ be here tomorrow , 10:00 sharp .
- 10 ，老搭档有人开枪，疑犯拒捕 shots firing . suspect 's fleeing .
- 1 -0 而已 - only 1-0 .
- 1 -0 而已 - only 1-0 .

2 Answers

User

Answer 1 · Edited 2024-10-08 19:33:10

The condition used to check whether a character belongs to a private use area (ord(i) > 57344) is incorrect:

Currently, three private use areas are defined: one in the Basic Multilingual Plane (U+E000–U+F8FF), and one each in, and nearly covering, planes 15 and 16 (U+F0000–U+FFFFD, U+100000–U+10FFFD)
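To see why a single lower bound is not enough, here is a small sketch comparing the quoted condition with a check based on the Unicode general category "Co" (private use). The helper names naive_is_pua and category_is_pua are made up for illustration. Note that 57344 is 0xE000, so the strict comparison skips U+E000 itself, while the missing upper bound flags ordinary characters such as the fullwidth comma U+FF0C that appears in the sample data.

```python
import unicodedata

def naive_is_pua(ch):
    # The condition quoted from the question: lower bound only, strict comparison.
    return ord(ch) > 57344

def category_is_pua(ch):
    # General category "Co" is "Other, private use" in the Unicode database.
    return unicodedata.category(ch) == "Co"

# U+E000 (first PUA code point), U+E4D1 (PUA), U+FF0C (fullwidth comma), U+F0000 (plane 15 PUA)
for ch in ["\ue000", "\ue4d1", "\uff0c", "\U000F0000"]:
    print(f"U+{ord(ch):04X} naive={naive_is_pua(ch)} category={category_is_pua(ch)}")
```

The two checks disagree on U+E000 and on U+FF0C, which is the kind of line the naive test would wrongly drop or keep.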

Here is the fixed Python 3 code:

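# The three Unicode private use areas: BMP, plane 15, plane 16.
pua_ranges = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))

def is_pua_codepoint(c):
    """Return True if code point c lies in a private use area."""
    return any(a <= c <= b for (a, b) in pua_ranges)

# Assumes test.txt is UTF-8 encoded; adjust the encoding if it is not.
with open('test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Print only the lines that contain no PUA character.
        if not any(is_pua_codepoint(ord(ch)) for ch in line):
            print(line, end='')  # line already ends with a newline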
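The same filter can also be written with a regular expression, which may be convenient when the lines are already in memory. This is only a sketch of an alternative, not part of the original answer; the names PUA_RE and drop_pua_lines are hypothetical.

```python
import re

# One character class covering all three private use areas.
PUA_RE = re.compile("[\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]")

def drop_pua_lines(lines):
    """Return only the lines that contain no private use character."""
    return [line for line in lines if not PUA_RE.search(line)]

sample = ["10 \ue4d1 ten-thirty\n", "- only 1-0 .\n", "10 ，shots firing .\n"]
for line in drop_pua_lines(sample):
    print(line, end="")
```

Here the first sample line is dropped, while the line containing the fullwidth comma is kept because U+FF0C is not a private use character.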

User

Answer 2 · Edited 2024-10-08 19:33:10

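This grep command prints only the lines that contain no PUA character in the range U+E000–U+F8FF (the Basic Multilingual Plane area; the areas in planes 15 and 16 are not covered). It assumes GNU grep with PCRE support (-P) running in a UTF-8 locale:

grep -Pv "[\x{E000}-\x{F8FF}]" textfile.txt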
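Because grep works on the encoded file, it can help to look at how the endpoints of the BMP range appear as UTF-8 bytes. The sketch below (the helper name utf8_hex is hypothetical) shows that each endpoint is a three-byte sequence, so the range has to be expressed as code points rather than as pairs of byte escapes such as \xe0\x00.

```python
def utf8_hex(codepoint):
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return chr(codepoint).encode("utf-8").hex(" ")

for cp in (0xE000, 0xF8FF):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
```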
