Python: Removing a specific character (u"\u2610") from a string

for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    #this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)
    #clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +',' ', nonewlines)
    print noextrawhitespace #the boxes appear

3 Answers

User

Answer 1 · edited 2024-09-28 05:19:11

Reading your sample file, these are the non-ASCII characters in the document:

0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
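
A listing like this can be produced with the standard-library unicodedata module. The following is a minimal Python 3 sketch, not necessarily how the list above was generated; the function name list_non_ascii and the sample string are made up for illustration.

```python
import unicodedata


def list_non_ascii(text):
    """Return (hex code point, Unicode name) pairs for each distinct non-ASCII character."""
    found = {ch for ch in text if ord(ch) > 127}
    # Sort by code point so the report is stable between runs.
    return [(hex(ord(ch)), unicodedata.name(ch, "<unnamed>")) for ch in sorted(found)]


# Hypothetical sample; the real input would be the decoded contents of the XML file.
sample = "Well,\u2014poor Keck\u2223ky\u2026"
for code, name in list_non_ascii(sample):
    print(code, name)
```

Running the same function over the decoded text of each file is a quick way to see which characters need special handling before deciding what to strip.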

\u2223 is the actual character on line 3682, where it is being used as a soft hyphen. The other characters are used in the markup for illegible characters.

Here is some code that does what your code is attempting. Make sure to do the processing in Unicode:

from bs4 import BeautifulSoup
import re

with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f)  # BS figures out the encoding

text = u''.join(soup.strings)      # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text)  # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text

Output:

[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
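
The code above is Python 2 (print statement, ur'' literals). For readers on Python 3, where every str is already Unicode, the two clean-up steps might look like the sketch below. It works on an in-memory string rather than the XML file, and the helper name clean_text and the sample text are assumptions for illustration.

```python
import re


def clean_text(text):
    """Collapse whitespace runs to single spaces and drop U+2223 (DIVIDES)."""
    text = re.sub(r"\s+", " ", text)   # Simplify all white space.
    return text.replace("\u2223", "")  # Remove the character used as a soft hyphen.


# Hypothetical sample text, not taken from the actual file.
sample = "Bride\u2223groom\n   too."
print(clean_text(sample))
```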

User

Answer 2 · edited 2024-09-28 05:19:11

Try this:

noextrawhitespace.replace("\\u2610", "")

I think you are just missing that extra "\".
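
Whether the extra backslash helps depends on what the string actually contains. As a small Python 3 sketch (the variable names are illustrative): "\\u2610" is the six-character text backslash, u, 2, 6, 1, 0, while "\u2610" is the single BALLOT BOX character, so the escaped form only matches data that holds the literal escape sequence.

```python
box = "\u2610"       # one character: U+2610 BALLOT BOX
escaped = "\\u2610"  # six characters: backslash, 'u', '2', '6', '1', '0'

text = "abc\u2610def"  # contains the real box character

unchanged = text.replace(escaped, "")  # no match, so nothing is removed
cleaned = text.replace(box, "")        # the box character is removed

print(len(box), len(escaped))
print(unchanged == text)
print(cleaned)
```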

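User

Answer 3 · edited 2024-09-28 05:19:11

The problem is that you are mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does using sys.getdefaultencoding(). That is usually ASCII, which is almost never what you want.*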
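
The implicit conversion described above is Python 2 behaviour and cannot be reproduced directly on Python 3, which refuses to mix the two types. The sketch below, written for Python 3 with illustrative variable names, shows both sides: the TypeError that Python 3 raises, and the explicit ASCII decode that corresponds to what Python 2 attempted behind the scenes.

```python
# UTF-8 bytes, comparable to what a Python 2 str read from a file would hold.
utf8_bytes = "abc\u2610def".encode("utf-8")

errors = []

# Python 3 will not mix bytes and text in one call.
try:
    utf8_bytes.replace("\u2610", b"")
except TypeError as exc:
    errors.append(type(exc).__name__)

# Decoding the bytes as ASCII fails on the first non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    errors.append(type(exc).__name__)

print(errors)
```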

If the exception is coming from this line:

noboxes = noextrawhitespace.replace(u"\u2610", "")

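…the fix is simple, except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it is:

noboxes = noextrawhitespace.replace(u"\u2610", u"")

If the latter, it is: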

noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")

But really, you have to make all of the strings in your code consistent; mixing the two types is going to cause more problems than just this one.
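
As a sketch of what "consistent" can mean, here are both variants in Python 3 terms, where str plays the role of Python 2's unicode and bytes the role of Python 2's str. The helper names are hypothetical; the point is that each function stays within a single type.

```python
BOX = "\u2610"  # U+2610 BALLOT BOX


def strip_box_text(text: str) -> str:
    """All-text variant: search string, replacement and data are all str."""
    return text.replace(BOX, "")


def strip_box_utf8(data: bytes) -> bytes:
    """All-bytes variant: everything is UTF-8 encoded bytes."""
    return data.replace(BOX.encode("utf-8"), b"")


raw = "abc\u2610def".encode("utf-8")  # bytes as they might arrive from a file

as_text = strip_box_text(raw.decode("utf-8"))  # decode once, then work in text
as_bytes = strip_box_utf8(raw)                 # or stay in bytes throughout

print(as_text, as_bytes)
```

Decoding once at the input boundary and working in text afterwards is usually the easier of the two to keep consistent.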

Since I don't have your XML files to test with, I wrote my own:

<xml>
    <text>abc&#9744;def</text>
</xml>

Then I added these two lines to the bottom of your code (plus a little at the top to just open my file instead of globbing for files):

noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes

The output is now:

[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]

So, I think that's what you wanted.

* Sure, sometimes you do want ASCII… but those usually aren't the times when you have unicode objects.
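
The test above can be approximated on Python 3 without the original script. This sketch swaps BeautifulSoup for the standard-library xml.etree.ElementTree parser, so it illustrates the idea rather than reproducing the exact run; it assumes the same small test document.

```python
import xml.etree.ElementTree as ET

# The same test document as above; &#9744; is the decimal reference for U+2610.
doc = "<xml>\n    <text>abc&#9744;def</text>\n</xml>"

root = ET.fromstring(doc)
text = root.find("text").text         # the parser resolves the character reference
noboxes = text.replace("\u2610", "")  # text-to-text replace, no type mixing

print(repr(text))
print(repr(noboxes))
```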
