Python 3.x substitution with Unicode characters

Test file (saved as UTF-8):

Test to check<CRLF>
whether it's working.<CRLF>
Aquela é<CRLF>
árvore pequena.<CRLF>

import os
input_file = 'test.txt'
input_file_path = os.path.join("c:", "\\", "Users", "Paulo", "workspace", "pdf_to_text", input_file)
input_string = open(input_file_path).read()
print(input_string)

import re
pattern = r'\n([a-zàáâãäåæçčèéêëěìíîïłðñńòóôõöøőřśŝšùúûüůýÿżžÞ]+)'
pattern_obj = re.compile(pattern)
replacement_string = " \\1"
output_string = pattern_obj.sub(replacement_string, input_string)
print(output_string)

1 answer

User

#1 · posted 2024-06-20 14:59:34

... The unicode characters é and á in the original file are changed to Ã© and Ã¡ respectively when I read() the file.

Your actual problem has nothing to do with the regex. You are reading UTF-8 text using the wrong encoding, Latin-1:

>>> print("é".encode('utf-8').decode('latin-1'))
Ã©
>>> print("á".encode('utf-8').decode('latin-1'))
Ã¡

To read a UTF-8 file, pass the encoding explicitly when opening it: open(filename, encoding='utf-8').
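A minimal sketch of the difference an explicit encoding makes. The sample text and the temporary file are assumptions made for illustration; they are not the OP's actual file or path.

```python
import os
import tempfile

# Hypothetical sample text; the file name and location are illustrative only.
sample = "Aquela é\nárvore pequena.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    # Write the sample as UTF-8 bytes so the file's encoding is known.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(sample)

    # Decoding with the wrong codec reproduces the mojibake from the question.
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

    # Decoding with the matching codec returns the original text.
    with open(path, encoding="utf-8") as f:
        right = f.read()

print(repr(wrong))
print(repr(right))
```

If open() is called without an encoding argument, Python uses a locale-dependent default, which is a likely reason the same script behaves differently from one machine to another.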

Old answer about the regex (unrelated to the OP's problem):

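In general, a single user-perceived character such as ç or é may span multiple Unicode code points, so [çé] can match those code points separately instead of matching the whole character. (?:ç|é) would fix that; there are other issues too, such as Unicode normalization (NFC, NFKD).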
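The code point issue can be illustrated with the standard unicodedata module. This is only a sketch: the variable names are made up for the example, and normalizing to NFC before matching is one possible workaround, not something the answer prescribes.

```python
import re
import unicodedata

# The same user-perceived character "é" in two different forms.
composed = unicodedata.normalize("NFC", "e\u0301")   # one code point, U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301

print(len(composed), len(decomposed))

# A character class matches exactly one code point at a time.
char_class = re.compile("[\u00e7\u00e9]")  # [çé] written with escapes

# The composed form is a single code point, so the class matches all of it.
print(char_class.fullmatch(composed) is not None)

# The decomposed form is two code points, so the class cannot match all of it.
print(char_class.fullmatch(decomposed) is not None)

# Normalizing the input to NFC first makes the class match again.
normalized = unicodedata.normalize("NFC", decomposed)
print(char_class.fullmatch(normalized) is not None)
```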

I want to replace line feeds by spaces when the next line begins with a lowercase character.

The regex module supports the POSIX character class [:lower:]:

import regex # $ pip install regex

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")
print(text)
# -> Test to check
# -> whether it's working.
# -> Aquela é
# -> árvore pequena.
print(regex.sub(r'\n(?=[[:lower:]])', ' ', text))
# -> Test to check whether it's working.
# -> Aquela é árvore pequena.

To emulate the [:lower:] class using the re module:

import re
import sys
import unicodedata

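# All characters in Unicode category Ll (lowercase letter), i.e. \p{Ll}
lower_chars = [u for u in map(chr, range(sys.maxunicode + 1))
               if unicodedata.category(u) == 'Ll']
# Alternation of the escaped characters, used inside the lookahead
lower = "|".join(map(re.escape, lower_chars))
print(re.sub(r"\n(?={lower})".format(lower=lower), ' ', text))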

The result is the same.
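Another possibility with the standard re module, sketched here as an assumption rather than as part of the original answer, is to capture the character after the line feed and let a replacement function decide. The helper name join_if_lower is hypothetical. Note that str.islower() is not defined exactly like the Ll category, so the two approaches may differ on a few unusual characters.

```python
import re

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")

def join_if_lower(match):
    """Return a space if the next character is lowercase, else keep the line feed."""
    next_char = match.group(1)
    return " " if next_char.islower() else match.group(0)

# The lookahead captures the character after the line feed without consuming it.
result = re.sub(r"\n(?=(.))", join_if_lower, text)
print(result)
```

This avoids building a pattern that lists every lowercase character, at the cost of calling a Python function for each line feed.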
