In Python 2.7, Unicode text is displayed as u'xxxx' instead of Japanese

def parse_item(self, response):
    original = 0
    author = "noauthor"
    title = "notitle"
    year = "xxxx"
    publisher = "xxxx"
    typer = "xxxx"
    ispub = 0
    filename = response.url.split("/")[-1]
    if "_" in filename:
        filename = filename.split("_")[0]
        if filename.isdigit():
            title = response.xpath("//h1/text()").extract()[0].encode("utf-8")
            author = response.xpath("//h2/text()").extract()[0].encode("utf-8")
            ID = filename
            bibliographic_info = response.xpath("//div[2]/text()").extract()
            for subyear in bibliographic_info:
                ispub = 0
                subyear = subyear.encode("utf-8").strip()
                if "初出：" in subyear:
                    publisher = subyear.split("：")[1]
                    original = 1
                    ispub = 1
                if "入力：" in subyear:
                    typer = subyear.split("：")[1]
                if len(subyear) > 1 and (original == 1) and (ispub == 0):
                    # Find the first digit; the four characters from there are taken as the year.
                    counter = 0
                    while counter < len(subyear):
                        if subyear[counter].isdigit():
                            break
                        counter += 1
                    if counter != len(subyear):
                        year = subyear[counter:(counter + 4)]
                    original = 0
            # str() on the extracted list is what produces the u'...' output asked about.
            body = str(response.xpath("//div[1]/text()").extract())
            new_filename = author + "_" + title + "_" + publisher + "_" + year + "_" + typer + ".html"
            file = open(new_filename, "a")
            file.write(body.encode("utf-8"))
            file.close()

1 answer

User

Answer #1 · posted 2024-10-06 06:49:43

# -*- coding: utf-8 -*-
# u'初出' and u'\u521d\u51fa' are different ways to specify *the same* string
assert u'初出' == u'\u521d\u51fa'
#XXX don't mix Unicode and bytes!!!
assert u'初出' != '初出' and u'初出' != '\u521d\u51fa'

Don't call str() with a Unicode string as its argument; use an explicit .encode() instead. Don't call .encode() or .decode() unless you have to; use the Unicode sandwich instead:

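Decode bytes received from the outside into Unicode
Keep everything as Unicode inside your script
Encode to bytes at the very end, when saving to a file or sending over the network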

The first and last steps may be implicit, that is, your program may only ever see Unicode text.
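The sandwich described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name save_publisher, the sample input, and the temporary output path are all hypothetical, and the input bytes are assumed to be UTF-8. It uses io.open so that the final encoding step happens implicitly on write, and it is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile


def save_publisher(raw_bytes, out_path):
    """Hypothetical helper showing the three layers of the sandwich."""
    # 1. Decode bytes from the outside world (assumed to be UTF-8).
    text = raw_bytes.decode("utf-8")
    # 2. Work on Unicode text only; no .encode()/.decode() in the middle.
    publisher = text.split("：")[1].strip()
    # 3. Encode at the boundary; io.open does this implicitly on write.
    with io.open(out_path, "w", encoding="utf-8") as f:
        f.write(publisher)
    return publisher


# Made-up sample input standing in for bytes received from a web page.
raw = "初出：サンプル誌".encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "sandwich_demo.txt")
written = save_publisher(raw, path)
```

Because the file is opened with an explicit encoding, the script never has to call .encode() on the text it writes.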

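Note that these are three different things:

How a string looks in the source code when you specify it with a string literal (Unicode escapes, source code encoding, raw string literals)
The content of the string
How it looks when printed (repr(), the 'backslashreplace' error handler)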
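A small sketch may help keep the three apart. The variable names are illustrative only; the example avoids printing non-ASCII text so that it does not depend on the terminal encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# 1. Appearance in the source: two spellings of one literal.
literal = "初出"
escaped = "\u521d\u51fa"

# 2. Content: both names refer to the same two code points.
assert literal == escaped
assert len(literal) == 2

# 3. Appearance when shown: the 'backslashreplace' error handler renders
#    the unchanged content as ASCII-only escape sequences.
ascii_view = literal.encode("ascii", "backslashreplace")
print(ascii_view.decode("ascii"))
```

The escaped form in the output says nothing about the content of the string; it is only one way of displaying it. Note that repr(literal) also differs between versions: Python 2 shows the u'...' prefix with escapes, while Python 3 shows the characters themselves.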

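If you see u'...' in the output, it means that repr() was called at some point. The call may be implicit, for example via print([unicode_string]), because repr() is called on the list items when the list is converted to a string.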
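Applied to the question's code, this suggests that the u'...' text comes from calling str() on the list returned by extract(). A minimal sketch, assuming a plain list of Unicode strings as a stand-in for that return value (the fragments list is made up for illustration):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Stand-in for response.xpath("//div[1]/text()").extract().
fragments = ["初出：", "入力："]

# str() on a list calls repr() on every item, so the result carries
# brackets and quotes (and, on Python 2, u'...' with \u escapes).
as_repr = str(fragments)

# Joining the items keeps the actual text and stays Unicode.
as_text = "".join(fragments)

assert as_repr.startswith("[")
assert as_text == "初出：入力："
```

Writing as_text through a file opened with an explicit encoding, rather than writing as_repr, would then save the Japanese text itself.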

