Using Chinese in Python

#!/usr/bin/env python
# -*- coding: utf-8 -*
import re
from pprint import pprint
import sys, locale, os

columnString = row[columnName]
startFrom = valuestoremove["startsTo"]
endWith = valuestoremove["endsAt"]
isInclude = valuestoremove["include"]
escapeCharsRegex = re.compile('([\.\^\$\*\+\?\[\{\|])')
nonASCIIregex = re.compile('([^\x00-\x7F])')
if escapeCharsRegex.match(startFrom):
    startFrom = re.escape(startFrom)
if escapeCharsRegex.match(endWith):
    endWith = re.escape(endWith)
if isInclude:
    regex = startFrom + '(.*)' + endWith
else:
    regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
if nonASCIIregex.match(regex):
    p = re.compile(ur'' + regex)
else:
    p = re.compile(regex)
row[columnName] = p.sub("", columnString).strip()

{ "between": { "startsTo": "(", "endsAt": "）", "include": true, "sequenceID": "1" } },
{ "between": { "startsTo": "（", "endsAt": ")", "include": true, "sequenceID": "2" } },
{ "between": { "startsTo": "(", "endsAt": ")", "include": true, "sequenceID": "2" } },
{ "between": { "startsTo": "（", "endsAt": "）", "include": true, "sequenceID": "2" } }

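2 Answers

User

#1 · Edited 2024-09-27 21:29:45

After a lot of searching and discussion, here is a solution that works for Chinese text, whether it is mixed with other languages or not: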

import codecs
import re

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    # Escape a delimiter when it begins with a regex metacharacter.
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)
    if isInclude:
        # Remove the delimiters together with everything between them.
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        # Remove only the text between the delimiters.
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'

    # Key step: compile the pattern as a UTF-8 encoded byte string (Python 2).
    p = re.compile(codecs.encode(unicode(regex), "utf-8"))
    # `localization` is not defined in this snippet; it comes from the surrounding program.
    delimiter = ' '
    if localization == 'CN':
        delimiter = ''

    row[columnName] = p.sub(delimiter, columnString).strip()

As you can see, we encode every regex as UTF-8, so that the values coming from the PostgreSQL database match the regex.
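For readers on Python 3, where `str` is always Unicode and the `ur''` prefix no longer exists, the same idea can be turned around: decode the value once and keep the pattern as ordinary text. The following is only a minimal sketch under that assumption. The names `remove_between`, `clean_db_value`, and `RULES` are hypothetical, the database value is assumed to arrive as UTF-8 bytes, and `re.escape` is called unconditionally instead of only when a delimiter starts with a metacharacter.

```python
import re

def remove_between(text, start, end, include=True, delimiter=""):
    """Hypothetical helper: remove the span delimited by `start` and `end`."""
    # re.escape handles ASCII and full-width delimiters alike,
    # so no separate metacharacter check is needed.
    s, e = re.escape(start), re.escape(end)
    if include:
        # Greedy, like the original: delimiters and content are removed.
        pattern = s + "(.*)" + e
    else:
        # Lookarounds keep the delimiters and remove only the content.
        pattern = "(?<=" + s + ").*?(?=" + e + ")"
    return re.sub(pattern, delimiter, text).strip()

def clean_db_value(raw, rules):
    """Decode a UTF-8 byte string (assumed DB output) and apply every rule."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    for rule in rules:
        text = remove_between(
            text, rule["startsTo"], rule["endsAt"], rule["include"]
        )
    return text

# The four delimiter combinations from the question's configuration.
RULES = [
    {"startsTo": "(", "endsAt": "）", "include": True},
    {"startsTo": "（", "endsAt": ")", "include": True},
    {"startsTo": "(", "endsAt": ")", "include": True},
    {"startsTo": "（", "endsAt": "）", "include": True},
]

cleaned_cn = clean_db_value("北京（首都）".encode("utf-8"), RULES)  # "北京"
cleaned_en = clean_db_value("Beijing (capital)", RULES)            # "Beijing"
```

Whichever direction is chosen, the pattern and the subject have to be the same kind of string: Python 3's `re` module raises `TypeError` when a text pattern is applied to `bytes`.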

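User

#2 · Edited 2024-09-27 21:29:45

The problem is that the text you are reading is not being interpreted as Unicode correctly (this is one of the big pitfalls that prompted the sweeping changes in Python 3k). Instead of: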

data_file = myfile.read()

you need to tell it to decode the file:

data_file = myfile.read().decode("utf8")

Then carry on with json.loads and so on, and it should work fine. Alternatively:

data = json.load(myfile, "utf8")
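The calls above are written for Python 2. On Python 3, `json.load` does not accept a positional encoding argument, and the usual approach is to open the file with an explicit encoding so that it is decoded while being read. A minimal sketch, assuming the configuration file is UTF-8; `load_json_utf8` and the temporary demo file are illustrative only:

```python
import io
import json
import os
import tempfile

def load_json_utf8(path):
    """Hypothetical helper: read a JSON file, decoding it as UTF-8."""
    with io.open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Demo: write a rule with full-width parentheses, then read it back.
rules = [{"between": {"startsTo": "（", "endsAt": "）", "include": True}}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as tmp:
    # ensure_ascii=False stores the characters themselves, not \uXXXX escapes.
    json.dump(rules, tmp, ensure_ascii=False)
    demo_path = tmp.name

loaded = load_json_utf8(demo_path)
os.remove(demo_path)
```

After loading, the delimiters are ordinary Unicode strings and can be passed to `re` without any further encoding step.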
