Splitting in a list read from a file

Posted 2024-06-28 19:29:51


I'm trying to read a big data file, file.txt, and strip out all commas, periods, and so on, so I read the file in Python with the following code:

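import re
from nltk.corpus import stopwords  # assumed source of `stopwords` (NLTK)

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")
    for word in line:
        # strip punctuation from the word
        word = re.sub('[\!@#$%^&*-/,.;:]', '', word)
        word.lower()  # no effect: str.lower() returns a new string, which is discarded here
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords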

But it prints ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn'].

How can I get rid of this \xef\xbb\xbf? I'm using Python 2.7.


1 answer

User

#1 · Posted 2024-06-28 19:29:51

That is the UTF-8 encoded BOM (byte order mark).

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
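
To make the effect of those three bytes concrete, here is a minimal sketch comparing the plain utf-8 codec with utf-8-sig on an in-memory byte string. The sample text and variable names are made up for illustration; the snippet is written to run on both Python 2.7 and Python 3.

```python
import codecs

# Hypothetical sample data: a BOM followed by two words.
raw = codecs.BOM_UTF8 + b"dataText1 dataText2"

# Plain utf-8 keeps the BOM, decoded as the single character U+FEFF.
with_bom = raw.decode("utf-8")

# utf-8-sig drops a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]))
print(without_bom.split(" "))
```

With plain utf-8 the first word still carries the BOM character, which is what shows up as \xef\xbb\xbf in the byte strings printed by the question's code.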

You can use codecs.open with the utf-8-sig encoding to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...

Side note: don't use file.readlines; just iterate over the file. If all you want is to loop over the lines, file.readlines creates an unnecessary temporary list.
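
Putting the two points together, the loop from the question could be reworked roughly as follows. This is only a sketch under a few assumptions: STOPWORDS is a small hypothetical stand-in for stopwords.words('spanish'), the sample file is generated on the fly, and io.open is used in place of codecs.open because it accepts the same encoding argument and also runs on Python 3. The hyphen in the character class is escaped so that it matches a literal "-" rather than forming a range.

```python
import codecs
import io
import os
import re
import tempfile

# Hypothetical stand-in for stopwords.words('spanish'); a set gives fast lookups.
STOPWORDS = {"de", "la", "el", "y"}
# Hyphen is escaped so it is a literal character, not a range.
PUNCTUATION = re.compile(r"[!@#$%^&*\-/,.;:]")


def important_words(path):
    words = []
    # utf-8-sig skips a leading BOM; iterating over f avoids the readlines() list.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            for word in line.split():
                # Strip punctuation, then keep the lowercased result.
                word = PUNCTUATION.sub("", word).lower()
                if word and word not in STOPWORDS:
                    words.append(word)
    return words


# Write a small sample file that starts with a BOM, then read it back.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    text = u"Datos, de la prueba.\nEl texto y m\u00e1s!\n"
    out.write(codecs.BOM_UTF8 + text.encode("utf-8"))

result = important_words(sample_path)
os.remove(sample_path)
print(result)
```

Building the stopword collection once, outside the loop, also avoids calling stopwords.words('spanish') for every word, as the original code does.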
