I am trying to do text analysis on Chinese text; the program is shown below. The output I get is unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters come out correctly as 人民日报社论. What is going on here? I cannot figure it out. I have tried several things, including adding a decoder and an encoder.
# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs

segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]
jieba.load_userdict("customized_dict.txt")

for name in file_name:
    all_text = ""
    multi_line_text = ""
    with open(raw_data_path + name + ".txt", "r") as file:
        for line in file:
            if line != '\n':
                multi_line_text += line
    templist = multi_line_text.split('\n')
    for text in templist:
        all_text += text
    seg_list = jieba.cut(all_text, cut_all=False)
    temp_text = []
    for item in seg_list:
        temp_text.append(item.encode('utf-8'))
    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))
    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)
    segList.append(text_without_stopwords)

with open("results/result.csv", 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(segList)
The cause here is a bit subtle. The code above does generate a UTF-8 encoded csv file, and the bytes in it are correct. For Excel to read a file as UTF-8, however, the file must begin with a BOM (byte order mark); otherwise Excel assumes an ANSI encoding, which depends on the locale. On a Chinese-locale Windows system that fallback is GBK, which is exactly the kind of misreading that turns 人民日报社论 into 浜烘皯鏃ユ姤绀捐. Opening result.txt in a text editor rather than Excel bypasses that guess, which is presumably why the .txt version looked fine. U+FEFF is the Unicode BOM, and writing it (UTF-8 encoded) at the start of the file makes Excel open the csv correctly.

For completeness, Python 3 makes this easier: open the file with the newline='' parameter instead of mode wb, and use the utf-8-sig encoding, which adds the BOM automatically. Unicode strings can then be written directly, without encoding each item. There is also a third-party module, unicodecsv, that makes this easier on Python 2 as well.