Adding specific characters to each list of strings in Python

import codecs
import morfessor

if __name__ == "__main__":
    io = morfessor.MorfessorIO()
    print "Importing corpus ..."
    f = codecs.open("corpus/corpus_tr_en/corpus.tr", encoding="utf-8").readlines()
    print "Importing morphology model ..."
    model = io.read_binary_model_file('seg/tr/model.bin')
    corpus = open('dataset/dataset_tr_en/full_segmented.tr', 'w')
    for a in range(len(f)):
        print str(a) + ' : ' + str(len(f))
        words = f[a].replace('\n', '').split()
        line_str = ''
        for word in words:
            segmentation = model.viterbi_segment(word)[0]
            if len(segmentation) == 1:
                line_str = '/' + segmentation[0] + '/'
            if len(segmentation) == 2:
                line_str = '/' + segmentation[0] + '* *' + segmentation[1] + '/'
            if len(segmentation) > 2:
                line_str = ''
                for b in range(len(segmentation)):
                    if (b == 0):
                        line_str = line_str + '/' + segmentation[b] + '*'
                    if (b != 0) and (b != (len(segmentation) - 1)):
                        line_str = line_str + ' *' + segmentation[b] + '* '
                    if (b == (len(segmentation) - 1)):
                        line_str = line_str + ' *' + segmentation[b] + '/'
            line_str = line_str + ' '
            corpus.write(line_str.encode('utf-8'))
        corpus.write('\n')
    corpus.close()

3 Answers

User

#1 · Edited 2024-10-05 10:37:10

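Building line_str through many string concatenations can slow things down considerably. If you want performance, repeated string concatenation is not recommended (it is fine for something like filename = base+".txt", but not for intensive processing).

Create line as a list, then use str.join to build the final string before writing it to disk. Appending to a list is much faster.

As Maximilian just suggested, you can turn your conditions into elif, since they are mutually exclusive (x2). I also added a few micro-optimizations, which improve readability as well.

Here is what I suggest your inner loop should look like: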

for word in words:
        segmentation = model.viterbi_segment(word)[0]
        lenseg = len(segmentation)
        if lenseg == 1:
                line = ['/',segmentation[0],'/']
        elif lenseg == 2:
                line = ['/',segmentation[0],'* *',segmentation[1],'/']
        elif lenseg > 2:
                line = []
                for b in range(lenseg):
                        if b == 0:
                                line += ['/',segmentation[0],'*']
                        elif b != (lenseg - 1):
                                line += [' *',segmentation[b],'* ']
                        else:
                                line+= [' *',segmentation[b],'/']
        line.append(" ")
        corpus.write("".join(line).encode('utf-8'))

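Alternatives:

write each string to the output file as soon as it is produced
write the data to an io.StringIO object, then retrieve its contents and write them to the output file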
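
The second alternative could look roughly like the sketch below. It is written for Python 3, where io.StringIO accepts str directly. `segment_word` is a hypothetical stand-in for `model.viterbi_segment(word)[0]` and is not part of Morfessor, and the separator formatting is simplified to a single '* *' join.

```python
import io

def segment_word(word):
    # Hypothetical stand-in for model.viterbi_segment(word)[0]:
    # splits a word into chunks of at most three characters.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def format_line(words, buffer):
    # Write each formatted word straight into the in-memory buffer.
    for word in words:
        buffer.write("/" + "* *".join(segment_word(word)) + "/ ")
    buffer.write("\n")

buffer = io.StringIO()
for sentence in ["evlerde kitap", "ev"]:
    format_line(sentence.split(), buffer)

# Retrieve everything at once; this string is what would go to disk.
result = buffer.getvalue()
```

Whether this beats a list plus str.join depends on the workload, so it is worth timing both on a sample of the corpus.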

User

#2 · Edited 2024-10-05 10:37:10

How about an inner loop like this:

line = '* *'.join(segmentation)
corpus.write(("/%s/ " % line).encode('utf-8'))

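Then, since you can hold the whole input in memory at once, I would also try to keep the output in memory and write it out in one go, perhaps like this: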

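lines = []
for a in range(len(f)):
    print str(a) + ' : ' + str(len(f))
    words = f[a].replace('\n', '').split()
    parts = []
    for word in words:
        segmentation = model.viterbi_segment(word)[0]
        line = '* *'.join(segmentation)
        parts.append("/%s/ " % line)
    # one output line per input sentence, as in the original loop
    lines.append("".join(parts))
corpus.write("\n".join(lines).encode('utf-8'))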
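
To see what the join-based formatting produces, here is a small self-contained sketch (Python 3). The segmentations are hand-written examples rather than real Morfessor output, and `format_word` is a name introduced here for illustration only. One thing to be aware of: for three or more segments this emits a single space between markers, whereas the loop in the question emits two.

```python
def format_word(segmentation):
    # Join the morphs with '* *' and wrap the result in slashes.
    return "/%s/ " % "* *".join(segmentation)

# Hand-written example segmentations with one, two and three morphs.
examples = [["ev"], ["ev", "ler"], ["ev", "ler", "de"]]
formatted = [format_word(seg) for seg in examples]

for text in formatted:
    print(text)
```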

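User

#3 · Edited 2024-10-05 10:37:10

Jean-François Fabre has covered the string optimization well.
Another issue is calling readlines() on 37251512 sentences. Just use for a in f; see here for a detailed explanation.
Depending on how many duplicates your data contains and on how fast the model.viterbi_segment function is, it may be beneficial to keep the results for words you have already segmented and reuse them, rather than repeating the work for duplicated words.
It seems you are using Python 2, in which case use xrange instead of range.
.replace('\n', '').split() is slow, because it has to scan the entire line when you only want to remove the final newline (in your case there cannot be more than one newline). You can strip just the trailing newline instead.
Your code has some redundancy; for example, every word needs to end with /, but you write it in 3 places.
All of these changes may be small, but they add up, and your code becomes easier to read as well.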
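
Several of these points can be combined in one loop. The following is a rough Python 3 sketch under stated assumptions: `fake_segment` is a hypothetical stand-in for `model.viterbi_segment(word)[0]`, `segment_corpus` is a name made up for this example, and the dict cache is just one possible way to avoid re-segmenting repeated words.

```python
import io

calls = []  # records which words actually reached the segmenter

def fake_segment(word):
    # Hypothetical stand-in for model.viterbi_segment(word)[0].
    calls.append(word)
    return [word[:2], word[2:]] if len(word) > 2 else [word]

def segment_corpus(lines, segment):
    cache = {}  # word -> formatted string, so duplicates are segmented once
    out = []
    for line in lines:  # iterate lazily instead of calling readlines()
        parts = []
        # split() with no arguments already discards the trailing newline
        for word in line.split():
            if word not in cache:
                # the surrounding slashes are added in exactly one place
                cache[word] = "/" + "* *".join(segment(word)) + "/"
            parts.append(cache[word])
        out.append(" ".join(parts))
    return out

source = io.StringIO("evler ev\nev evler\n")  # stands in for an open file
segmented = segment_corpus(source, fake_segment)
```

With a real corpus, `source` would be the open file object itself. The cache grows with the vocabulary, so its memory cost should be checked against the number of distinct words.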
