自然语言语料库字符串

2024-10-03 02:47:53 发布

您现在位置:Python中文网/ 问答频道 /正文

从语料库1、语料库2和语料库3中抽取句子样本,并显示平均长度(以句子中的字符数来衡量)

所以我有3个语料库,样本是一个用来返回随机句子的定义函数:

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50
for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))  

所以使用这个代码所有的长度都会被打印出来,但是我如何求这些长度的和呢


Tags: sampleinforsizerawlensentence句子
3条回答

您可以将sentences的所有长度存储在^{}中,然后将它们相加

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50
s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s/150
print('average: {}'.format(average))

使用zip,它将允许您一次从每个语料库中提取一个句子

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed/3
    print(summed, average)

相关问题 更多 >