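I am trying to find the bigram frequency of each sequence of sounds in a list of roughly 10,000 words. So far I can get bigram frequencies, but my code counts sequences of two words in the list rather than sequences of two sounds within each word. Is there a way to specify which unit I want to count?
Here is my Python code: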
from collections import Counter
import pandas as pd

CMU_data = pd.read_csv("CMU.csv")  # open the CSV file
transcript = CMU_data["Transcription"]  # store the transcription column as a variable

def converter(x):  # convert a Series value to a tuple; leave anything else unchanged
    if isinstance(x, pd.Series):
        return tuple(x.values)
    else:
        return x

transcript2 = transcript.apply(converter).unique()
print(transcript2)

# finding bigrams (this pairs adjacent words, not adjacent sounds)
data = transcript2
bigrams = Counter(x+y for x, y in zip(*[data[i:] for i in range(2)]))
for bigram, count in bigrams.most_common():
    print(bigram, '=', count)
Here is a sample of the current output (the hash marks a word boundary):
# P OY1 N T # # S L AE1 SH # = 1
# S L AE1 SH # # TH R IY1 D IY2 # = 1
# TH R IY1 D IY2 # # K OW1 L AH0 N # = 1
# K OW1 L AH0 N # # S EH1 M IY0 K OW1 L AH0 N # = 1
# S EH1 M IY0 K OW1 L AH0 N # # S EH1 M IH0 K OW2 L AH0 N # = 1
# S EH1 M IH0 K OW2 L AH0 N # # K W EH1 S CH AH0 N M AA1 R K # = 1
# K W EH1 S CH AH0 N M AA1 R K # # AH0 # = 1
# AH0 # # EY1 # = 1
# EY1 # # EY1 Z # = 1
# EY1 Z # # EY1 F AO1 R T UW1 W AH1 N T UW1 EY1 T # = 1
(...)
Here is a sample of my input (after conversion to an array):
['# P OY1 N T # ' '# S L AE1 SH # ' '# TH R IY1 D IY2 # ' ...
'# L EH1 F T B R EY1 S # ' '# OW1 P EH0 N B R EY1 S # '
'# K L OW1 Z B R EY1 S # ']
I would like to get output like this:
TH R = 70
IY1 D = 100
IY2 # = 100
# K = 500
OW1 L = 100
AH0 N # = 200
N # = 500
Here is one way:
Sound bigram counts
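The original snippet for this step is not preserved here, so the following is only a minimal sketch of one way to count sounds rather than words. It assumes each transcription is a space-separated string like the input sample in the question; the list `transcriptions` and the helper `sound_bigrams` are hypothetical names introduced for illustration. The idea is to split each word into its sounds first, then pair adjacent sounds within that word only.

```python
from collections import Counter

# Hypothetical sample in the same format as the question's input array.
transcriptions = ['# P OY1 N T # ', '# S L AE1 SH # ', '# TH R IY1 D IY2 # ']

def sound_bigrams(transcriptions):
    """Count adjacent sound pairs inside each word (hypothetical helper)."""
    counts = Counter()
    for word in transcriptions:
        phones = word.split()  # split on whitespace into individual sounds
        # Pair each sound with the one that follows it in the same word.
        counts.update(zip(phones, phones[1:]))
    return counts

sound_counts = sound_bigrams(transcriptions)
for (first, second), count in sound_counts.most_common():
    print(first, second, '=', count)
```

Because every word is processed on its own, the `#` boundary markers are paired with the first and last sound of a word (for example `# P` and `T #`), but no pair ever spans two different words.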
Word bigram counts
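The original snippet for this step is likewise not preserved, so this is a hedged sketch rather than the answerer's code. It assumes the same space-separated transcription strings as in the question; `raw` and `words` are hypothetical names. Here the unit being paired is the whole word, which is what the code in the question effectively computes.

```python
from collections import Counter

# Hypothetical sample in the same format as the question's input array.
raw = ['# P OY1 N T # ', '# S L AE1 SH # ', '# TH R IY1 D IY2 # ']

# Strip the trailing space so each word is a clean key.
words = [w.strip() for w in raw]

# Pair each word with the word that follows it in the list.
word_counts = Counter(zip(words, words[1:]))

for (first, second), count in word_counts.most_common():
    print(first, '|', second, '=', count)
```

Keeping each pair as a tuple, instead of concatenating the two strings, makes it easier to tell where one word ends and the next begins.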