Removing punctuation from a list of strings when it is not part of a word

import re

f = open('Collocations.txt').read()
punctuation = [',', '.', '!', '?', '"', ':', "'", ';', '@', '&', '$', '#', '*', '^', '%', '{', '}']
filteredf = re.sub(r'[,":@#?!&$%}{]', '', f)
f = f.split()
print(len(f))
for i, j in zip (punctuation, f):
    if i == j:
        ind = f.index(j)
        f.remove(f[ind])
print(len(f))

# removes first element in the temp list to prepare to make bigrams
temp = list()
temp2 = list()
temp = filteredf.split()
temp2 = filteredf.split()
temp2.remove(temp2[0])

# forms a list of bigrams
bi = list()
for i, j in zip(temp, temp2):
    x = i + " " + j
    bi.append(x)
#print(len(bi))

# counts how often each word occurs
unigrams = dict()
for i in temp:
    unigrams[i] = unigrams.get(i, 0) + 1
#print(len(unigrams))

# counts how often each bigram occurs
bigrams = dict()
for i in bi:
    bigrams[i] = bigrams.get(i, 0) + 1
#print(len(bigrams))

1 Answer

User

#1 · Posted 2024-06-26 18:00:28

Replace

for i, j in zip (punctuation, f):
    if i == j:
        ind = f.index(j)
        f.remove(f[ind])

with

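import string

# f is assumed to still be the raw text here, i.e. this runs before f = f.split()
# note: the first and last characters of f are never tested as c2
i = 0
while i < len(f)-2:
    c1 = f[i]
    c2 = f[i+1]
    c3 = f[i+2]
    if c2 in punctuation and not (c1 in string.ascii_letters and c3 in string.ascii_letters):
        f = f[:i+1] + f[i+2:]  # drop c2, then re-check the same position
    else:
        i += 1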

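This keeps any punctuation mark that has a letter on both sides (for example, U.S.a. becomes U.S.a). As far as I can tell, though, there is no way to distinguish the period that ends an abbreviation from the period that ends a sentence, e.g. U.S.a. versus Hello.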
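For comparison, here is a minimal sketch of the same "keep punctuation only when it sits between two letters" rule written as one regular expression, followed by the unigram/bigram counting from the question. It is a different technique from the loop above, not the poster's method: the names `strip_outer_punctuation` and `count_ngrams` are made up for illustration, and the use of `collections.Counter` is my own assumption. One behavioural difference to be aware of: the regex looks at the neighbours in the original text, so in a run such as `a.,b` both marks are dropped.

```python
import re
from collections import Counter

# Punctuation set copied from the question's list.
PUNCTUATION = ',.!?":\';@&$#*^%{}'

# Match a punctuation mark whose left neighbour is not an ASCII letter,
# or whose right neighbour is not an ASCII letter (start/end of text count
# as "not a letter"). Marks with letters on both sides never match.
_PUNCT_CLASS = '[' + re.escape(PUNCTUATION) + ']'
_PATTERN = re.compile(
    r'(?<![A-Za-z])' + _PUNCT_CLASS + '|' + _PUNCT_CLASS + r'(?![A-Za-z])'
)


def strip_outer_punctuation(text):
    """Remove punctuation that is not enclosed by letters on both sides."""
    return _PATTERN.sub('', text)


def count_ngrams(text):
    """Return (unigram counts, bigram counts) for the cleaned text."""
    tokens = strip_outer_punctuation(text).split()
    unigrams = Counter(tokens)
    # Pair every token with its successor to form the bigrams.
    bigrams = Counter(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams, bigrams


if __name__ == '__main__':
    print(strip_outer_punctuation('U.S.a. is big, Hello.'))
    unigrams, bigrams = count_ngrams("don't stop! don't stop.")
    print(unigrams)
    print(bigrams)
```

Like the loop, this still cannot tell an abbreviation's final period from a sentence-ending one, so the trailing period of U.S.a. is removed in the same way as the one after Hello.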
