Writing Python code to get the most frequent tag and word pairs from a list

file = open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True))

[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
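A note on the output above: entries such as '0\t.\t.' and '1\t.\t.' appear because `i[1:]` removes only the first character, so a two-digit position like 10 leaves its second digit behind, and blank lines are counted under the empty string. A minimal sketch that splits on the tab instead, assuming every non-empty line has the form position, tab, word, tab, tag (the function name `count_word_tag_pairs` and the sample lines are made up for illustration):

```python
from collections import Counter


def count_word_tag_pairs(lines):
    """Count (word, tag) pairs, ignoring the leading position column."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: three tab-separated fields per line.
        _position, word, tag = line.split("\t")
        counts[(word, tag)] += 1
    return counts


# Hypothetical sample in the assumed position<TAB>word<TAB>tag layout.
sample = ["1\ti\tPRP", "2\twant\tVBP", "10\t.\t.", "", "1\ti\tPRP", "2\t.\t."]
pair_counts = count_word_tag_pairs(sample)
print(pair_counts.most_common())
```

With this approach the period is counted once as the pair ('.', '.') no matter which position it occupied in the sentence.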

2 answers

User

#1 · edited 2024-10-05 10:39:19

If you don't mind using pandas, which is a great library for tabular data, I would do the following:

import pandas as pd
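# The file is tab-separated (see the \t characters in the question's output)
df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep="\t", header=None, names=["position", "word", "tag"])
# Number of rows sharing the same (word, tag) pair, repeated on every row of the group
df["word_tag_counts"] = df.groupby(["word", "tag"])["position"].transform("count")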

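If you only want the maximum from each group, keep, for every word, only the row whose count is the largest. That gives you a table containing the values you want.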
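One possible way to select the maximum of each group is sketched below. It is not necessarily what this answer originally used: it assumes the `word` and `tag` column names from the `read_csv` call above, and it substitutes a small made-up DataFrame for the real file. The names `pair_counts`, `best_rows` and `most_frequent` are hypothetical.

```python
import pandas as pd

# Hypothetical in-memory sample standing in for trained.txt.
df = pd.DataFrame(
    {
        "position": [1, 2, 3, 1, 2, 1],
        "word": ["i", "like", "food", "i", "like", "like"],
        "tag": ["PRP", "VB", "NN", "PRP", "IN", "VB"],
    }
)

# One row per (word, tag) pair with its number of occurrences.
pair_counts = df.groupby(["word", "tag"]).size().reset_index(name="count")

# For each word, keep the row whose count is largest.
best_rows = pair_counts.loc[pair_counts.groupby("word")["count"].idxmax()]
most_frequent = best_rows.sort_values("count", ascending=False).reset_index(drop=True)
print(most_frequent)
```

`idxmax` returns a single row label per word, so when two tags tie for a word only one of them is kept.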

User

#2 · edited 2024-10-05 10:39:19

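You need a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, so there is no need to check whether each word has been seen before.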

from collections import Counter, defaultdict

counts = defaultdict(Counter)
for row in file:           # read one line into `row`
    if not row.strip():
        continue           # ignore empty lines
    pos, word, tag = row.split()
    counts[word.lower()][tag] += 1

That's it. You can now check the most common tag for any word by calling `most_common` on that word's Counter.
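A minimal sketch of that lookup, using `Counter.most_common` from the standard library. The loop is repeated so the example runs on its own, and the `rows` sample and the `best_tags` name are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical rows in the position/word/tag layout.
rows = ["1 i PRP", "2 like VB", "3 food NN", "", "1 i PRP", "2 like IN", "3 like VB"]

counts = defaultdict(Counter)
for row in rows:
    if not row.strip():
        continue  # ignore empty lines
    pos, word, tag = row.split()
    counts[word.lower()][tag] += 1

# most_common(1) returns a one-element list of (tag, count) tuples.
tag, n = counts["like"].most_common(1)[0]
print(tag, n)

# The most frequent tag for every word seen.
best_tags = {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}
print(best_tags)
```

Looking up a word that never occurred yields an empty Counter, so `most_common(1)` returns an empty list; check for that before indexing with `[0]`.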
