Compare the first column of two CSV files with Python and print the matches

Published 2024-10-03 06:28:35


I have two CSV files, each containing n-grams like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

Each line is a three-word phrase, followed by a frequency count and then a relative frequency.

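I want to write a script that finds the n-grams present in both CSV files, divides their relative frequencies, and writes the results to a new CSV file. A match should be found whenever a three-word phrase in one file is the same as a three-word phrase in the other file. The script should then divide the relative frequency of the phrase in the first CSV file by the relative frequency of the same phrase in the second CSV file, and write the phrase and the resulting quotient to a new CSV file.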

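Below is as far as I have got. My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets, but I have no idea how to do this differently. Please forgive me, I'm new to coding. Any help you can give me would be greatly appreciated.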

^{pr2}$

3 answers

My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently.

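This is exactly what dictionaries are for: cases where you have a separate key and value (or where only part of the value is the key). So: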

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

Now, you can't use set methods directly on a dictionary. Python 3 offers some help here, but you are using 2.7, so you have to write it out explicitly:

^{pr2}$

Or:

matches = set(a_dict) & set(b_dict)
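
For readers on Python 3, the help mentioned above is that dictionary key views behave like sets, so the intersection can be taken without converting anything first. A minimal sketch, assuming Python 3 and using made-up rows in place of the parsed CSV files:

```python
# Hypothetical rows, shaped like the parsed CSV lines in the question.
alist = [
    ["drinks while strutting", "4", "1.435486010883783160220299732E-8"],
    ["and since that", "6", "4.306458032651349480660899195E-8"],
]
blist = [
    ["and since that", "2", "2.0E-8"],
    ["the state face", "3", "2.153229016325674740330449597E-8"],
]

# Key each row on its first column, the three-word phrase.
a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

# In Python 3, dict.keys() returns a set-like view, so & works directly.
matches = a_dict.keys() & b_dict.keys()
print(matches)  # → {'and since that'}
```

The result is an ordinary set of phrases; the full rows are still available through `a_dict[key]` and `b_dict[key]`.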

But you don't actually need a set here; all you need is to iterate over the keys. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
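
`do_stuff_with` is a placeholder. For the task in the question, one possible body is simply the division of the two relative frequencies. The sketch below is an illustration, not part of the original answer: the function name `relative_frequency_ratio` and the sample rows are made up, and `Decimal` is used only because the sample values carry more digits than a float keeps. Plain `float(...)` works just as well when that precision does not matter.

```python
from decimal import Decimal


def relative_frequency_ratio(a_rel, b_rel):
    """Divide the first file's relative frequency by the second file's.

    Both arguments are the raw strings read from the third CSV column.
    """
    return Decimal(a_rel) / Decimal(b_rel)


# Hypothetical matching rows from the two files.
a_values = ["and since that", "6", "4.0E-8"]
b_values = ["and since that", "2", "2.0E-8"]

ratio = relative_frequency_ratio(a_values[2], b_values[2])
print(a_values[0], ratio)
```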

As a side note, you really don't need to build lists up front just to turn them into sets or dicts. Just build the set or the dict directly:

import csv

# Build the set directly while reading the file
a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

# Or build the dict directly, keyed on the phrase in the first column
a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now use any ONE of these; the reader can only be iterated once
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
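
The three assignments above are alternatives, not steps to run in sequence: a `csv.reader` is an iterator, so whichever line runs first consumes every row. A small sketch of that behaviour, assuming Python 3 and using an in-memory string with made-up content in place of `ngrams.csv`:

```python
import csv
import io

# In-memory stand-in for ngrams.csv (made-up content).
data = "and since that,6,4.0E-8\nthe state face,3,2.0E-8\n"

reader = csv.reader(io.StringIO(data))
a_list = list(reader)                      # consumes every row
a_dict = {row[0]: row for row in reader}   # reader is already exhausted

print(len(a_list))  # → 2
print(a_dict)       # → {}
```

If more than one structure is needed, build the others from `a_list` rather than from the reader.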

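You can store the relative frequencies from the first file in a dictionary, then iterate over the second file and, whenever its first column matches a phrase from the first file, write the result straight to the output file: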

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, count, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file is processed one line at a time to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, count, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
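
The script above targets Python 2.7, where CSV files are opened in binary mode. On Python 3 the `csv` module expects text-mode files opened with `newline=""`. The following is a sketch of the same streaming approach for Python 3, not the answerer's code: the function name `write_ratios` is made up, the demo files are created in a temporary directory with invented values, and the output holds only the phrase and the quotient, dropping the placeholder `0` column.

```python
import csv
import os
import tempfile


def write_ratios(first_path, second_path, out_path):
    """Write (phrase, first_rel / second_rel) for phrases found in both files."""
    first_rel = {}
    # Load the first file into memory: phrase -> relative frequency.
    with open(first_path, newline="") as fr:
        for phrase, _count, rel in csv.reader(fr):
            first_rel[phrase] = float(rel)

    # Stream the second file row by row and write matches as they appear.
    with open(second_path, newline="") as fr, \
            open(out_path, "w", newline="") as fw:
        writer = csv.writer(fw)
        for phrase, _count, rel in csv.reader(fr):
            if phrase in first_rel:
                writer.writerow((phrase, first_rel[phrase] / float(rel)))


# Demo with small made-up files in a temporary directory.
workdir = tempfile.mkdtemp()
a_path = os.path.join(workdir, "ngrams.csv")
b_path = os.path.join(workdir, "ngramstest.csv")
out_path = os.path.join(workdir, "matchedngrams.csv")

with open(a_path, "w", newline="") as f:
    f.write("and since that,6,4.0E-8\ndrinks while strutting,4,1.0E-8\n")
with open(b_path, "w", newline="") as f:
    f.write("the state face,3,2.0E-8\nand since that,2,2.0E-8\n")

write_ratios(a_path, b_path, out_path)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

As in the original, only the first file is held in memory, and the output follows the row order of the second file.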

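I didn't dump res to a new file (tedious). The first element of each row is the phrase and the other two are the frequencies. Use a dict instead of a set so that matching and mapping happen together.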

import csv

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
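
The step this answer skips, writing `res` to a new CSV file, is short. A minimal sketch, assuming Python 3: the `res` values are made up, and `io.StringIO` stands in for a real output file such as `open("matchedngrams.csv", "w", newline="")` (the file name is borrowed from the previous answer).

```python
import csv
import io

# Made-up result, shaped like the res dict built above: phrase -> quotient.
res = {"and since that": 2.0, "the state face": 0.5}

# io.StringIO stands in for a real file opened with newline="".
out = io.StringIO()
writer = csv.writer(out)
for phrase, ratio in res.items():
    # One output row per matched phrase: the phrase, then the quotient.
    writer.writerow([phrase, ratio])

print(out.getvalue())
```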
