Creating an iterator to compute intersections

Posted 2024-09-28 22:21:53


I need to read a text file, compute a particular score, and then write a new file.

f=open("test.txt")
fw=open("test_result.txt","w")

The file "f" contains the following tab-separated content, in three columns:

'Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n'
'California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n'
'Arizona\t3\tWaterpark,Playground,Themepark\n'

For every pair of lines, I want to count the items that appear in both third-column lists, i.e. the size of their intersection:

len(intersect_WAandCA) #3: 'Waterpark, Themepark, Carousel': intersection between the 5 items on the first line and the 6 items on the second line
len(intersect_WAandAZ) #3: 'Waterpark, Themepark, Playground'
len(intersect_CAandAZ) #2: 'Waterpark, Themepark'

From this, I want to produce a new file like the following:

5 Washington 6 California 3
5 Washington 3 Arizona 3
6 California 3 Arizona 2 

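I tried to work out an approach using "from itertools import combinations", as in this question. Honestly, I am new to Python, and I could not figure out how to build the iterator with a loop and write the new file. My real file contains more than 100 lines.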

How can I generate all n*(n-1)/2 combinations?


1 answer
User
Reply 1 · posted 2024-09-28 22:21:53

Your input file is tab-delimited, so I would read it with the csv module and use set() to turn the third column into a set:

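import csv

# Python 3: open in text mode with newline='' so the csv module handles line endings itself
with open('test.txt', newline='') as infh:
    reader = csv.reader(infh, delimiter='\t')
    # keep the state name and turn the comma-separated third column into a set
    data = [(row[0], set(row[2].split(','))) for row in reader]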
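
As a quick check of what that list comprehension builds, here is a minimal sketch that parses the three sample rows from an in-memory buffer instead of test.txt. The names sample, rows and counts_match are illustrative and not part of the original answer.

```python
import csv
import io

# Hypothetical in-memory stand-in for test.txt, holding the three sample rows.
sample = io.StringIO(
    "Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n"
    "California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n"
    "Arizona\t3\tWaterpark,Playground,Themepark\n"
)

rows = list(csv.reader(sample, delimiter="\t"))
# Same shape as in the answer: (state name, set of attractions).
data = [(row[0], set(row[2].split(","))) for row in rows]

# The count stored in column 2 should equal the size of each set.
counts_match = all(
    int(row[1]) == len(features) for row, (_, features) in zip(rows, data)
)
print(counts_match)
```

If the counts in a real file ever disagree with the set sizes (for example because an attraction is listed twice), this check would be the place to notice it.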

Now we have data to work with. We can ignore the second column, because that number is simply the size of each set.

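from itertools import combinations

# Python 3: text mode with newline='' (the csv module writes its own line endings)
with open('test2.txt', 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='\t')
    # combinations(data, 2) yields every unordered pair of rows exactly once
    for (state1, features1), (state2, features2) in combinations(data, 2):
        overlap = len(features1 & features2)  # size of the set intersection
        writer.writerow([
            len(features1), state1,
            len(features2), state2,
            overlap])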

This produces:

>>> import csv
>>> from itertools import combinations
>>> data = '''\
... Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel
... California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag
... Arizona\t3\tWaterpark,Playground,Themepark
... '''.splitlines(True)
>>> reader = csv.reader(data, delimiter='\t')
>>> data = [(row[0], set(row[2].split(','))) for row in reader]
>>> import sys
>>> writer = csv.writer(sys.stdout, delimiter='\t')
>>> for (state1, features1), (state2, features2) in combinations(data, 2):
...     overlap = len(features1 & features2)
...     writer.writerow([
...         len(features1), state1,
...         len(features2), state2,
...         overlap])
... 
5   Washington  6   California  3
5   Washington  3   Arizona 3
6   California  3   Arizona 2
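
To tie this back to the n*(n-1)/2 count in the question, the following sketch wraps the pairing step in a generator and checks the number of pairs. The function name pairwise_overlaps and the hard-coded sample data are assumptions made for illustration, not part of the original answer.

```python
from itertools import combinations


def pairwise_overlaps(data):
    """Yield (len1, name1, len2, name2, overlap) for every unordered pair."""
    for (name1, items1), (name2, items2) in combinations(data, 2):
        yield len(items1), name1, len(items2), name2, len(items1 & items2)


# Hypothetical sample data in the same (name, set) shape the answer builds.
data = [
    ("Washington", {"Waterpark", "Themepark", "Playground", "Spaceneedle", "Carousel"}),
    ("California", {"Waterpark", "Themepark", "Disneyland", "Legoland", "Carousel", "Sixflag"}),
    ("Arizona", {"Waterpark", "Playground", "Themepark"}),
]

results = list(pairwise_overlaps(data))

# combinations(data, 2) produces exactly n*(n-1)/2 pairs for n rows.
n = len(data)
expected_pairs = n * (n - 1) // 2
print(len(results) == expected_pairs)

for row in results:
    print(*row, sep="\t")
```

Because combinations is lazy, the same generator should work unchanged for a file with 100 or more rows (4950 pairs for 100 rows); only the parsed rows are held in memory, not the pairs.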
