Finding common values in a column that contains lists of items



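I have a dataset in which some columns hold a list of items. I've given an example below. I am trying to find the entries whose lists match 100%. I would also like to find the ones that match 90% or less.

<pre><code>>>> import pandas as pd
>>> df2 = pd.DataFrame({
...     'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
...     'Productdetailed': [['Phone', 'Watch', 'Pen'],
...                         ['Pencil', 'fork', 'Eraser'],
...                         ['Apple', 'Mango', 'Orange'],
...                         ['Something', 'Nothing', 'Everything'],
...                         ['Eraser', 'fork', 'Pencil'],
...                         ['Phone', 'Watch', 'Pen'],
...                         ['Apple', 'Mango'],
...                         ['Pen', 'Phone', 'Watch']]})
>>> df2
  ID                   Productdetailed
0  1               [Phone, Watch, Pen]
1  2            [Pencil, fork, Eraser]
2  3            [Apple, Mango, Orange]
3  4  [Something, Nothing, Everything]
4  5            [Eraser, fork, Pencil]
5  6               [Phone, Watch, Pen]
6  7                    [Apple, Mango]
7  8               [Pen, Phone, Watch]
</code></pre>

If you look at index 0 and index 7 in <code>df2</code>, they have the same set of items but in a different order, whereas index 0 and index 5 have the same items in the same order. I want to treat both cases as a match. I tried <code>groupby</code> and <code>series.isin()</code>. I also tried splitting the dataset into two datasets, but that failed with a type error.

First, I want to count the number of items that match exactly (the number of matching rows would also do) along with the index of the row each one matched. But there are also rows with only a partial match, like index 2 and index 6 in <code>df2</code>. For those I would like to report the percentage of items that matched and the row it corresponds to.

As I mentioned, I tried splitting the data into two parts by a specific column value and then applied:

<pre><code>df2['Intersection'] = [
    list(set(a).intersection(set(b)))
    for a, b in zip(df2_part1.Productdetailed, df2_part2.Productdetailed)
]
</code></pre>

where <code>a</code> and <code>b</code> are the <code>Productdetailed</code> values from the <code>df2_part1</code> and <code>df2_part2</code> slices.

Is there a way to do this? Please help.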


1 Answer

Anonymous, 1 day ago


This solution handles the exact-match task (the nested loops give the code very high time complexity, so I would not recommend it):

<pre><code>import numpy as np

# First create a dummy column holding a sorted copy of each Productdetailed list
df2['dummy'] = df2['Productdetailed'].apply(sorted)

# Create the Matching column, which stores the index of the first matched list
df2['Matching'] = np.nan

# Find the exact matches and record the matching index in the Matching column
# (assumes the default 0..n-1 index, so enumerate positions equal index labels)
for index1, lst1 in enumerate(df2['dummy']):
    for index2, lst2 in enumerate(df2['dummy']):
        if index1 < index2:
            if lst1 == lst2:
                if np.isnan(df2.loc[index2, 'Matching']):
                    df2.loc[index1, 'Matching'] = index1
                    df2.loc[index2, 'Matching'] = index1

# Count the rows that have an exact match
print(df2['Matching'].notnull().sum())
# Output: 5

# Delete the dummy column
del df2['dummy']

# Final DataFrame
print(df2)
</code></pre>

<pre><code>  ID                   Productdetailed  Matching
0  1               [Phone, Watch, Pen]       0.0
1  2            [Pencil, fork, Eraser]       1.0
2  3            [Apple, Mango, Orange]       NaN
3  4  [Something, Nothing, Everything]       NaN
4  5            [Eraser, fork, Pencil]       1.0
5  6               [Phone, Watch, Pen]       0.0
6  7                    [Apple, Mango]       NaN
7  8               [Pen, Phone, Watch]       0.0
</code></pre>

<hr/>

For both full and partial matches, use the following. Here two rows count as a partial match if at least 2 values match; that threshold can be changed.

<pre><code>import numpy as np

# First create a dummy column holding a sorted copy of each Productdetailed list
df2['dummy'] = df2['Productdetailed'].apply(sorted)

# Create the Matching column, which stores the index of the first matched list
df2['Matching'] = np.nan

# Create a column stating the status of the match
df2['Status'] = 'No Match'

# Find full and partial matches and record the matching index in the Matching column
for index1, lst1 in enumerate(df2['dummy']):
    for index2, lst2 in enumerate(df2['dummy']):
        if index1 < index2:
            if lst1 == lst2:
                if np.isnan(df2.loc[index2, 'Matching']):
                    df2.loc[index1, 'Matching'] = index1
                    df2.loc[index2, 'Matching'] = index1
                    df2.loc[[index1, index2], 'Status'] = 'Fully Matched'
            else:
                # Number of values the two lists have in common
                count = sum([1 for v1 in lst1 for v2 in lst2 if v1 == v2])
                if count >= 2:
                    if np.isnan(df2.loc[index2, 'Matching']):
                        df2.loc[index1, 'Matching'] = index1
                        df2.loc[index2, 'Matching'] = index1
                        df2.loc[[index1, index2], 'Status'] = 'Partially Matched'

# Count the rows that have a full or partial match
print(df2['Matching'].notnull().sum())
# Output: 7

# Delete the dummy column
del df2['dummy']

# Final DataFrame
print(df2)
</code></pre>

<hr/>

<pre><code>  ID                   Productdetailed  Matching             Status
0  1               [Phone, Watch, Pen]       0.0      Fully Matched
1  2            [Pencil, fork, Eraser]       1.0      Fully Matched
2  3            [Apple, Mango, Orange]       2.0  Partially Matched
3  4  [Something, Nothing, Everything]       NaN           No Match
4  5            [Eraser, fork, Pencil]       1.0      Fully Matched
5  6               [Phone, Watch, Pen]       0.0      Fully Matched
6  7                    [Apple, Mango]       2.0  Partially Matched
7  8               [Pen, Phone, Watch]       0.0      Fully Matched
</code></pre>
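As a possible lower-complexity alternative for the exact-match part, here is a minimal sketch that groups rows by an order-insensitive key (a <code>frozenset</code> of the items) in a single pass. It also shows one way to express a partial match as a percentage. The names <code>exact_groups</code>, <code>overlap_pct</code> and <code>partial</code> are illustrative and not from the question, and defining the percentage as "shared items divided by the size of the larger list" is an assumption; the question does not say how the percentage should be computed. Using sets also ignores repeated items within one list.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# Row positions equal index labels here because df2 uses the default index.
lists = df2['Productdetailed'].tolist()

# Exact matches: one pass, grouping row positions by an order-insensitive key.
groups = defaultdict(list)
for pos, items in enumerate(lists):
    groups[frozenset(items)].append(pos)
exact_groups = [rows for rows in groups.values() if len(rows) > 1]
print(exact_groups)  # [[0, 5, 7], [1, 4]]


def overlap_pct(a, b):
    """Shared items as a percentage of the larger item set (assumed definition)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / max(len(a), len(b))


# Partial matches: still compares every pair of rows, so this part stays quadratic.
partial = []
for i, j in combinations(range(len(lists)), 2):
    pct = round(overlap_pct(lists[i], lists[j]), 1)
    if 0 < pct < 100:
        partial.append((i, j, pct))
print(partial)  # [(2, 6, 66.7)]
```

Each entry in <code>exact_groups</code> lists the rows that share the same items regardless of order, and each tuple in <code>partial</code> gives two row positions and their overlap percentage, so a cutoff such as 90% can be applied by filtering on the third element.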
