itertools.groupby gives an unexpected result

Posted 2024-09-30 10:27:25


I have two lists.

finalblobfpost1=['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']

In this list the "K1_SS_ALM" entries have the same date.

finalblobfpost2=['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt']

In this list the "K1_SS_ALM" entries have different dates.

I need to group the entries by K1_SS_ALM and K1_AB_KIL, that is, by the value captured by re.findall(r"\w+/\w+/\d+/(.*?)_\d+_\d+.txt", text).

What I have so far:

import re
from itertools import groupby

finalblobfpost1=['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']
# Key: the name part before the two trailing numeric fields (e.g. "K1_SS_ALM"); falls back to the full text if the regex does not match
keyf = lambda text: (re.findall(r"\w+/\w+/\d+/(.*?)_\d+_\d+\.txt", text) + [text])[0].strip()
h=[list(items) for gr, items in groupby(sorted(finalblobfpost1), key=keyf)]
print(h)

The result is as expected:

[['ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt', 'ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt'], ['ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt', 'ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt']]

Code 2:

finalblobfpost2=['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt']
# Same key function as in the first snippet (needs "import re" and "from itertools import groupby")
keyf1 = lambda text: (re.findall(r"\w+/\w+/\d+/(.*?)_\d+_\d+\.txt", text) + [text])[0].strip()
h1=[list(items) for gr, items in groupby(sorted(finalblobfpost2), key=keyf1)]
print(h1)

The result is not what I expected:

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt'], ['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt'], ['ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'], ['ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt']]

Expected:

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'],['ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt']]

It is not grouping by the key. Is something wrong with the regex, or am I doing something else wrong?

Please advise.


2 Answers

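Your list needs to be sorted with the same key function that you pass to groupby. A plain sorted() call orders the entries by their full path, so entries that share a grouping key but sit in different date directories do not end up next to each other.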

Try this:

h1=[list(items) for gr, items in groupby(sorted(finalblobfpost2, key=keyf1), key=keyf1)]

The only difference is the key=keyf1 argument in the call to sorted.

Output (the grouping you expected):

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt', 'ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'], ['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt', 'ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt']]

This is stated explicitly in the docs for itertools.groupby:

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function).
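The behavior described in that passage can be reproduced with the second list from the question. This is only a minimal sketch: the helper name group_key is made up for illustration, and its regex is a slightly tightened variant of the one in the question, not the asker's exact pattern.

```python
import re
from itertools import groupby

paths = [
    "ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt",
    "ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt",
    "ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt",
    "ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt",
]

def group_key(path):
    # Hypothetical helper: pull e.g. "K1_SS_ALM" out of the file name,
    # falling back to the whole path if the pattern does not match.
    match = re.search(r"/([^/]+?)_\d+_\d+\.txt$", path)
    return match.group(1) if match else path

# Sorted by full path: equal keys are not adjacent, so groupby starts
# a new group every time the key changes.
by_path = [list(g) for _, g in groupby(sorted(paths), key=group_key)]

# Sorted by the grouping key: equal keys are adjacent, one group per key.
by_key = [list(g) for _, g in groupby(sorted(paths, key=group_key), key=group_key)]

print(len(by_path))  # 4
print(len(by_key))   # 2
```

Because sorted is stable, entries within each group keep their original relative order.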

Try this:

import re
from itertools import groupby

print(
    [list(v) for _, v in groupby(finalblobfpost1,
                                 key=lambda x: re.search(r"\w\d+_\w{2}_\w{3}", x).group())]
)

[['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt', 'ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt'], ['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt', 'ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']]
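Note that this groups finalblobfpost1 correctly only because entries with the same key are already adjacent in that list; nothing is sorted. If the input order cannot be relied on, one possible alternative is to collect the entries in a dictionary instead of using groupby. The sketch below swaps in collections.defaultdict, which is not what either answer proposes, and reuses the regex from this answer on the second list from the question.

```python
import re
from collections import defaultdict

finalblobfpost2 = [
    "ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt",
    "ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt",
    "ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt",
    "ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt",
]

groups = defaultdict(list)
for path in finalblobfpost2:
    # Same pattern as in the answer above, written as a raw string.
    # Assumes every path matches; re.search would return None otherwise.
    key = re.search(r"\w\d+_\w{2}_\w{3}", path).group()
    groups[key].append(path)

result = list(groups.values())
print(result)
```

Groups come out in the order their keys first appear in the input, so no sorting step is needed.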
