Grouping similar items in Pandas

# We have a listing of files for the movie Titanic
# and we want to break them into groups of similar titles
# to see which of those are possible duplicates.
import pandas as pd

titanic_files = [
    {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
    {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
]
df = pd.DataFrame(titanic_files)

---- HD ----
File                Resolution  FrameRate  RunTime
Titanic_HD2398.mov  HD          23.98      102
Titanic1.mov        HD          23.98      102
Titanic.mov         HD          24.00      103
MY_HD2398.mov       HD          23.98      102

---- SD ----
File                Resolution  FrameRate  RunTime
Titanic1.mov        SD          23.98      102

---- HD -----------------------
+----------- 23.98 ------------
File                Resolution  FrameRate  RunTime
Titanic_HD2398.mov  HD          23.98      102
Titanic1.mov        HD          23.98      102
MY_HD2398.mov       HD          23.98      102
+----------- 24.00 ------------
File                Resolution  FrameRate  RunTime
Titanic.mov         HD          24.00      103

---- SD -----------------------
+----------- 23.98 ------------
File                Resolution  FrameRate  RunTime
Titanic1.mov        SD          23.98      102

1 Answer

User

#1 · Posted on 2024-05-19 15:20:05

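Pandas groupby looks like the tool to use here. It accepts as many grouping keys as you need, and each key can be a list, a Series, a column name, an index level, a callable... you name it.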
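To make the variety of key types concrete, here is a minimal sketch on a cut-down, three-row version of the file listing. The variable names (`by_name`, `by_series`, `by_callable`, `combined`) are only illustrative, and the `Runtime > 102` split is an arbitrary condition chosen for the demo.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "File": ["Titanic_HD2398.mov", "Titanic1.mov", "Titanic.mov"],
        "Resolution": ["HD", "SD", "HD"],
        "Runtime": [102, 102, 103],
    }
)

# Key 1: a column name.
by_name = df.groupby("Resolution").size()

# Key 2: a Series aligned with df's index (here a derived boolean).
by_series = df.groupby(df["Runtime"] > 102).size()

# Key 3: a callable, which pandas applies to each index label.
by_callable = df.groupby(lambda idx: idx % 2).size()

# Keys of different kinds can be mixed in a single list.
combined = df.groupby(["Resolution", df["Runtime"] > 102]).size()

print(combined)
```

Each call returns the number of rows per group; the mixed list at the end produces a two-level index, one level per key.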

For example, you could do the following:

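# Store the result under a new name so the original df stays available.
groups = df.groupby(
    [
        'Resolution',                 # key 1: a column name
        df.FrameRate//0.02 * 0.02,    # key 2: frame rate floored to 0.02-wide steps
        pd.cut(df.Runtime, bins=[45, 90, 95, 100, 120])  # key 3: runtime buckets
    ]
).File.apply(list)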

This returns a Series with a three-level MultiIndex of unique key combinations, where each entry holds a list of file names.
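Here is a self-contained sketch of the same idea on the sample data from the question. Two things in it are my own assumptions rather than part of the snippet above: the frame rate is bucketed with `round(2)` instead of floor division (float floor division can push a value into the neighbouring step, just as `1.0 // 0.1` gives `9.0`), and `observed=True` is passed so that empty runtime buckets are left out of the result.

```python
import pandas as pd

titanic_files = [
    {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
    {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
]
df = pd.DataFrame(titanic_files)

# Assumption: frame rates are nominal values, so rounding to two decimals
# is enough to bucket them.
frame_rate_key = df["FrameRate"].round(2)

# Runtime buckets with the same edges as in the snippet above
# (intervals are right-inclusive by default).
runtime_key = pd.cut(df["Runtime"], bins=[45, 90, 95, 100, 120])

# One list of file names per unique (Resolution, FrameRate, Runtime bucket).
file_lists = df.groupby(
    ["Resolution", frame_rate_key, runtime_key], observed=True
)["File"].apply(list)

print(file_lists)
```

Any group whose list has more than one entry is a candidate set of duplicates.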

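If, for whatever reason (with other data, say), you want to split one df into several dfs and keep them that way, you can also get the full rows of each group: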

for group_id, group_rows in df.groupby(...):  # "..." stands for your grouping keys
    # group_id is a tuple holding one unique combination of the grouping keys
    # group_rows is a df of the matching rows, with the same columns as df
    print(group_id, len(group_rows))
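As a rough sketch of how that loop could reproduce the nested layout asked for in the question, the snippet below splits first by `Resolution` and then by `FrameRate`, printing each block and keeping the sub-frames in a dict. The `nested` dict and the banner strings are illustrative choices of mine, and exact matching on `FrameRate` assumes the values are clean; swap in a rounded or binned key if they are noisy.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "File": ["Titanic_HD2398.mov", "Titanic1.mov", "Titanic1.mov",
                 "Titanic.mov", "MY_HD2398.mov"],
        "Resolution": ["HD", "SD", "HD", "HD", "HD"],
        "FrameRate": [23.98, 23.98, 23.98, 24.00, 23.98],
        "Runtime": [102, 102, 102, 103, 102],
    }
)

# nested[resolution][frame_rate] -> DataFrame holding the matching rows
nested = {}
for resolution, res_rows in df.groupby("Resolution"):
    print(f"---- {resolution} ----")
    nested[resolution] = {}
    # Inner split: group the rows of this resolution by frame rate.
    for frame_rate, rate_rows in res_rows.groupby("FrameRate"):
        print(f"+---- {frame_rate:.2f} ----")
        print(rate_rows.to_string(index=False))
        nested[resolution][frame_rate] = rate_rows
```

Because each value in `nested` is an ordinary DataFrame, the pieces can be inspected or exported separately later.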
