How to group by id so that all datasets become one dataset

2 answers

User
#1 · edited 2024-09-27 20:15:27

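You can use glob to find all the files, group each one by id as you read it, and then join the per-file results with pandas.concat:

    from glob import glob

    import pandas as pd

    # Group each file by id, join its sentences, then stack the per-file results.
    # With as_index=False, 'id' is kept as a regular column of the result.
    df = pd.concat([
        pd.read_csv(filename)
          .groupby('id', as_index=False)
          .agg({'sentence': '\n'.join, 'platform': 'first'})
        for filename in glob('dataset*.csv')
    ])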
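One thing to watch with the file-based approach: if the ids look like 001, read_csv will infer an integer column by default and the leading zeros are lost. Below is a minimal sketch of the same idea that reads id as text. The helper name combine_csvs, the sample rows, and the temporary files are all made up for illustration; only the pattern dataset*.csv comes from the answer above.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd


def combine_csvs(pattern):
    """Group each matching CSV by id, join its sentences, then stack the results."""
    frames = [
        pd.read_csv(path, dtype={"id": str})  # read ids such as "001" as text
        .groupby("id", as_index=False)
        .agg({"sentence": "\n".join, "platform": "first"})
        for path in sorted(glob(pattern))  # sorted for a stable row order
    ]
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample files, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({
        "sentence": ["hello, I am good.", "hello, how are u."],
        "platform": ["CNN", "CNN"],
        "id": ["001", "001"],
    }).to_csv(Path(tmp, "dataset1.csv"), index=False)
    pd.DataFrame({
        "sentence": ["ok, xxxxxxxx.", "ok, xxxxxxxxxx."],
        "platform": ["FOX", "FOX"],
        "id": ["002", "002"],
    }).to_csv(Path(tmp, "dataset2.csv"), index=False)

    combined = combine_csvs(str(Path(tmp, "dataset*.csv")))

print(combined)
```

Passing ignore_index=True to concat gives the combined frame a fresh 0..n-1 index instead of repeating each file's own index.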

User
#2 · edited 2024-09-27 20:15:27

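You can get the desired output by first concatenating the DataFrames, then using groupby (grouping by platform and id) to join the sentences within each group: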
    import pandas as pd

    df1 = pd.DataFrame({'sentence': ['hello, I am good.', 'hello, how are u.', 'hello, xxxxxxxx.'],
                        'platform': ['CNN', 'CNN', 'CNN'],
                        'id': ['001', '001', '001']})
    df2 = pd.DataFrame({'sentence': ['ok, xxxxxxxx.', 'ok, xxxxxxxx.', 'ok, xxxxxxxxxx.'],
                        'platform': ['FOX', 'FOX', 'FOX'],
                        'id': ['002', '002', '002']})
    df3 = pd.DataFrame({'sentence': ['well, xxxxxxxx.', 'well, xxxxxxxx.', 'well, xxxxxxxxxx.'],
                        'platform': ['MMM', 'MMM', 'MMM'],
                        'id': ['003', '003', '003']})

    # stack the frames, then join the sentences of each (platform, id) group
    df4 = pd.concat([df1, df2, df3])
    df5 = df4.groupby(['platform', 'id'])['sentence'].apply('\n'.join).reset_index()

    # reorder and rename the columns
    df5 = df5[['sentence', 'platform', 'id']]
    df5.columns = ['content', 'platform', 'id']
    print(df5)
Output:

                                                 content platform   id
    0  hello, I am good.\nhello, how are u.\nhello, x...      CNN  001
    1      ok, xxxxxxxx.\nok, xxxxxxxx.\nok, xxxxxxxxxx.      FOX  002
    2  well, xxxxxxxx.\nwell, xxxxxxxx.\nwell, xxxxxx...      MMM  003
You can also reverse the order: apply groupby to each individual DataFrame first, then concatenate the results.
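The reversed order could look roughly like the sketch below. The helper name join_sentences is made up for illustration, and the frames repeat the sample data from this answer so the snippet runs on its own. Note the assumption: this matches the concat-then-groupby result only when each (platform, id) pair lives in a single DataFrame. If the same id appears in several frames, its rows stay separate.

```python
import pandas as pd


def join_sentences(df):
    """Collapse one DataFrame to a single row per (platform, id) pair."""
    grouped = (
        df.groupby(["platform", "id"])["sentence"]
        .apply("\n".join)
        .reset_index()
        .rename(columns={"sentence": "content"})
    )
    # same column order as the output shown in this answer
    return grouped[["content", "platform", "id"]]


df1 = pd.DataFrame({"sentence": ["hello, I am good.", "hello, how are u.", "hello, xxxxxxxx."],
                    "platform": ["CNN"] * 3, "id": ["001"] * 3})
df2 = pd.DataFrame({"sentence": ["ok, xxxxxxxx.", "ok, xxxxxxxx.", "ok, xxxxxxxxxx."],
                    "platform": ["FOX"] * 3, "id": ["002"] * 3})
df3 = pd.DataFrame({"sentence": ["well, xxxxxxxx.", "well, xxxxxxxx.", "well, xxxxxxxxxx."],
                    "platform": ["MMM"] * 3, "id": ["003"] * 3})

# group each frame on its own, then stack the already-collapsed results
result = pd.concat([join_sentences(df) for df in (df1, df2, df3)], ignore_index=True)
print(result)
```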
