Inner join with huge data frames (~2 million columns)

2 Answers

User

#1 · Edited 2024-07-01 07:07:22

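Thanks for the help, everyone! Using data.table, as @shadowtalker suggested, sped up the process enormously. For reference, in case anyone tries something similar: df1 was about 400 MB and my df2 file was about 3 GB.

Here is what I did: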

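library(data.table)
# setDT() converts a data.frame to a data.table by reference (no copy)
setDT(df1)
setDT(df2)
# Key both tables on the join column
setkey(df1, Name)
setkey(df2, Name)
# Keyed join; nomatch = 0 drops unmatched rows, making it an inner join
df3 <- df1[df2, nomatch = 0]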
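
For anyone working in pandas rather than R, a rough analogue of the keyed join can be sketched as below. This is only an illustration under assumptions: the tiny `df1` and `df2` are made-up stand-ins, and setting an index is merely similar in spirit to `setkey()`, so it may not bring a comparable speedup.

```python
import pandas as pd

# Hypothetical miniature stand-ins for df1 and df2
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["b", "c", "d"], "y": [20, 30, 40]})

# Index both frames on the join column, loosely analogous to setkey()
left = df1.set_index("Name")
right = df2.set_index("Name")

# Inner join on the index keeps only the names present in both frames
df3 = left.join(right, how="inner").reset_index()
print(df3)
```

Note that the row and column order of the result may differ from what data.table returns.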

User

#2 · Edited 2024-07-01 07:07:22

This is a really ugly workaround: I split up the columns of df2 and add them in batches. I'm not sure it will work, but it might be worth a try:

import numpy as np

# First, keep only the rows of df1 whose "Name" also appears in df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")

# Then collect all the column labels of df2
# (excluding the "Name" column)
df2_columns = df2.columns[~df2.columns.isin(["Name"])]

# This determines how many columns get added on each pass.
num_cols_per_loop = 1000

# Number of passes needed for that batch size
# (ceiling division, so there is no empty final pass)
num_loops = -(-len(df2_columns) // num_cols_per_loop)

for i in range(num_loops):
    # For each pass, determine which columns will get added
    this_column_sublist = df2_columns[i*num_cols_per_loop : (i+1)*num_cols_per_loop]

    # Also include the "Name" column as the join key so the
    # values line up with the right observations
    this_column_sublist = np.append("Name", this_column_sublist)

    # Finally, merge with just this subset of df2's columns
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")

Like I said, it's an ugly workaround, but it might do the trick.
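
If it helps, the same idea can be wrapped in a function and checked on toy data. This is just a sketch under assumptions: `chunked_inner_merge`, its `cols_per_chunk` parameter, and the sample frames are made up for illustration, and the function assumes "Name" is unique in df2, since duplicate keys would multiply rows on every pass.

```python
import pandas as pd


def chunked_inner_merge(df1, df2, key="Name", cols_per_chunk=1000):
    """Inner-join df2 onto df1 on `key`, adding df2's columns in chunks."""
    # Assumption: `key` is unique in df2; duplicates would multiply rows each pass
    if not df2[key].is_unique:
        raise ValueError(f"{key!r} must be unique in df2 for chunked merging")

    # Start with only the rows of df1 whose key also appears in df2
    result = df1.merge(df2[[key]], how="inner", on=key)

    # All df2 columns except the key, in their original order
    other_cols = [c for c in df2.columns if c != key]

    # Walk through the columns chunk by chunk; range() covers the last partial chunk
    for start in range(0, len(other_cols), cols_per_chunk):
        chunk = [key] + other_cols[start:start + cols_per_chunk]
        result = result.merge(df2[chunk], how="inner", on=key)
    return result


# Tiny made-up example: 5 value columns added 2 at a time
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame(
    {"Name": ["b", "c", "d"], **{f"v{i}": [i, i + 10, i + 20] for i in range(5)}}
)

df3 = chunked_inner_merge(df1, df2, cols_per_chunk=2)

# Compare against a single direct merge, which is feasible at this size
expected = df1.merge(df2, how="inner", on="Name")
print(df3.equals(expected))
```

Comparing against a direct merge on a small sample like this is a cheap way to confirm the chunking logic before running it on the full data.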
