Reshaping data in R

import sys

def processOne(fname):
    clusters = {}
    nextCluster = 1
    with open(fname + ".csv", "r") as f:
        for line in f:
            line = line.strip()
            if line == "site,run,id,payload,dir":
                continue
            (site, run, id, payload, dir) = line.split(',')
            clind = ",".join((site,run,id))
            clust = clusters.setdefault(clind,
                                        { "i":nextCluster, "1":0, "2":0 })
            if clust["i"] == nextCluster:
                nextCluster += 1
            clust[dir] += 1
            if clust[dir] > 20:
                continue
            sys.stdout.write("{label},{i},{dir},{j},{payload}\n"
                             .format(label=fname, i=clust["i"], dir=dir,
                                     j=clust[dir], payload=payload))

sys.stdout.write("label,stream,dir,i,payload\n")
for fn in sys.argv[1:]:
    processOne(fn)

2 Answers

User

Answer 1 · edited 2024-10-01 17:27:03

R code that carries out the required steps:

"where 'label' is derived from the CSV file name;"

filvec <- list.files(<path>)
for (fil in filvec) {  #all the statements will be in the loop body
  dat <- read.csv(fil)
  dat$label <- fil   # recycling will make all the elements the same character value

"'stream' is a serial number assigned to each combination of 'site', 'run', and 'id' within a file (and is therefore only unique within a 'label')"
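One way to compute such a number, offered as a minimal sketch and not necessarily the code this answer originally used, is to paste the three key columns together and number the distinct keys in order of first appearance with match() and unique(). This mirrors how the Python script in the question hands out cluster numbers. The toy data frame and the variable name key are made up for illustration.

```r
# Toy stand-in for one CSV file (values invented for illustration)
dat <- data.frame(site    = c("a", "a", "b", "a", "b"),
                  run     = c(1, 1, 1, 2, 1),
                  id      = c(7, 7, 7, 7, 7),
                  payload = c("p1", "p2", "p3", "p4", "p5"),
                  dir     = c(1, 2, 1, 1, 1))

# One key string per row, e.g. "a,1,7"
key <- paste(dat$site, dat$run, dat$id, sep = ",")

# Number the distinct keys in the order they first appear in the file
dat$stream <- match(key, unique(key))
```

Because the numbering restarts for every file, the same 'stream' value in two different files refers to unrelated groups.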


"'i' is the row number within each 'stream';"

dat$i <- ave(seq_len(nrow(dat)),  # dummy integer vector; only the grouping matters, not the values
             dat$stream,          # 'ave' passes grouped vectors, returns same length vector
             FUN = seq_along)     # running row number within each stream

"'dir' and 'payload' are taken directly from the original file."

 # you can refer to them by name or column number

"I also want to discard all but the first 20 rows of each stream."

 out <- dat[dat$i <= 20,     # logical test for the "first 20"
             c('label','stream','dir','i','payload') ]  # chooses columns desired
     }  # end of loop

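As written, every pass through the loop overwrites 'dat' and 'out', so only the last file's result survives. (That makes it mainly useful for a one-off test run to check the speed.) You could make that last call something like: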

  assign(paste(fil, "out", sep="_"), dat[dat$i <= 20,
                                          c('label','stream','dir','i','payload') ] )
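An alternative to assign(), sketched here on the assumption that one combined data frame is wanted at the end, is to collect each file's result in a list and bind the pieces once. The helper name process_one, its keep argument, and the in-memory toy inputs are hypothetical; in real use each input would come from read.csv().

```r
# Hypothetical helper: applies the steps above to one already-loaded data frame
process_one <- function(dat, label, keep = 20) {
  dat$label <- label
  key <- paste(dat$site, dat$run, dat$id, sep = ",")
  dat$stream <- match(key, unique(key))              # serial number per key
  dat$i <- ave(seq_len(nrow(dat)), dat$stream,       # row number per stream
               FUN = seq_along)
  dat[dat$i <= keep, c("label", "stream", "dir", "i", "payload")]
}

# Toy inputs standing in for the result of read.csv() on two files
inputs <- list(
  first  = data.frame(site = c("a", "a", "b"), run = 1, id = 1,
                      payload = c("x", "y", "z"), dir = c(1, 1, 2)),
  second = data.frame(site = c("c", "c", "c"), run = 1, id = 1,
                      payload = c("u", "v", "w"), dir = c(2, 2, 1))
)

# Process every input, keeping at most 2 rows per stream, then stack the results
results <- Map(process_one, inputs, names(inputs), keep = 2)
out <- do.call(rbind, results)
```

Keeping the pieces in a list avoids creating one global variable per file and makes the final rbind a single call.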

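User

Answer 2 · edited 2024-10-01 17:27:03

The data.table package often speeds up operations on large to huge data frames.

For example, the code below takes three 500,000-row data frames as input and carries out all the transformations you described on my not very powerful laptop.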

library(data.table)

## Create a list of three 500000 row data.frames
df <- expand.grid(site=1:2, run=1:2, id=1:2)
df <- data.frame(df, payload=1:1000, dir=rep(1, 5e5))
dfList <- list(df, df, df)
dfNames <- c("firstCSV", "secondCSV", "thirdCSV")

## Manipulate the data with data.table, and time the calculations
system.time({
outputList <-
    lapply(1:3, FUN = function(ii) {
        label <- dfNames[ii]
        df <- dfList[[ii]]
        dt <- data.table(df, key=c("site", "run", "id"))
        groups <- unique(dt[,key(dt), with=FALSE])
        groups[, stream := seq_len(nrow(groups))]
        dt <- dt[groups]
        # Note: The following line only keeps the first 3 (rather than 20) rows
        dt <- dt[, head(cbind(.SD, i=seq_len(.N)), 3), by=stream]
        dt <- cbind(label, dt[,c("stream", "dir", "i", "payload")])
        df <- as.data.frame(dt)
        return(df)
    })
output <- do.call(rbind, outputList)
})
##    user  system elapsed 
##    1.25    0.18    1.44 

## Have a look at the output
rbind(head(output,4), tail(output,4))

Edit: On May 8, 2012, I cut the run time of the code above by about 25% by replacing this line:

dt <- dt[, head(cbind(.SD, i=seq_len(.N)), 3), by=stream]

with these two:

dt <- cbind(dt, i = dt[, list(i=seq_len(.N)), by=stream][[2]])
dt <- dt[i<=3,]  # Note: This only keeps the 1st 3 (rather than 20) rows
