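<p>The <code>data.table</code> package often speeds up operations on large-to-huge data frames.</p>
<p>As an example, the code below takes three 500,000-row data frames as input and carries out all of the transformations you described in roughly 1.5 seconds on my none-too-powerful laptop.</p>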
<pre><code>library(data.table)
## Create a list of three 500000 row data.frames
df <- expand.grid(site=1:2, run=1:2, id=1:2)
df <- data.frame(df, payload=1:1000, dir=rep(1, 5e5))
dfList <- list(df, df, df)
dfNames <- c("firstCSV", "secondCSV", "thirdCSV")
## Manipulate the data with data.table, and time the calculations
system.time({
    outputList <-
        lapply(1:3, FUN = function(ii) {
            label <- dfNames[ii]
            df <- dfList[[ii]]
            dt <- data.table(df, key=c("site", "run", "id"))
            groups <- unique(dt[,key(dt), with=FALSE])
            groups[, stream := seq_len(nrow(groups))]
            dt <- dt[groups]
            # Note: The following line only keeps the first 3 (rather than 20) rows
            dt <- dt[, head(cbind(.SD, i=seq_len(.N)), 3), by=stream]
            dt <- cbind(label, dt[,c("stream", "dir", "i", "payload")])
            df <- as.data.frame(dt)
            return(df)
        })
    output <- do.call(rbind, outputList)
})
##    user  system elapsed
##    1.25    0.18    1.44
## Have a look at the output
rbind(head(output,4), tail(output,4))
</code></pre>
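<p>To make the individual steps easier to inspect, here is a minimal sketch of the same idea on a toy table. The table <code>toy</code> and its contents are hypothetical, and it uses <code>.GRP</code> and grouped <code>:=</code> rather than the join used above, so treat it as an illustration rather than a drop-in replacement.</p>

```r
library(data.table)

# Toy input (hypothetical): 2 sites x 2 runs, 5 rows per combination
toy <- data.table(
  site    = rep(1:2, each = 10),
  run     = rep(rep(1:2, each = 5), times = 2),
  payload = 1:20
)
setkey(toy, site, run)

# Number each (site, run) combination; .GRP is the running group counter
toy[, stream := .GRP, by = .(site, run)]

# Row counter within each stream
toy[, i := seq_len(.N), by = stream]

# Keep only the first 3 rows of every stream
first3 <- toy[i <= 3]
print(first3)
```

<p>Because <code>:=</code> adds columns by reference, no intermediate copies of the table are made, which is part of what keeps this kind of pipeline fast on large inputs.</p>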
<hr/>
<p><strong>Edit</strong>: On May 8, 2012, I cut the run time of the code above by about 25% by replacing this line:</p>
<pre><code>dt <- dt[, head(cbind(.SD, i=seq_len(.N)), 3), by=stream]
</code></pre>
<p>with these two:</p>
<pre><code>dt <- cbind(dt, i = dt[, list(i=seq_len(.N)), by=stream][[2]])
dt <- dt[i<=3,] # Note: This only keeps the 1st 3 (rather than 20) rows
</code></pre>
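<p>As a sanity check, the sketch below compares the per-group <code>head()</code> approach with the two-line "compute the counter, then filter" approach on a small table. The table <code>dt</code> here is a hypothetical stand-in with a ready-made <code>stream</code> column, and the comparison covers results only, not speed.</p>

```r
library(data.table)

# Hypothetical small table with a precomputed stream id
dt <- data.table(stream = rep(1:3, each = 5), payload = 1:15)

# Per-group approach: build the counter and truncate inside each group
a <- dt[, head(cbind(.SD, i = seq_len(.N)), 3), by = stream]

# Two-line approach: compute the counter once, then filter the whole table
b <- cbind(dt, i = dt[, list(i = seq_len(.N)), by = stream][[2]])
b <- b[i <= 3, ]

# Compare the two results
print(all.equal(a, b))
```

<p>The second form avoids calling <code>cbind()</code> and <code>head()</code> once per group, which is a plausible reason for the speed-up when there are many groups.</p>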