Improving line-wise I/O operations in D



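I need to process a lot of medium to large files (a few hundred MB to several GB) line by line, so I'm interested in the standard D approaches for iterating over lines. The <code>foreach(line; file.byLine())</code> idiom seems to fit the bill and is pleasingly terse and readable, but its performance seems less than ideal.

For example, below are two trivial programs, one in Python and one in D, that iterate over the lines of a file and count them. For a ~470 MB file (~3.6M lines) I get the following timings (best of 10 runs):

D times:

<pre><code>real    0m19.146s
user    0m18.932s
sys     0m0.190s
</code></pre>

Python times (after Edit 2, see below): (timing output missing from the source page)

Here is the D version, compiled with <code>dmd -O -release -inline -m64</code>:

<pre><code>import std.stdio;
import std.string;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }
    auto infile = File(args[1]);
    uint linect = 0;
    // byLine() yields one line per iteration; we only count them
    foreach (line; infile.byLine())
        linect += 1;
    writeln("There are: ", linect, " lines.");
    return 0;
}
</code></pre>

And now the corresponding Python version:

<pre><code>import sys

if __name__ == "__main__":
    if (len(sys.argv) < 2):
        sys.exit()
    infile = open(sys.argv[1])
    linect = 0
    for line in infile:
        linect += 1
    print "There are %d lines" % linect
</code></pre>

Edit 2: I changed the Python code to use the more idiomatic <code>for line in infile</code>, as suggested in the comments below. This made the Python version even faster, and it now approaches the speed of a standard <code>wc -l</code> call with the Unix <code>wc</code> tool.

Any advice or pointers on what I might be doing wrong in D that gives such poor performance?

Edit: For comparison, here is a D version that throws the <code>byLine()</code> idiom out the window, reads all the data into memory at once, and then splits it into lines. This gives better performance, but it is still about 2x slower than the Python version.

<pre><code>import std.stdio;
import std.string;
import std.file;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }
    // read the whole file into memory, then split it into lines
    auto c = cast(string) read(args[1]);
    auto l = splitLines(c);
    writeln("There are ", l.length, " lines.");
    return 0;
}
</code></pre>

The timings for this last version are as follows:

<pre><code>real    0m3.201s
user    0m2.820s
sys     0m0.376s
</code></pre>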
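As a side note on methodology, "best of 10" figures like the ones above can be collected with a small harness rather than by hand. The following is only a minimal sketch: `best_of` is a hypothetical helper name that does not appear in the original post, and it records wall-clock ("real") time only, not the user/sys split that `time` reports.

```python
import subprocess
import time


def best_of(cmd, runs=10):
    # Run `cmd` (a list such as ["./linecount", "big.txt"]) `runs` times
    # and return the smallest elapsed wall-clock time in seconds.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so terminal I/O does not skew timing.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best
```

For instance, `best_of(["./linecount", "big.txt"])` would time a compiled line counter; both the binary name and the file name here are placeholders. Taking the minimum rather than the mean reduces the influence of cold caches and background load on the comparison.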


1 Answer

Anonymous, 1 day ago

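It is debatable whether counting lines is a good proxy for overall performance in a text-processing application. You are testing the efficiency of Python's C library as much as anything else, and you will get different results once you actually start doing useful things with the data. D has had less time than Python to polish its standard library, and fewer people are involved. The performance of <code>byLine</code> has been under discussion for a couple of years now, and I think the next release will be faster.

People do seem to find D both efficient and productive for exactly this kind of text processing. For example, AdRoll is well known as a Python shop, but their data science people use D: <a href="http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html" rel="nofollow">http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html</a>

Coming back to the question, we are obviously comparing compilers and libraries as much as we are comparing languages. DMD's role is that of the reference compiler, and it compiles very quickly. That makes it great for rapid development and iteration, but if you need speed you should use LDC or GDC, and if you do use DMD, turn on optimizations and turn off bounds checking.

On my Arch Linux 64-bit HP ProBook 4530s machine, using the last 1 million lines of the WestburyLab usenet corpus, I get the following:

python2: real 0m0.333s, user 0m0.253s, sys 0m0.013s
pypy (warmed up): real 0m0.286s, user 0m0.250s, sys 0m0.033s
DMD (defaults): real 0m0.468s, user 0m0.460s, sys 0m0.007s
DMD (-O -release -inline -noboundscheck): real 0m0.398s, user 0m0.393s, sys 0m0.003s
GDC (defaults): real 0m0.400s, user 0m0.380s, sys 0m0.017s [I don't know which switches to use for GDC optimization]
LDC (defaults): real 0m0.396s, user 0m0.380s, sys 0m0.013s
LDC (-O5): real 0m0.336s, user 0m0.317s, sys 0m0.017s

In a real application one would use the built-in profiler to identify hotspots and tune the code, but I agree that naive D should run at a decent speed and, at worst, be in the same ballpark as Python. With LDC and optimization turned on, that is indeed what we see.

For completeness, I changed your D code to the following. (Some of the imports aren't needed; I was just playing around.)

<pre><code>import std.stdio;
import std.string;
import std.datetime;
import std.range, std.algorithm;
import std.array;

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }
    auto t = Clock.currTime();          // start of the timed region
    auto infile = File(args[1]);
    uint linect = 0;
    foreach (line; infile.byLine)
        linect += 1;
    auto t2 = Clock.currTime - t;       // elapsed time as a Duration
    writefln("There are: %s lines and took %s", linect, t2);
    return 0;                           // report success to the shell
}
</code></pre>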
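To illustrate the point that a line count mostly measures the I/O library rather than the language, here is a rough sketch of a baseline that skips line splitting altogether and just counts newline bytes in large binary chunks, which is essentially the quantity `wc -l` reports. The function name `count_newlines` and the 1 MiB default chunk size are my own assumptions for illustration, not something taken from the post.

```python
def count_newlines(path, chunk_size=1 << 20):
    # Read the file in fixed-size binary chunks and count newline bytes.
    # No per-line objects are created, so this approximates the cost of
    # raw buffered I/O plus a fast byte scan.
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += chunk.count(b"\n")
    return total
```

Note that, like `wc -l`, this does not count a final line that lacks a trailing newline, so it can be one lower than a loop over `for line in infile`. If a line-by-line loop in any language is much slower than a baseline of this kind, the difference is probably spent in the line-splitting layer rather than in reading the file.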
