改进D中的线性I/O操作

3条回答

网友

1楼 · 编辑于 2024-06-18 13:01:38

在文本处理应用程序中，计算行数是否是整体性能的一个很好的代理，这是有争议的。您正在测试python的C库的效率，就像其他任何东西一样，一旦您真正开始使用数据做有用的事情，您将得到不同的结果。D比Python花更少的时间来完善标准库，而且涉及的人员也更少。beyline的性能已经讨论了几年了，我认为下一个版本会更快。在

人们似乎确实发现D对于这种类型的文本处理是高效和高效的。例如，AdRoll是众所周知的python商店，但他们的数据科学人员使用D:

http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

回到问题上来，我们显然是在比较编译器和库，就像比较语言一样。DMD的作用是作为参考编译器，并且编译速度非常快。因此，它对于快速开发和迭代非常有用，但是如果您需要速度，那么应该使用LDC或GDC，如果您确实使用DMD，那么就打开优化并关闭边界检查。在

在我的arch linux 64位HP Probook 4530s机器上，使用WestburyLab usenet语料库的最后1毫米行，我得到以下信息：

python2：实数0m0.333s，用户0m0.253s，sys 0m0.013s

pypy（预热）：实数0m0.286s，用户0m0.250s，sys 0m0.033s

DMD（默认值）：实数0m0.468s，用户0m0.460s，sys 0m0.007s

DMD（-O-释放-内联-noboundscheck）：实0m0.398s，用户0m0.393s，sys 0m0.003s

GDC（默认）：real 0m0.400s，user 0m0.380s，sys 0m0.017s [我不知道用于GDC优化的开关]

LDC（默认）：real 0m0.396s，用户0m0.380s，sys 0m0.013s

LDC（-O5）：实数0m0.336s，用户0m0.317s，sys 0m0.017s

在一个实际的应用程序中，我们将使用内置的探查器来识别热点并调整代码，但我同意naived应该是一个不错的速度，最糟糕的情况下应该与python处于相同的水平。使用LDC进行优化，这正是我们所看到的。在

为了完整起见，我将您的D代码改为以下代码。（有些进口货是不需要的-我只是在玩玩）。在

import std.stdio;
import std.string;
import std.datetime;
import std.range, std.algorithm;
import std.array;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto t=Clock.currTime();
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine)
    linect += 1;
  auto t2=Clock.currTime-t;
  writefln("There are: %s lines and took %s", linect, t2);
  return 1;
}

网友

2楼 · 编辑于 2024-06-18 13:01:38

我想我今天会做一些新的事情，所以我决定“学习”d。请注意，这是我写的第一个d，所以我可能会完全离开。在

我首先尝试的是手动缓冲：

foreach (chunk; infile.byChunk(100000)) {
    linect += splitLines(cast(string) chunk).length;
}

请注意，这是不正确的，因为它忽略了跨越边界的线，但稍后会进行修复。在

这有点帮助，但还远远不够。它确实允许我测试

^{pr2}$

这表明所有的时间都在splitLines。在

我做了^{}的本地副本。仅此一项，速度就提高了2倍！我没想到会这样。我两个都跑

dmd -release -inline -O -m64 -boundscheck=on
dmd -release -inline -O -m64 -boundscheck=off

不管怎样都差不多。在

然后我重写了splitLines，专门针对s[i].sizeof == 1，这似乎只比Python慢，因为它也破坏了段落分隔符。在

为了完成它，我做了一个范围并进一步优化了它，这使得代码接近Python的速度。考虑到Python不会中断段落分隔符，而且它的底层代码是用C编写的，这似乎没问题。这段代码在长度超过8k的行上可能具有O(n²)性能，但我不确定。在

import std.range;
import std.stdio;

auto lines(File file, KeepTerminator keepTerm = KeepTerminator.no) {
    struct Result {
        public File.ByChunk chunks;
        public KeepTerminator keepTerm;
        private string nextLine;
        private ubyte[] cache;

        this(File file, KeepTerminator keepTerm) {
            chunks = file.byChunk(8192);
            this.keepTerm = keepTerm;

            if (chunks.empty) {
                nextLine = null;
            }
            else {
                // Initialize cache and run an
                // iteration to set nextLine
                popFront;
            }
        }

        @property bool empty() {
            return nextLine is null;
        }

        @property auto ref front() {
            return nextLine;
        }

        void popFront() {
            size_t i;
            while (true) {
                // Iterate until we run out of cache
                // or we meet a potential end-of-line
                while (
                    i < cache.length &&
                    cache[i] != '\n' &&
                    cache[i] != 0xA8 &&
                    cache[i] != 0xA9
                ) {
                    ++i;
                }

                if (i == cache.length) {
                    // Can't extend; just give the rest
                    if (chunks.empty) {
                        nextLine = cache.length ? cast(string) cache : null;
                        cache = new ubyte[0];
                        return;
                    }

                    // Extend cache
                    cache ~= chunks.front;
                    chunks.popFront;
                    continue;
                }

                // Check for false-positives from the end-of-line heuristic
                if (cache[i] != '\n') {
                    if (i < 2 || cache[i - 2] != 0xE2 || cache[i - 1] != 0x80) {
                        continue;
                    }
                }

                break;
            }

            size_t iEnd = i + 1;
            if (keepTerm == KeepTerminator.no) {
                // E2 80 A9 or E2 80 A9
                if (cache[i] != '\n') {
                    iEnd -= 3;
                }
                // \r\n
                else if (i > 1 && cache[i - 1] == '\r') {
                    iEnd -= 2;
                }
                // \n
                else {
                    iEnd -= 1;
                }
            }

            nextLine = cast(string) cache[0 .. iEnd];
            cache = cache[i + 1 .. $];
        }
    }

    return Result(file, keepTerm);
}

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    auto file = File(args[1]);
    writeln("There are: ", walkLength(lines(file)), " lines.");

    return 0;
}

网友

3楼 · 编辑于 2024-06-18 13:01:38

EDIT和TL；DR：这个问题已经在https://github.com/D-Programming-Language/phobos/pull/3089中解决了。改进的File.byLine性能将从d2.068开始提供。在

我在一个575247行的文本文件中尝试了你的代码。Python基线大约需要0.125秒。下面是我的代码库，每个方法的注释中都嵌入了计时。解释如下。在

import std.algorithm, std.file, std.stdio, std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  size_t linect = 0;

  // 0.62 s
  foreach (line; File(args[1]).byLine())
    linect += 1;

  // 0.2 s
  //linect = args[1].readText.count!(c => c == '\n');

  // 0.095 s
  //linect = args[1].readText.representation.count!(c => c == '\n');

  // 0.11 s
  //linect = File(args[1]).byChunk(4096).joiner.count!(c => c == '\n');

  writeln("There are: ", linect, " lines.");
  return 0;
}

我为每个变体使用dmd -O -release -inline。在

第一个版本（最慢）一次读取一行。我们可以而且应该改进byLine的性能；目前，由于将byLine与其他C stdio操作混合使用等原因，它受到了阻碍，这可能是过于保守了。如果我们去掉这些，我们可以很容易地进行预取等

第二个版本一下子读取文件，然后使用一个标准算法计算带有谓词的行数。在

第三个版本承认这样一个事实，即不需要考虑UTF的任何细微之处；计算字节也是很好的，因此它将字符串转换为字节表示（不需要任何代价），然后对字节进行计数。在

最后一个版本（我最喜欢的）一次从文件中读取4KB的数据，并使用joiner缓慢地将它们展平。然后再次计算字节数。在

相关问题更多 >

编程相关推荐

热门问题

热门文章