Java: an efficient array for a large number of words

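I'm looking for a Java data structure to store a large text (about a million words), such that I can get a word by its index (for example, get the 531467th word).

The problem with String[] or ArrayList is that they take too much memory: about 40 bytes per word in my environment.

I thought about using a String[] where each element is a chunk of 10 words joined by a space. That is much more memory-efficient, about 20 bytes per word, but access is much slower.

Is there a more efficient way to solve this?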

6 answers

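# Answer 1

As Jon Skeet already mentioned, 40 MB isn't too large.

But you state that you are storing a text, so there may be many identical Strings, for example stop words like "and" and "or".

You can use String.intern() [1]. This pools the String and returns a reference to an already existing equal String.

intern() tends to be quite slow, so you can replace it with a HashMap that does the same job for you.

[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29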
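
The HashMap replacement for intern() could look roughly like the sketch below. The `WordPool` class and its `canonical` method are names made up for illustration, not part of any library; the only assumption is Java 8 or later for `Map.putIfAbsent`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a HashMap-backed pool that plays the role of String.intern()
class WordPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the instance stored for an equal string seen earlier,
    // or registers and returns this one if it is the first of its kind.
    String canonical(String word) {
        String existing = pool.putIfAbsent(word, word);
        return existing != null ? existing : word;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        WordPool pool = new WordPool();
        String[] raw = "the cat and the dog and the bird".split(" ");
        String[] words = new String[raw.length];
        for (int i = 0; i < raw.length; i++) {
            words[i] = pool.canonical(raw[i]);
        }
        // Both occurrences of "the" now share a single String object
        System.out.println(words[0] == words[3]);
        System.out.println(pool.size() + " distinct words");
    }
}
```

The pool itself costs memory while it is alive, so once the word array has been built it can be dropped and only the array kept.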
# Answer 2
One option would be to store byte arrays containing the UTF-8 encoded text:
```
byte[][] words = ...;
```
Then:
```
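// needs: import java.nio.charset.StandardCharsets; (Java 7+)
public String getWord(int index)
{
    // The Charset overload avoids the checked UnsupportedEncodingException
    // that new String(bytes, "UTF-8") would force callers to handle
    return new String(words[index], StandardCharsets.UTF_8);
}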
```
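This will be smaller in two ways:

- The data for each string is held directly in a byte[], rather than in a String object that carries a couple of integer members plus a reference to a separate char[] object
- If your text is mostly ASCII, UTF-8 uses only a single byte per character for those ASCII characters

However, I wouldn't really recommend this approach. Again, access will be slower, because a new String has to be created each time. Fundamentally, if you need a million String objects (so that you don't pay the re-creation penalty on every access), then you're going to have to use the memory for a million String objects.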
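
For completeness, here is a minimal sketch of how the byte[][] could be filled and read back. The `Utf8WordArray` wrapper is a hypothetical name introduced only for this example, and it assumes the words are already available as a collection of Strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical wrapper around the byte[][] idea; the class name is illustrative only
class Utf8WordArray {
    private final byte[][] words;

    Utf8WordArray(Collection<String> source) {
        words = new byte[source.size()][];
        int i = 0;
        for (String word : source) {
            // Encode once up front; ASCII characters take one byte each
            words[i++] = word.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Decoding allocates a fresh String on every call, which is the access cost noted above
    String getWord(int index) {
        return new String(words[index], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Utf8WordArray store = new Utf8WordArray(Arrays.asList("large", "word", "array"));
        System.out.println(store.getWord(1));
    }
}
```

Each word still carries the header of its own byte[] object, so the saving comes from dropping the String wrapper and from the one-byte encoding of ASCII text.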
# Answer 3
You could create a data structure like this:
- List<string> wordlist
- Dictionary<string, int> tsildrow // for reverse lookup while building the structure
- List<int> wordindex
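wordlist holds the list of all (unique) words, tsildrow gives the index of a word in wordlist, and wordindex tells you the wordlist index for a specific position in the text.

You would operate it as follows: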
```
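wordlist, tsildrow, wordindex = [], {}, []
for word in text:  # text: an iterable of the words, in order
    if word not in tsildrow:
        wordlist.append(word)
        tsildrow[word] = len(wordlist) - 1  # index of the word just added
    wordindex.append(tsildrow[word])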
```
This fills your data structure. Now, to look up the word at index 531467:
```
print(wordlist[wordindex[531467]])
```
You can reproduce the entire text like this:
```
for index in wordindex:
    print(wordlist[index], end=' ')  # words separated by a single space
```
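Except that you will still have the problem of punctuation and the like.

If you won't be adding any more words (i.e. your text is stable), you can delete tsildrow to free up some memory, if that is a concern.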
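
Since the question is about Java, the same structure might be sketched as follows. The `IndexedText` class name and its methods are invented for this example. Using an int[] for wordindex instead of a list of boxed integers is an assumption made here to keep the per-word cost at four bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Java rendering of the wordlist / tsildrow / wordindex structure
class IndexedText {
    private final List<String> wordlist = new ArrayList<>(); // unique words
    private final int[] wordindex; // position in the text -> index into wordlist

    IndexedText(List<String> text) {
        // Reverse lookup, needed only while building
        Map<String, Integer> tsildrow = new HashMap<>();
        wordindex = new int[text.size()];
        int position = 0;
        for (String word : text) {
            Integer id = tsildrow.get(word);
            if (id == null) {
                id = wordlist.size();
                wordlist.add(word);
                tsildrow.put(word, id);
            }
            wordindex[position++] = id;
        }
        // tsildrow becomes unreachable here, so its memory can be reclaimed
    }

    String wordAt(int position) {
        return wordlist.get(wordindex[position]);
    }

    int uniqueWords() {
        return wordlist.size();
    }

    public static void main(String[] args) {
        IndexedText text = new IndexedText(Arrays.asList("to be or not to be".split(" ")));
        System.out.println(text.wordAt(4));
        System.out.println(text.uniqueWords() + " unique words");
    }
}
```

Lookup by position stays a plain array access plus a list access, and no new String is allocated when a word is fetched.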
# Answer 4
```
-XX:+UseCompressedStrings
```
Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

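This also looks like an interesting article: http://www.javamex.com/tutorials/memory/string_saving_memory.shtml

I've heard that ropes are quite good in terms of speed when storing large strings, though I'm not sure how they do memory-wise. But you might want to check them out. http://ahmadsoft.org/ropes/ http://en.wikipedia.org/wiki/Rope_%28computer_science%29
# Answer 5

You could consider memory mapping the data structure, though performance could turn out to be pretty terrible.

# Answer 6

Store all the words in a single String: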

```
import java.util.Arrays;
import java.util.Collection;

class WordList {

    // All words concatenated back to back, with no separator
    private final String content;
    // indices[i] is the offset in content where word i starts
    private final int[] indices;

    public WordList(Collection<String> words) {
        StringBuilder buf = new StringBuilder();
        indices = new int[words.size()];
        int currentWordIndex = 0;
        int previousPosition = 0;
        for (String word : words) {
            buf.append(word);
            indices[currentWordIndex++] = previousPosition;
            previousPosition += word.length();
        }
        content = buf.toString();
    }

    public String wordAt(int index) {
        // The last word runs to the end of content
        if (index == indices.length - 1) return content.substring(indices[index]);
        // Every other word ends where the next one starts
        return content.substring(indices[index], indices[index + 1]);
    }

    public static void main(String... args) {
        WordList list = new WordList(Arrays.asList(args));
        for (int i = 0; i < args.length; ++i) {
            System.out.printf("Word %d: %s%n", i, list.wordAt(i));
        }
    }

}
```

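Apart from the characters they contain, each word has an overhead of four bytes with this solution (its entry in indices). Retrieving a word with wordAt will always allocate a new String; you could avoid this by saving the toString() of the StringBuilder rather than the builder itself, although that uses more memory at construction time.

Depending on the kind of text, language, and so on, you might want a solution that deals better with recurring words (like the one previously proposed).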
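
One variation that nobody in this thread proposed as such, but which combines the single-buffer idea with the UTF-8 encoding from Answer 2, is to pack the encoded bytes into one byte[] and keep int offsets into it. The `PackedWordList` class below is a hypothetical sketch of that combination, not a measured or recommended implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical variant: one flat UTF-8 buffer plus start offsets
class PackedWordList {
    private final byte[] content; // all words, UTF-8 encoded, back to back
    private final int[] offsets;  // offsets[i] = start of word i; the final entry = content.length

    PackedWordList(Collection<String> words) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        offsets = new int[words.size() + 1];
        int i = 0;
        for (String word : words) {
            offsets[i++] = buf.size();
            byte[] encoded = word.getBytes(StandardCharsets.UTF_8);
            buf.write(encoded, 0, encoded.length);
        }
        offsets[i] = buf.size(); // sentinel so wordAt needs no special case for the last word
        content = buf.toByteArray();
    }

    // Decodes the byte range of the requested word; allocates a new String per call
    String wordAt(int index) {
        int start = offsets[index];
        return new String(content, start, offsets[index + 1] - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        PackedWordList list = new PackedWordList(Arrays.asList("store", "all", "the", "words"));
        System.out.println(list.wordAt(3));
    }
}
```

The per-word overhead stays at one int, and mostly-ASCII text takes one byte per character, at the price of decoding on every access.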
