How do I get a list of all tokens from a Lucene 8.6.1 index?

1 answer

Answer 1 · posted 2024-10-02 00:20:38

Some history

You asked: "I just want to know whether IndexReader.terms() was moved or replaced."

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This is documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released back in 2012.)

The method in AtomicReader in v4 takes a field name.

As the v4 release notes state:

One big difference is that field and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

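The key part there is "per term within a single field". From that point onward, there has been no single API call that retrieves all the terms in an index.

This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.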

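Recent releases

That leaves us with the ability to access lists of terms, but only on a per-field basis.

The following code is based on the latest release of Lucene (8.7.0), but it should also hold for the version you mention (8.6.1). The example uses Java: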

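import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Prints every term indexed for the given field, one segment (leaf) at a time.
// A term that occurs in several segments is printed once per segment.
private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        Terms terms = lrc.reader().terms(fieldName);
        if (terms != null) { // null when this segment has no terms for the field
            TermsEnum termsEnum = terms.iterator();

            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}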

The getTokensForField example assumes an index reader opened as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
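Because each segment (leaf) keeps its own term dictionary, walking the leaves one by one can report the same term more than once. The following is a minimal sketch, not part of the original answer, that gathers the distinct terms of one field into a sorted set. The class name FieldTerms and the method distinctTerms are hypothetical, and the sketch assumes the field holds text terms that decode as UTF-8.

```java
import java.io.IOException;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class FieldTerms {

    // Hypothetical helper: returns each distinct term of one field once,
    // even if the term occurs in several segments of the index.
    public static SortedSet<String> distinctTerms(IndexReader reader, String fieldName)
            throws IOException {
        SortedSet<String> result = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            Terms terms = lrc.reader().terms(fieldName);
            if (terms == null) {
                continue; // this segment has no indexed terms for the field
            }
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                // utf8ToString() creates a new String, so it is safe even
                // though the enum may reuse the BytesRef on the next call
                result.add(term.utf8ToString());
            }
        }
        return result;
    }
}
```

Calling FieldTerms.distinctTerms(reader, "body") on the reader opened above would return the set for a field named "body" (a made-up field name). For a very large index, holding every term in memory may not be practical, in which case streaming the terms as in getTokensForField is the safer choice.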

If you need to enumerate the field names, the code in this question may provide a starting point.
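As one possible starting point (this is not the code from the linked question), the sketch below reads the field metadata of every segment and keeps the names of fields that have an inverted index, since only those fields have terms to enumerate. The class name IndexedFieldNames and the method indexedFieldNames are hypothetical.

```java
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

public class IndexedFieldNames {

    // Hypothetical helper: collects the names of all fields that were
    // indexed (IndexOptions other than NONE) in at least one segment.
    public static SortedSet<String> indexedFieldNames(IndexReader reader) {
        SortedSet<String> names = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            for (FieldInfo info : lrc.reader().getFieldInfos()) {
                // Stored-only and doc-values-only fields have no terms
                if (info.getIndexOptions() != IndexOptions.NONE) {
                    names.add(info.name);
                }
            }
        }
        return names;
    }
}
```

Each returned name can then be passed as the fieldName argument of getTokensForField to cover every indexed field in turn.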

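A final note

I suppose you could also access terms on a per-document basis, instead of a per-field basis, as mentioned in the comments. I have not tried this.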
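For illustration only, one way this might look in Lucene 8.x is through term vectors. The sketch below is not from the original answer and assumes the field was indexed with term vectors enabled (for example via FieldType.setStoreTermVectors(true)); without them, getTermVector returns null. The class name DocumentTerms and the method termsForDocument are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class DocumentTerms {

    // Hypothetical helper: lists the distinct terms of one field in one
    // document, read from that document's term vector.
    public static List<String> termsForDocument(IndexReader reader, int docId, String fieldName)
            throws IOException {
        List<String> result = new ArrayList<>();
        Terms vector = reader.getTermVector(docId, fieldName);
        if (vector == null) {
            return result; // no term vector stored for this document and field
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            result.add(term.utf8ToString());
        }
        return result;
    }
}
```

Here docId is Lucene's internal document number (0 up to reader.maxDoc() - 1), not an application-level ID. If term vectors were not stored at index time, this route is unavailable without reindexing.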
