Java: custom Solr TokenFilter lemmatizer

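I'm trying to write a simple Solr lemmatizer for use in a field type, but I can't seem to find any information on writing a TokenFilter, so I'm a bit lost. Here is the code I have so far: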

import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FooFilter extends TokenFilter {

    private static final Logger log = LoggerFactory.getLogger(FooFilter.class);
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);

    public FooFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        // termAtt.buffer() can be longer than the term itself, so read only the valid characters
        List<String> allForms = Lemmatize.getAllForms(termAtt.toString());
        if (allForms.size() > 0) {
            for (String word : allForms) {
                // Now what?
            }
        }

        return true;
    }
}

1 answer

  1. # Answer 1

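    Next, you want to either replace the current token in termAtt with your word, or append your word as an additional token.

    Example of replace semantics: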

    termAtt.setEmpty();
    termAtt.copyBuffer(word.toCharArray(), 0, word.length());
    

    Example of semantics for adding new tokens:

    For each token you want to add, you must set the CharTermAttribute, and the incrementToken routine must return true.

    private List<String> extraTokens = ...
    @Override
    public boolean incrementToken() throws IOException {
      if (input.incrementToken()) {
        // ...
        return true;
      } else if (!extraTokens.isEmpty()) {
        // set the added token and return true
        // (CharTermAttribute has no setTerm; clear it and append instead)
        termAtt.setEmpty().append(extraTokens.remove(0));
        return true;
      }
      return false;
    }
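
The control flow behind "adding" tokens can be tried out without Lucene on the classpath. The following is a plain-Java model, not Lucene code: `SimpleStream`, `ListStream`, `LemmaFilter`, and the `forms` lookup map are all hypothetical stand-ins invented for illustration. In a real TokenFilter, the `term` and `posInc` fields would correspond to CharTermAttribute and PositionIncrementAttribute. Unlike the snippet above, this sketch assumes you want each extra form emitted right after its source token with a position increment of 0, so that both forms occupy the same position.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class LemmaStackingSketch {

    /** Hypothetical stand-in for a token stream: one token per successful call. */
    interface SimpleStream {
        boolean incrementToken();
        String term();
        int positionIncrement();
    }

    /** Emits the given words in order, each one position after the previous. */
    static final class ListStream implements SimpleStream {
        private final Iterator<String> words;
        private String current;

        ListStream(List<String> words) { this.words = words.iterator(); }

        public boolean incrementToken() {
            if (!words.hasNext()) return false;
            current = words.next();
            return true;
        }
        public String term() { return current; }
        public int positionIncrement() { return 1; }
    }

    /** Emits each input token, then its extra forms at the same position. */
    static final class LemmaFilter implements SimpleStream {
        private final SimpleStream input;
        private final Map<String, List<String>> forms; // hypothetical lemma lookup
        private final Deque<String> pending = new ArrayDeque<>();
        private String term;
        private int posInc;

        LemmaFilter(SimpleStream input, Map<String, List<String>> forms) {
            this.input = input;
            this.forms = forms;
        }

        public boolean incrementToken() {
            if (!pending.isEmpty()) {
                // Drain buffered forms before pulling more input.
                term = pending.poll();
                posInc = 0; // same position as the original token
                return true;
            }
            if (!input.incrementToken()) return false;
            term = input.term();
            posInc = input.positionIncrement();
            // Queue every form that differs from the token just emitted.
            for (String form : forms.getOrDefault(term, List.of())) {
                if (!form.equals(term)) pending.add(form);
            }
            return true;
        }
        public String term() { return term; }
        public int positionIncrement() { return posInc; }
    }

    /** Collects "term/positionIncrement" strings for inspection. */
    static List<String> run(SimpleStream stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.term() + "/" + stream.positionIncrement());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forms =
            Map.of("ran", List.of("run"), "mice", List.of("mouse"));
        SimpleStream stream =
            new LemmaFilter(new ListStream(List.of("mice", "ran", "away")), forms);
        System.out.println(run(stream)); // [mice/1, mouse/0, ran/1, run/0, away/1]
    }
}
```

When porting this shape to an actual Lucene filter, the pending queue would also need to be cleared in an overridden reset(), and captureState()/restoreState() can be used to carry the original token's other attributes (such as offsets) over to the extra forms.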