Elasticsearch: Retrieving a field and its normalized version

Published 2024-09-30 22:23:08

I would like to retrieve a field together with its normalized version from Elasticsearch.

This is my index definition and data:

PUT normalizersample
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "refresh_interval": "60s",
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "filter": [
            "lowercase",
            "german_normalization",
            "asciifolding"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "myField": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "store": true
          },
          "normalized": {
            "type": "keyword",
            "store": true,
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}

POST normalizersample/_doc/1
{
  "myField": ["Andreas", "Ämdreas", "Anders"]
}

My first approach was to use script fields, like this:

GET /normalizersample/_search
{
  "size": 100, 
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "keyword": {
      "script": "doc['myField.keyword']"
    },
    "normalized": {
      "script": "doc['myField.normalized']"
    }
  }
}

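However, since myField is an array, this returns two lists of strings per Elasticsearch document, and each list is sorted alphabetically. Because of the normalization, the entries at the same index in the two lists may therefore not correspond to each other: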

    "hits" : [
      {
        "_index" : "normalizersample",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "normalized" : [
            "amdreas",
            "anders",
            "andreas"
          ],
          "keyword" : [
            "Anders",
            "Andreas",
            "Ämdreas"
          ]
        }
      }
    ]
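
To make the mismatch concrete, here is a minimal Python sketch of what happens. It does not talk to Elasticsearch: the normalized strings are simply copied from the response above, and the per-field ordering is emulated with `sorted()`, which is an assumption that happens to reproduce the order shown for this sample.

```python
# Values as indexed in the sample document, in their original array order.
original = ["Andreas", "Ämdreas", "Anders"]
# Normalized forms copied from the response above, matched by hand to `original`.
normalized = ["andreas", "amdreas", "anders"]

# Each field's values come back sorted independently of the other field.
# sorted() (code point order) is used here to emulate that for this sample.
keyword_values = sorted(original)
normalized_values = sorted(normalized)

# Pairing by index after sorting no longer lines up with the source order.
misaligned = list(zip(keyword_values, normalized_values))
# This is the pairing I actually want.
intended = list(zip(original, normalized))

print(misaligned)
print(intended)
```

With this sample, none of the index-wise pairs in `misaligned` is a correct (value, normalized value) pair.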

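What I would like to retrieve instead is [(Andreas, andreas), (Ämdreas, amdreas), (Anders, anders)] or a similar format in which I can match each entry to its normalized form. The only way I have found is to call term vectors on both fields, since both include a position field, but this seems like a huge overhead to me (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html).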
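
For illustration, the pairing I am after could be sketched client-side as below, assuming the array is read from `_source` (which keeps the original order) and each value is normalized individually. `approx_normalize` and `pair_with_normalized` are hypothetical helpers, not part of any Elasticsearch client. The stand-in only lowercases and strips combining accents, so it does not reproduce everything `german_normalization` and `asciifolding` do (for example ß or ae/oe/ue handling). As far as I know, the `_analyze` API accepts a `normalizer` parameter and could supply the real normalized form in place of this stand-in, at the cost of extra requests.

```python
import unicodedata
from typing import Callable, List, Tuple


def approx_normalize(value: str) -> str:
    """Hypothetical local stand-in for my_normalizer: accent stripping plus lowercase."""
    # Decompose characters such as 'Ä' into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop the combining marks, keeping only the base letters.
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()


def pair_with_normalized(
    values: List[str],
    normalize: Callable[[str], str] = approx_normalize,
) -> List[Tuple[str, str]]:
    """Pair each value with its normalized form, preserving the input order."""
    return [(value, normalize(value)) for value in values]


# Array as it would appear in _source for the sample document.
source_values = ["Andreas", "Ämdreas", "Anders"]
pairs = pair_with_normalized(source_values)
print(pairs)
```

Because every value is normalized on its own, the pairs stay aligned regardless of how the stored fields are sorted. The drawback is that the result is only trustworthy if the `normalize` callable really matches the index normalizer.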

Is there a simpler way to retrieve tuples of each keyword value and its normalized value?

Thanks a lot

