Elasticsearch mtermvectors Python API query

Posted 2024-10-04 05:32:44


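I am working with Elasticsearch on an index that holds a large number of documents (about 500K). I want to store the n-grams of each document's text data in another index. The text is large too: each document contains roughly 2 pages of text. To do that, I compute the term vectors of each document together with their counts and store them in the new index, so that I can run aggregation queries against the new index.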
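As a minimal sketch of the "term vectors and their counts" step, the function below flattens an mtermvectors-style response into one record per document. The helper name `term_counts`, the record layout, and the sample response are assumptions made for illustration; the sample is hand-written in the documented response shape and is not real server output.

```python
from typing import Any, Dict, List


def term_counts(response: Dict[str, Any], field: str = "plain_text") -> List[Dict[str, Any]]:
    """Flatten an mtermvectors-style response into one record per document."""
    records = []
    for doc in response.get("docs", []):
        # Documents that were not found carry no "term_vectors" entry.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        records.append({
            "article_id": doc["_id"],
            "terms": [
                {"term": term, "count": info["term_freq"]}
                for term, info in terms.items()
            ],
        })
    return records


# Hand-written sample (hypothetical ids and terms), not real output.
sample = {
    "docs": [
        {
            "_id": "doc-1",
            "found": True,
            "term_vectors": {
                "plain_text": {
                    "terms": {
                        "quick brown": {"term_freq": 2},
                        "brown fox": {"term_freq": 1},
                    }
                }
            },
        },
        {"_id": "doc-2", "found": False},
    ]
}

records = term_counts(sample)
```

Each record could then be indexed into the new index as one document, which is one possible layout for the aggregation queries described above.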

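The settings of the old index allow me to call the termvectors and mtermvectors APIs. I do not want to send too many requests to the Elasticsearch server in a short time, so I am using the mtermvectors Python API. I am trying to fetch the term vectors of 25 documents by passing their 25 ids.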
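The batching described here can be sketched as follows. The helpers `batched` and `build_body` are hypothetical names, and the placeholder ids are made up; the body follows the `ids` plus `parameters` shape that the multi term vectors API accepts in a request body, with the same flags as in the question.

```python
from typing import Any, Dict, Iterator, List, Sequence


def batched(ids: Sequence[str], size: int = 25) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def build_body(ids: Sequence[str], field: str = "plain_text") -> Dict[str, Any]:
    """Build one mtermvectors request body for a chunk of ids."""
    return {
        "ids": list(ids),
        "parameters": {
            "fields": [field],
            "term_statistics": True,
            "field_statistics": True,
            "offsets": False,
            "positions": False,
            "payloads": False,
        },
    }


all_ids = [str(n) for n in range(60)]  # placeholder ids, not real document ids
bodies = [build_body(chunk) for chunk in batched(all_ids)]
```

Each body could then be passed to the Python client's `mtermvectors` method. The exact keyword arguments depend on the client version, so that call is left out of the sketch.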

Example HTTP URL produced by calling the mtermvectors API from Python:

http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
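To make the request easier to read, the query string of that URL can be decoded with the standard library. This is only an inspection aid; the variable names are illustrative.

```python
from urllib.parse import parse_qs

# Query string copied from the URL above (host and path omitted).
query = (
    "offsets=false&fields=plain_text"
    "&ids=608467%2C608469%2C608473%2C608475%2C608477"
    "%2C608482%2C608485%2C608492%2C608498%2C608504"
    "%2C608509%2C608511%2C608520%2C608522%2C608528"
    "%2C608530%2C608541%2C608549%2C608562%2C608570"
    "%2C608573%2C608576%2C608577%2C608579%2C608585"
    "&field_statistics=true&term_statistics=true&payloads=false&positions=false"
)

params = parse_qs(query)

# %2C is a URL-encoded comma, so the ids arrive as one comma-separated value.
ids = params["ids"][0].split(",")

# Every other parameter is a single-valued flag or field name.
flags = {key: values[0] for key, values in params.items() if key != "ids"}

print(len(ids), flags)
```

Decoding confirms that the request carries 25 ids and asks for both term statistics and field statistics on `plain_text`.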

Sometimes I get the expected response, but most of the time I get the following error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.

Reason: Error reading from remote server

Index settings and mapping

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token":""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
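To give a feel for what `shingleAnalyzer` stores per document, here is a rough pure-Python approximation of its letter tokenizer, lowercase, shingle (sizes 2 to 4) and length steps. It is a sketch under stated assumptions: it skips stop-word removal and stemming, so real analyzer output will differ, and the function name `approximate_shingles` is hypothetical.

```python
import re
from typing import List


def approximate_shingles(text: str, min_size: int = 2, max_size: int = 4) -> List[str]:
    """Approximate the analyzer: letter tokens, lowercased, plus word shingles."""
    # Letter tokenizer: split on anything that is not a letter, then lowercase.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    out = []
    for start in range(len(tokens)):
        # The shingle filter also emits single tokens (unigrams) by default.
        out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    # Length filter with min 2: drop tokens shorter than two characters.
    return [t for t in out if len(t) >= 2]


shingles = approximate_shingles("Quick brown fox")
```

Because every position yields up to four terms, the term vectors for two pages of text contain several times more entries than the text has words.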

I do not think anything is wrong with these settings and this mapping, because I sometimes get the expected response.

Please let me know if you need more information from me. Any help would be greatly appreciated.

