Elasticsearch script query using cosine similarity on a dense vector gives a "class_cast_exception" error

Published 2024-09-27 23:27:10


I am using Elasticsearch version 7.9.0 and executing this query:

curl -XGET 'https://somehost:9200/index_name/_search' -H 'Content-Type: application/json' -d '{
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, \u0027title_embed\u0027) + 1.0",
                "params": {
                    "query_vector": [-0.19277021288871765, 0.10494251549243927,.......]}
            }
        }
    }
}'

Note: query_vector is a 768-dimensional vector generated by BERT.
Note: \u0027 is the Unicode escape for a single quote.

I get this error in the response:

    "cosineSimilarity(params.query_vector, 'title_embed') + 1.0","                   
                   ^---- HERE"],"script":"cosineSimilarity(params.query_vector, 'title_embed') + 
1.0","lang":"painless","position":{"offset":38,"start":0,"end":58},"caused_by":
{"type":"class_cast_exception","reason":"class 
org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class 
org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues 
(org.elasticsearch.index.fielddata.ScriptDocValues$Doubles is in unnamed module of loader 'app'; 
org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues is in 
unnamed module of loader java.net.FactoryURLClassLoader @715fb77)"}}}]},"status":400}

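Although title_embed is declared with Elasticsearch's dense_vector type in the index mapping, the error says it is being read as doubles (ScriptDocValues$Doubles), and I don't know why.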

Here is the mapping:

"mappings": {
    "properties": {
        "description": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "domain": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "link": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "pub_date": {
            "type": "date"
        },
        "title": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "title_embed": {
            "type": "dense_vector",
            "dims": 768
        },
        "description_embed": {
            "type": "dense_vector",
            "dims": 768
        }
    }
}

When I try to execute this query from Python, I get the same error:

status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', "class_cast_exception: class org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues (org.elasticsearch.index.fielddata.ScriptDocValues$Doubles is in unnamed module of loader 'app'; org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues is in unnamed module of loader java.net.FactoryURLClassLoader @6d91790b)")

1 answer
User
Answer 1 · posted 2024-09-27 23:27:10

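If possible, check whether the number of values in the query vector equals the number of dimensions declared in the mapping, i.e.

dims: 768

Does "query_vector" contain exactly 768 values?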
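A quick way to run this check before sending the request is to compare the vector length with the mapped dims on the client side. This is only a minimal sketch; the helper name check_vector_dims is hypothetical and not part of any Elasticsearch client.

```python
def check_vector_dims(query_vector, dims):
    """Return True when query_vector has exactly `dims` numeric entries.

    Hypothetical helper: `dims` should be the value declared for the
    dense_vector field in the index mapping (768 in the question).
    """
    if len(query_vector) != dims:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in query_vector
    )


# Illustrative data only: a full vector and one with a missing value.
full_vector = [0.1] * 768
short_vector = [0.1] * 767

print(check_vector_dims(full_vector, 768))
print(check_vector_dims(short_vector, 768))
```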

I suggest checking the mapping again by running the following command to see whether it is what you expect:

GET index_name/_mapping
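The mapping response can also be inspected programmatically. The sketch below assumes the response has the typeless 7.x shape, {index: {"mappings": {"properties": ...}}}; the helper name and the sample response are hypothetical. It reports the type of one field in every index returned, which might help spot an index where title_embed ended up mapped as something other than dense_vector (the Doubles class in the error would be consistent with a numeric mapping, but that is only a guess).

```python
def field_type_per_index(mapping_response, field):
    """Return {index_name: (type, dims)} for `field` in each index.

    Hypothetical helper; `mapping_response` is assumed to be the parsed
    JSON body of GET index_name/_mapping.
    """
    result = {}
    for index_name, index_body in mapping_response.items():
        properties = index_body.get("mappings", {}).get("properties", {})
        spec = properties.get(field, {})
        # Missing fields show up as (None, None).
        result[index_name] = (spec.get("type"), spec.get("dims"))
    return result


# Hypothetical response, trimmed to two fields from the question's mapping.
sample_response = {
    "index_name": {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "title_embed": {"type": "dense_vector", "dims": 768},
            }
        }
    }
}

print(field_type_per_index(sample_response, "title_embed"))
```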

Also, you may have left out a value when passing "query_vector".

I ran a local test, but with 3-dimensional vectors.

In that test, title_embed is mapped with dims set to 3 and type "dense_vector".

I indexed some data under that mapping, as follows:

POST /index_name/_doc
{
  "title_embed": [10.01,15,15]
}

I then tried to reproduce your query with the lower vector dimension described above:

{
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'title_embed') + 1.0",
                "params": {
                    "query_vector": [-0.19277021288871765, 0.10494251549243927, 12.202022]
                }
            }
        }
    }
}
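For a small test like this, the score the script is expected to return can be reproduced outside Elasticsearch. The sketch below uses plain Python with the 3-dimensional document and query vectors from the local test; the function name is hypothetical, and Elasticsearch stores dense_vector values as floats, so its result may differ slightly in the last digits.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors taken from the local test in this answer.
doc_vector = [10.01, 15, 15]
query_vector = [-0.19277021288871765, 0.10494251549243927, 12.202022]

# Same expression as the script: cosineSimilarity(...) + 1.0,
# which keeps the score non-negative.
score = cosine_similarity(query_vector, doc_vector) + 1.0
print(score)
```

A length mismatch raises an error here as well, which mirrors the dimension check suggested earlier.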

Note: as Tom Elias mentioned, passing doc['title_embed'] to cosineSimilarity works, but that form is deprecated in version 7.9.0.
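Since the question also runs the query from Python, here is a minimal sketch of assembling the same request body there, using the field-name string form rather than the deprecated doc[...] form. The function name build_cosine_query is hypothetical; the body simply mirrors the JSON used in this thread.

```python
import json


def build_cosine_query(field, query_vector, size=10):
    """Build a script_score request body for cosine similarity on `field`.

    Hypothetical helper; the field is passed to cosineSimilarity as a
    quoted name, not as doc['field'].
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


body = build_cosine_query(
    "title_embed",
    [-0.19277021288871765, 0.10494251549243927, 12.202022],
)
print(json.dumps(body, indent=2))
```

Assuming the 7.x elasticsearch Python client, a body like this would typically be sent with es.search(index="index_name", body=body); that call is not shown in the question, so treat it as an assumption.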

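One small suggestion: try a lower vector dimension, both in the mapping and in the data you index. For example, if the number of dimensions is 5, check that "dims" in the mapping is 5, and that both the indexed vectors and "query_vector" contain exactly 5 values: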

"query_vector": [12,-1020.02000,10,-5.0000,2]

If that still does not work, my guess is that there may be an internal limit on the number of dimensions allowed.

Useful links: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-script-score-query.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
