Updating an Elasticsearch index using complex conditions

Posted 2024-10-02 14:19:06


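I am working with data from the 2017 UK general election. I have it both as a CSV file and in an Elasticsearch index. Here is a sample from the Elasticsearch index for the Chichester constituency: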

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 8.03183,
    "hits" : [
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "eCtGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "EMERSON",
          "first_name" : "Andrew",
          "party" : "Patria",
          "Party Identifer" : "Patria",
          "votes" : "84"
        }
      },
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "eStGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "MONCREIFF",
          "first_name" : "Andrew Malcolm",
          "party" : "UK Independence Party (UKIP)",
          "Party Identifer" : "UKIP",
          "votes" : "1650"
        }
      },
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "eitGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "BARRIE",
          "first_name" : "Heather Margaret",
          "party" : "Green Party",
          "Party Identifer" : "Green Party",
          "votes" : "1992"
        }
      },
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "eytGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "BROWN",
          "first_name" : "Jonathan",
          "party" : "Liberal Democrats",
          "Party Identifer" : "Liberal Democrats",
          "votes" : "6749"
        }
      },
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "fCtGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "FARWELL",
          "first_name" : "Mark Andrew",
          "party" : "Labour Party",
          "Party Identifer" : "Labour",
          "votes" : "13411"
        }
      },
      {
        "_index" : "ge",
        "_type" : "_doc",
        "_id" : "fStGCG4BaIAfLxq_V2By",
        "_score" : 8.03183,
        "_source" : {
          "code" : "E14000633",
          "PANO" : "145",
          "constituency" : "Chichester",
          "last_name" : "KEEGAN",
          "first_name" : "Gillian",
          "party" : "The Conservative Party Candidate",
          "Party Identifer" : "Conservative",
          "votes" : "36032"
        }
      }
    ]
  }
}

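I want to create a new "column" called "rank", then go through each distinct constituency and give every candidate the appropriate number based on votes. In the example above, the Conservative candidate would be ranked 1, the Labour candidate 2, and so on.

The number of candidates is not the same in every constituency.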
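The ranking described above can be sketched in plain Python once the rows are in memory, whether they come from the CSV or from search hits. This is a minimal sketch, not a definitive implementation: the function name `add_ranks` is hypothetical, the rows are assumed to be dicts using the field names from the sample, and tied candidates simply receive consecutive ranks in input order.

```python
from collections import defaultdict


def add_ranks(rows):
    """Return copies of the rows with a 'rank' key (1 = most votes in the constituency)."""
    by_constituency = defaultdict(list)
    for row in rows:
        by_constituency[row["constituency"]].append(row)

    ranked = []
    # Each constituency is handled on its own, so differing candidate counts are fine.
    for candidates in by_constituency.values():
        # 'votes' is stored as a string in the sample, so convert before comparing.
        ordered = sorted(candidates, key=lambda r: int(r["votes"]), reverse=True)
        for position, row in enumerate(ordered, start=1):
            ranked.append({**row, "rank": position})
    return ranked


# Rows taken from the Chichester sample, reduced to the fields needed here.
sample = [
    {"constituency": "Chichester", "last_name": "EMERSON", "votes": "84"},
    {"constituency": "Chichester", "last_name": "MONCREIFF", "votes": "1650"},
    {"constituency": "Chichester", "last_name": "BARRIE", "votes": "1992"},
    {"constituency": "Chichester", "last_name": "BROWN", "votes": "6749"},
    {"constituency": "Chichester", "last_name": "FARWELL", "votes": "13411"},
    {"constituency": "Chichester", "last_name": "KEEGAN", "votes": "36032"},
]

for row in add_ranks(sample):
    print(row["rank"], row["last_name"], row["votes"])
```

Grouping first and sorting inside each group keeps the logic independent of how many candidates stood in a given constituency.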

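Some of the end goals are: 1) count the number of seats won by each party; 2) select the constituencies where the majority is smallest and sort them; 3) write an algorithm that indicates what a tactical voter should choose (depending, of course, on the result you want).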
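Goals 1 and 2 follow from the same per-constituency ordering. The sketch below is one possible way to compute them in Python; the function names are hypothetical, the majority is assumed to mean the winner's votes minus the runner-up's, and the party is read from the `Party Identifer` field exactly as it is spelled in the index. Goal 3 is not covered.

```python
from collections import Counter, defaultdict


def constituency_results(rows):
    """Map each constituency to (winning row, majority over the runner-up)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["constituency"]].append(row)

    results = {}
    for name, candidates in grouped.items():
        ordered = sorted(candidates, key=lambda r: int(r["votes"]), reverse=True)
        winner = ordered[0]
        # An unopposed candidate is treated as having a majority equal to their votes.
        runner_up_votes = int(ordered[1]["votes"]) if len(ordered) > 1 else 0
        results[name] = (winner, int(winner["votes"]) - runner_up_votes)
    return results


def seats_by_party(results):
    """Goal 1: number of constituencies won by each party."""
    return Counter(winner["Party Identifer"] for winner, _ in results.values())


def smallest_majorities(results, n=10):
    """Goal 2: the n constituencies with the smallest majority, smallest first."""
    pairs = [(majority, name) for name, (_, majority) in results.items()]
    return sorted(pairs)[:n]


# The two leading Chichester candidates from the sample above.
sample = [
    {"constituency": "Chichester", "Party Identifer": "Labour", "votes": "13411"},
    {"constituency": "Chichester", "Party Identifer": "Conservative", "votes": "36032"},
]

results = constituency_results(sample)
print(seats_by_party(results))
print(smallest_majorities(results))
```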

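I don't know how to go about this (other than updating the original spreadsheet by hand).

Should I send cURL commands to the cluster programmatically? Or process the CSV file with a Python script?

Could someone suggest the best approach and provide a code example?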
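If the ranks are written back to the index rather than to the CSV, one option is the `_bulk` API, which accepts newline-delimited JSON: an action line followed by a partial document for each update. The sketch below only builds that request body from search hits; the function name `bulk_rank_updates` is hypothetical, the hits are assumed to belong to a single constituency, and sending the body (a POST to `/_bulk` with `Content-Type: application/x-ndjson`) is left out.

```python
import json


def bulk_rank_updates(hits, index="ge"):
    """Build an NDJSON _bulk body that sets 'rank' on each hit of one constituency."""
    # Highest vote count first; 'votes' is a string in the sample documents.
    ordered = sorted(hits, key=lambda h: int(h["_source"]["votes"]), reverse=True)
    lines = []
    for rank, hit in enumerate(ordered, start=1):
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": hit["_id"]}}))
        # Partial document: only the new field is sent.
        lines.append(json.dumps({"doc": {"rank": rank}}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Two hits from the Chichester sample, reduced to the fields needed here.
hits = [
    {"_id": "fCtGCG4BaIAfLxq_V2By", "_source": {"votes": "13411"}},
    {"_id": "fStGCG4BaIAfLxq_V2By", "_source": {"votes": "36032"}},
]

print(bulk_rank_updates(hits), end="")
```

Batching the updates this way avoids one HTTP request per candidate, which matters once every constituency is processed.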

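My first thought was to sort the returned documents for each distinct constituency, loop over them using the total hit count, and update the rank field accordingly. This is what I tried: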

curl -X POST "localhost:9200/ge/_search?pretty" -H 'Content-Type: application/json' -d'
{
   "query" : {
      "term" : { "Constituency" : "Aldershot" }
   },
   "sort" : [
      {"votes.keyword" : {"order" : "desc"}}
   ]
}'

This returns an empty result set, so I'm stuck. Any help is appreciated.
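Two details of that request may explain the empty result, assuming the index was created with Elasticsearch's default dynamic mapping (the use of `votes.keyword` suggests it was). Field names are case-sensitive, and the sample documents use `constituency`, not `Constituency`. A `term` query is also not analysed, so an exact match on a string is normally run against the `.keyword` subfield. Separately, `votes` is stored as a string, so sorting on `votes.keyword` orders the values as text rather than as numbers. The sketch below builds an adjusted request body in Python and shows the difference in ordering; the function name and the `size` value are illustrative.

```python
import json


def constituency_query(name, size=50):
    """Request body matching one constituency exactly (assumes a .keyword subfield)."""
    return {
        "size": size,  # the default page size is 10; 50 is an arbitrary larger value
        "query": {"term": {"constituency.keyword": name}},
    }


print(json.dumps(constituency_query("Aldershot"), indent=2))

# Vote counts from the Chichester sample, as stored in the index (strings).
votes = ["84", "1650", "1992", "6749", "13411", "36032"]

as_strings = sorted(votes, reverse=True)           # text order, like a keyword sort
as_numbers = sorted(votes, key=int, reverse=True)  # numeric order, needed for ranking

print(as_strings)
print(as_numbers)
```

Sorting on the client after converting to `int`, or reindexing `votes` with a numeric mapping, would both give the order the ranking needs.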

