Matching text from a list against a JSON attribute with spaCy

Posted 2024-06-23 19:00:59


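I am trying to match text stored in a list against an attribute in a JSON file. So far I have only managed a 1:1 match, meaning the text in the JSON and the text in the list must be exactly identical, which is neither desirable nor useful.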

For example, messages.json:

[
  {
    "id": "1",
    "task_id": "1",
    "team": "Top",
    "message": "Failure indicated something else [gdfgdfgg]"
  },
  {
    "id": "2",
    "task_id": "2",
    "team": "Ten",
    "message": "Internal server error 500 something else [dasdasdasdasdas]"
  }
]

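So, given this JSON, I only want to match the message attribute against "Failure indicated" and "Internal server error 500", ignoring whatever text follows. There are many such messages, so replacing them all one by one is not an option.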

What I have tried so far:

import json
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

message_list = ['Failure indicated','Internal server error 500']

def matching_data(data):
    nlp = English()
    extract_data: list = [msg["message"] for msg in data]

    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns = [nlp.make_doc(msg) for msg in extract_data]
    matcher.add("Messages", None, *patterns)

    match_check = any([item in extract_data for item in message_list])
    if not match_check:
        print("No matches found")
    else:
        for msg in message_list:
            doc = nlp(msg)
            for match_id, start, end in matcher(doc):
                print("Message matched based on lowercase token text:", doc[start:end])

matching_data(json.loads(open("messages.json").read()))


1 answer
Anonymous user
Reply #1 · posted 2024-06-23 19:00:59

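Have a look at the spaCy documentation on how to add patterns to the PhraseMatcher. The patterns and the searched text are swapped in your attempt: first add the phrases from message_list to the phrase matcher, then search for those patterns in the list of messages extracted from the JSON file.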

import json
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

with open('messages.json') as f:
    data = json.load(f)
    extract_data = [msg["message"] for msg in data]

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
message_list = ['Failure indicated', 'Internal server error 500']
# Add the multi-token phrases that you want to find to the PhraseMatcher
patterns = [nlp.make_doc(text) for text in message_list]
# spaCy v3 expects the patterns as a list in the second argument
# (spaCy v2 used matcher.add("MessageList", None, *patterns))
matcher.add("MessageList", patterns)

for msg in extract_data:
    doc = nlp(msg)
    matches = matcher(doc)
    for match_id, start, end in matches:
        print("Matched based on lowercase token text:", doc[start:end])
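
If token-level matching is not needed and the phrases always sit at the start of the message, as they do in the sample messages.json, a plain-Python prefix check may be enough. The following is only a minimal sketch under that assumption and is not the PhraseMatcher approach above; find_prefix_matches is a hypothetical helper name, and the sample records are inlined so it runs without the file.

```python
import json

# Phrases from the question; any text after them in a message is ignored.
message_list = ["Failure indicated", "Internal server error 500"]


def find_prefix_matches(records, phrases):
    """Return (record id, matched phrase) pairs for messages starting with a phrase.

    Hypothetical helper, not part of spaCy: a case-insensitive prefix check.
    """
    results = []
    for record in records:
        text = record["message"].lower()
        for phrase in phrases:
            if text.startswith(phrase.lower()):
                results.append((record["id"], phrase))
                break  # stop at the first phrase that matches this message
    return results


# Inline sample mirroring messages.json from the question.
sample = json.loads("""
[
  {"id": "1", "task_id": "1", "team": "Top",
   "message": "Failure indicated something else [gdfgdfgg]"},
  {"id": "2", "task_id": "2", "team": "Ten",
   "message": "Internal server error 500 something else [dasdasdasdasdas]"}
]
""")

matches = find_prefix_matches(sample, message_list)
for record_id, phrase in matches:
    print(record_id, "->", phrase)
```

Unlike PhraseMatcher, startswith ignores token boundaries and only looks at the beginning of the message, so the spaCy version remains the better fit when the phrase can appear anywhere in the text.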
