Detecting sentences in a JSON file and extracting related entities

Posted 2024-09-29 21:39:53


I have a dataset that represents one sentence of parsed text, as shown below:

[{
    "address": 1,
    "ctag": "Ne",
    "feats": "_",
    "head": 6,
    "lemma": "Ashraf",
    "rel": "SBJ",
    "tag": "Ne",
    "word": "Ashraf"
}, {
    "address": 2,
    "ctag": "AJ",
    "feats": "_",
    "head": 1,
    "lemma": "Ghani",
    "rel": "NPOSTMOD",
    "tag": "AJ",
    "word": "Ghani"
}, {
    "address": 3,
    "ctag": "P",
    "feats": "_",
    "head": 6,
    "lemma": "in",
    "rel": "ADV",
    "tag": "P",
    "word": "in"
}, {
    "address": 4,
    "ctag": "N",
    "feats": "_",
    "head": 3,
    "lemma": "Kabul",
    "rel": "POSDEP",
    "tag": "N",
    "word": "Kabul"
}, {
    "address": 5,
    "ctag": "N",
    "feats": "_",
    "head": 6,
    "lemma": "born",
    "rel": "NVE",
    "tag": "N",
    "word": "born"
}, {
    "address": 6,
    "ctag": "V",
    "feats": "_",
    "head": 0,
    "lemma": "شدشو",
    "rel": "ROOT",
    "tag": "V",
    "word": "شده_است"
}, {
    "address": 7,
    "ctag": "PUNC",
    "feats": "_",
    "head": 6,
    "lemma": ".",
    "rel": "PUNC",
    "tag": "PUNC",
    "word": "."
}]

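At "address": 7, "ctag": "PUNC" marks the end of the sentence. My original dataset contains several sentences. First, I want to detect where each sentence ends, using the PUNC token ".". Then, within each sentence, starting with the first, I want to extract two or three specific entities: for example a token with 'ctag' = 'Ne', then a following word with 'ctag' = 'N', where the relation between these two entities is given by the token with 'rel' = 'NVE'. The result should be stored in a list.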
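The sentence-detection step described above could be sketched roughly as follows. This is only a minimal sketch: `split_sentences` is a hypothetical helper name, and it assumes that a sentence ends at every token whose `ctag` is `PUNC` and whose `word` is `.` (other punctuation, such as commas, is assumed not to end a sentence).

```python
def split_sentences(tokens):
    """Group a flat list of token dicts into sentences.

    Assumption: a token with ctag == "PUNC" and word == "." closes a sentence.
    """
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.get("ctag") == "PUNC" and tok.get("word") == ".":
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing "." still form a sentence
        sentences.append(current)
    return sentences


# Tiny illustrative input (only the fields this function needs)
tokens = [
    {"ctag": "N", "word": "Kabul"},
    {"ctag": "PUNC", "word": "."},
    {"ctag": "Ne", "word": "Ashraf"},
    {"ctag": "PUNC", "word": "."},
]
for number, sentence in enumerate(split_sentences(tokens), start=1):
    print(number, [tok["word"] for tok in sentence])
```

Each returned sentence is a plain list of the original token dicts, so the per-sentence checks on `ctag` and `rel` can run on one sentence at a time instead of on the whole file.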

What I have done:

import json

# Read the parsed tokens (a JSON list of dicts)
with open('../data/parse.txt', 'r', encoding='utf-8') as myfile:
    obj = json.load(myfile)

# One list per (rel, ctag) combination of interest
n1, n2, n3, n4, n5, n6, n7 = [], [], [], [], [], [], []

for w in obj:
    if w['ctag'] == 'Ne' and w['rel'] == 'SBJ':
        n1.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'SBJ':
        n6.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'MOZ':
        n2.append(w['word'])
    if w['rel'] == 'NVE' and w['ctag'] == 'N':
        n3.append(w['word'])
    if w['rel'] == 'MOZ' and w['ctag'] == 'Ne':
        n4.append(w['word'])
    if w['rel'] == 'MOS' and w['ctag'] == 'Ne':
        n5.append(w['word'])
    if w['rel'] == 'OBJ' and w['ctag'] == 'N':
        n7.append(w['word'])

This means that across 27 addresses I found the following entities:

rel=SBJ & Ne: ['Ashraf']
rel=MOZ & Ne ['President', 'Capital', 'Lecturer', 'University']
rel=MOS & Ne ['Ashraf', 'Kabul', 'Ahmad']
rel=MOZ & N ['Afghanistan', 'Afghanistan', 'Kabul']
rel=NVE & N ['born']
rel=SBJ & N ['Kabul']
rel=OBJ & N ['Located']

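What I want:
- It should find the first "." "PUNC", and then check the entities within that first sentence. So far I have if word['ctag'] == 'Ne' and word['rel'] == 'MOZ': to find the relation entities; the rest are named entities such as subject & object.
- Then it should move on to the next "PUNC" and retrieve the Ne and rel values for that sentence.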

---> The output I expect for each sentence: (e1, relation, e2) --> (Kabul, located, Afghanistan)
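A per-sentence extraction producing a triple of this shape might look like the sketch below. The question does not pin down which `rel` values identify each part, so the role mapping here is an assumption made for illustration: `SBJ` marks e1, `NVE` or `OBJ` marks the relation word, and `MOZ` or `POSDEP` marks e2. The names `first_match` and `extract_triple` are hypothetical.

```python
# Assumed role mapping (not confirmed by the question); adjust to your tag set.
SUBJECT_RELS = {"SBJ"}
RELATION_RELS = {"NVE", "OBJ"}
OBJECT_RELS = {"MOZ", "POSDEP"}
ENTITY_TAGS = {"N", "Ne"}


def first_match(sentence, rels, skip=()):
    """Return the first N/Ne token whose 'rel' is in rels, or None."""
    for tok in sentence:
        if (tok["ctag"] in ENTITY_TAGS and tok["rel"] in rels
                and tok["address"] not in skip):
            return tok
    return None


def extract_triple(sentence):
    """Build (e1, relation, e2) for one sentence, or None if a part is missing."""
    e1 = first_match(sentence, SUBJECT_RELS)
    relation = first_match(sentence, RELATION_RELS)
    if e1 is None or relation is None:
        return None
    used = {e1["address"], relation["address"]}
    e2 = first_match(sentence, OBJECT_RELS, skip=used)
    if e2 is None:
        return None
    return (e1["word"], relation["word"], e2["word"])


# The sample sentence from the question, reduced to the fields used here
sentence = [
    {"address": 1, "ctag": "Ne", "rel": "SBJ", "word": "Ashraf"},
    {"address": 2, "ctag": "AJ", "rel": "NPOSTMOD", "word": "Ghani"},
    {"address": 3, "ctag": "P", "rel": "ADV", "word": "in"},
    {"address": 4, "ctag": "N", "rel": "POSDEP", "word": "Kabul"},
    {"address": 5, "ctag": "N", "rel": "NVE", "word": "born"},
    {"address": 6, "ctag": "V", "rel": "ROOT", "word": "شده_است"},
    {"address": 7, "ctag": "PUNC", "rel": "PUNC", "word": "."},
]
print(extract_triple(sentence))  # ('Ashraf', 'born', 'Kabul')
```

Calling `extract_triple` once per sentence (each sentence being the tokens up to and including a "." PUNC token) and appending the non-`None` results to a list would give one triple per sentence. Taking only the first match per role is a simplification; sentences with several `MOZ` tokens would need a rule based on the `head` field to pick the right one.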

