How to parse a Lisp-readable property list file in Python

Posted 2024-09-29 02:20:20


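I'm building an NLP application in Python and am trying to parse an English verb lexicon so that I can combine it with my NLTK script. The lexicon is a Lisp-readable property list file, but I need a simpler format, such as a JSON file or a data frame.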

Here is a sample from the lexicon database:

;; Grid: 51.2#1#_th,src#abandon#abandon#abandon#abandon+ingly#(1.5,01269572,01188040,01269413,00345378)(1.6,01524319,01421290,01524047,00415625)###AD

(
 :DEF_WORD "abandon"
 :CLASS "51.2"
 :WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
            ("1.6" 01524319 01421290 01524047 00415625))
 :PROPBANK ("arg1 arg2")
 :THETA_ROLES ((1 "_th,src"))
 :LCS (go loc (* thing 2)
          (away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
          (abandon+ingly 26))
 :VAR_SPEC ((4 :optional) (2 (animate +)))
)

;; Grid: 45.4.a#1#_ag_th,instr(with)#abase#abase#abase#abase+ed#(1.5,01024949)(1.6,01228249)###AD

(
 :DEF_WORD "abase"
 :CLASS "45.4.a"
 :WN_SENSE (("1.5" 01024949)
            ("1.6" 01228249))
 :PROPBANK ("arg0 arg1 arg2(with)")
 :THETA_ROLES ((1 "_ag_th,instr(with)"))
 :LCS (cause (* thing 1)
       (go ident (* thing 2)
           (toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
       ((* with 19) instr (*head*) (thing 20)))
 :VAR_SPEC ((1 (animate +)))
)

The full data is available here: https://raw.githubusercontent.com/ihmc/LCS/master/verbs-English.lcs

I tried the idea from the post "Parsing a lisp file with Python", using code like the following, but the result is not in the format I'm looking for:

inputdata = '''
(
 :DEF_WORD "abandon"
 :CLASS "51.2"
 :WN_SENSE (("1.5" 01269572 01188040 01269413 00345378)
            ("1.6" 01524319 01421290 01524047 00415625))
 :PROPBANK ("arg1 arg2")
 :THETA_ROLES ((1 "_th,src"))
 :LCS (go loc (* thing 2)
          (away_from loc (thing 2) (at loc (thing 2) (* thing 4)))
          (abandon+ingly 26))
 :VAR_SPEC ((4 :optional) (2 (animate +)))
)


(
 :DEF_WORD "abase"
 :CLASS "45.4.a"
 :WN_SENSE (("1.5" 01024949)
            ("1.6" 01228249))
 :PROPBANK ("arg0 arg1 arg2(with)")
 :THETA_ROLES ((1 "_ag_th,instr(with)"))
 :LCS (cause (* thing 1)
       (go ident (* thing 2)
           (toward ident (thing 2) (at ident (thing 2) (abase+ed 9))))
       ((* with 19) instr (*head*) (thing 20)))
 :VAR_SPEC ((1 (animate +)))
)'''

from pyparsing import OneOrMore, nestedExpr

data = OneOrMore(nestedExpr()).parseString(inputdata)
print(data)

This is the output I get:

[
  [ ':DEF_WORD', '"abandon"', 
    ':CLASS', '"51.2"', 
    ':WN_SENSE', [
                    ['"1.5"', '01269572', '01188040', '01269413', '00345378'], 
                    ['"1.6"', '01524319', '01421290', '01524047', '00415625']
                 ],
    ':PROPBANK', ['"arg1 arg2"'],
    ':THETA_ROLES', [['1', '"_th,src"']],
    ':LCS', ['go', 'loc', ['*', 'thing', '2'], 
          ['away_from', 'loc', ['thing', '2'], 
          ['at', 'loc', ['thing', '2'], ['*', 'thing', '4']]], ['abandon+ingly', '26']],
    ':VAR_SPEC', [['4', ':optional'], ['2', ['animate', '+']]]]
  ,     
  [':DEF_WORD', '"abase"', 
    ':CLASS', '"45.4.a"', 
    ':WN_SENSE', [
                    ['"1.5"', '01024949'],
                    ['"1.6"', '01228249']
                ], 
    ':PROPBANK', ['"arg0 arg1 arg2(with)"'], 
    ':THETA_ROLES', [['1', '"_ag_th,instr(with)"']],
    ':LCS', ['cause', ['*', 'thing', '1'], 
              ['go', 'ident', ['*', 'thing', '2'], 
              ['toward', 'ident', ['thing', '2'], 
              ['at', 'ident', ['thing', '2'],
              ['abase+ed', '9']]]],
              [['*', 'with', '19'], 'instr', ['*head*'], ['thing', '20']]], 
    ':VAR_SPEC', [['1', ['animate', '+']]]
  ]
]

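I don't know how to work with this output format to get, for example, the THETA_ROLES value or the other verb features in this lexicon. I've arranged all my sentences in an array with pandas and NLTK, so the idea is to find the sentences containing a verb that has a particular THETA_ROLES value or other feature in this lexicon.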


1 answer
User
Answer 1 · posted 2024-09-29 02:20:20

The data you get back is a flat sequence of alternating keys and values. That is, you have something of the form ["A", 1, "B", 2], but you want a dict like {"A": 1, "B": 2}.

Here is a generator that yields the flat sequence as a sequence of pairs:

def pairs(seq):
    for x, y in zip(seq[::2], seq[1::2]):
        yield (x, y)

print(dict(pairs(["A", 1, "B", 2])))

Use it to convert each parsed group into a Python dict; you can then easily pull out individual pieces by name:

for group in data:
    groupdict = dict(pairs(group))
    print(groupdict[":THETA_ROLES"])
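Since the question asks for JSON, the same idea can be taken one step further. What follows is only a minimal sketch, not a definitive implementation. To stay runnable with the standard library alone, it works on a hand-copied, abridged version of the parsed output shown in the question rather than on the real pyparsing result. The names `clean`, `to_record`, and `lexicon` are hypothetical helpers introduced for illustration. It strips the leading colon from the keys and the literal double quotes the parser leaves on strings, then groups the records by verb.

```python
import json

# Hand-copied and abridged from the parsed output shown in the question;
# with pyparsing you would iterate over `data` instead.
parsed = [
    [":DEF_WORD", '"abandon"',
     ":CLASS", '"51.2"',
     ":WN_SENSE", [['"1.5"', "01269572", "01188040"], ['"1.6"', "01524319"]],
     ":THETA_ROLES", [["1", '"_th,src"']],
     ":VAR_SPEC", [["4", ":optional"], ["2", ["animate", "+"]]]],
    [":DEF_WORD", '"abase"',
     ":CLASS", '"45.4.a"',
     ":THETA_ROLES", [["1", '"_ag_th,instr(with)"']],
     ":VAR_SPEC", [["1", ["animate", "+"]]]],
]


def pairs(seq):
    """Yield (key, value) tuples from a flat [k1, v1, k2, v2, ...] sequence."""
    for key, value in zip(seq[::2], seq[1::2]):
        yield key, value


def clean(value):
    """Recursively remove the surrounding double quotes kept on string tokens."""
    if isinstance(value, str):
        return value.strip('"')
    return [clean(item) for item in value]


def to_record(group):
    """Turn one parsed entry into a dict with keys such as 'THETA_ROLES'."""
    return {key.lstrip(":"): clean(value) for key, value in pairs(group)}


# Group records by verb. A list is used per verb on the assumption that
# the same verb may appear in more than one entry of the full file.
lexicon = {}
for group in parsed:
    record = to_record(group)
    lexicon.setdefault(record["DEF_WORD"], []).append(record)

print(lexicon["abandon"][0]["THETA_ROLES"])
print(json.dumps(lexicon["abase"], indent=2))
```

Because every value ends up as a plain str, list, or dict, the whole `lexicon` can be written out with `json.dump`. Numbers such as the WordNet offsets are deliberately left as strings so that their leading zeros survive.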
