<p>我倾向于让它尽可能的简单,不要尝试重新设计当前不允许使用的解析模块。比如:</p>
<pre><code>string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''
def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''
return character.isalpha() or character in '-=$!?.'
def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''
tokens = []
while characters:
character = characters.pop(0)
if character.isspace():
pass # nothing to do, ignore it
elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)
elif character == ')': # signals end of recursive analysis (pop)
break
elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''
while is_symbol_char(character):
symbol += character
character = characters.pop(0)
# push unused non-symbol character back onto characters
characters.insert(0, character)
tokens.append(symbol)
# Return whatever tokens we collected and any characters left over
return characters, tokens
def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''
head, *tail = tokens
print(head, ' >', *[x[0] if isinstance(x, list) else x for x in tail])
for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)
characters, tokens = tokenize(list(string))
# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"
print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
extract_rules(tokens[0])
</code></pre>
<p><strong>输出</strong></p>
^{pr2}$
<p><strong>注意</strong></p>
<p>我把你原来的树改成这样:</p>
<pre><code>(NP ((DT a) (NN talk)))
</code></pre>
<p>似乎不正确,因为它在web上可用的语法树图示器上生成了一个空节点,因此我将其简化为:</p>
<pre><code>(NP (DT a) (NN talk))
</code></pre>
<p>根据需要进行调整。在</p>