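I have an NLTK parse tree, and I want to split the tree's leaves based only on the "S" label. Note that the leaves of the S subtrees should not overlap.
He won the Gusher Marathon, finishing in 30 minutes.
The tree from CoreNLP is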
tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''
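The idea is to extract the two "S" subtrees and their leaves, but without any overlap between them. So the expected result should be "He won the Gusher Marathon" and "finishing in 30 minutes".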
from nltk.tree import Tree

# Tree manipulation
# Extract phrases from a parsed (chunked) tree
# phrase = label of the sub-trees to extract
# Returns: list of deep copies; recursive
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        # Leaves are plain strings; only recurse into sub-trees
        if isinstance(child, Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if len(list_of_phrases) > 0:
                myPhrases.extend(list_of_phrases)
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break
subtexts = list(subtexts)
print(subtexts)
The result I get is
['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']
I don't want to manipulate this at the string level but at the tree level, so the expected output would be:
["He won the Gusher Marathon ,.", "finishing in 30 minutes."]
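One possible way to get non-overlapping leaves at the tree level is to rebuild each "S" subtree while dropping any nested "S" below it, then read the leaves from the pruned copy. The following is only a minimal sketch under that assumption; the helper names `prune_nested` and `split_by_label` are made up for illustration and are not part of NLTK. Tokens are joined with single spaces, so the punctuation spacing differs slightly from the expected output written above. A node whose only child is a nested "S" would be left empty by this pruning.

```python
from nltk.tree import Tree


def prune_nested(tree, label="S"):
    """Return a copy of `tree` with every nested `label` subtree removed.

    Hypothetical helper: the root is kept even if it carries `label`.
    """
    kept = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # skip the nested clause entirely
            kept.append(prune_nested(child, label))
        else:
            kept.append(child)  # a leaf (plain string)
    return Tree(tree.label(), kept)


def split_by_label(tree, label="S"):
    """One pruned copy per `label` subtree, in pre-order (hypothetical helper)."""
    return [prune_nested(sub, label)
            for sub in tree.subtrees(lambda t: t.label() == label)]


parse = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

parts = [' '.join(t.leaves()) for t in split_by_label(parse)]
print(parts)
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

Because the pruning returns `Tree` objects rather than strings, the pieces can still be inspected or transformed as trees before the leaves are joined.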
No answers yet.