Separating NLTK subtrees based on label

Posted 2024-09-26 22:12:27

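I have an NLTK parse tree, and I want to separate the tree's leaves based only on the "S" label. Note that the S subtrees should not overlap in their leaves.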

He won the Gusher Marathon, finishing in 30 minutes.

The CoreNLP parse tree is

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

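The idea is to extract the two "S" subtrees and their leaves without letting them overlap. So the expected result is "He won the Gusher Marathon" and "finishing in 30 minutes".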

# Tree manipulation
from nltk import Tree

# Extract phrases from a parsed (chunked) tree.
# phrase: label of the sub-trees to extract.
# Returns a list of deep copies; recursive.
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            # leaves() also returns the leaves of any nested S
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

The result I get is

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate this at the string level but at the tree level, so the expected output would be:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]
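One possible tree-level approach, offered as a minimal sketch rather than a definitive answer: for each subtree with the target label, collect its leaves but skip any nested subtree that carries the same label. The helper names `leaves_without_nested` and `split_by_label` are hypothetical and not part of NLTK; only `Tree.fromstring`, `Tree.label`, and `Tree.subtrees` are NLTK calls. Punctuation stays as separate tokens here, and the final period is attached only to the outer S because that is where it sits in the parse, so the output differs slightly from the expected strings above.

```python
from nltk import Tree

def leaves_without_nested(tree, label):
    """Collect the leaves of `tree`, skipping nested subtrees labelled `label`."""
    words = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # these leaves belong to the nested clause
            words.extend(leaves_without_nested(child, label))
        else:
            words.append(child)  # a leaf (plain string token)
    return words

def split_by_label(tree, label="S"):
    """Return one string per `label` subtree, in pre-order, without overlap."""
    return [
        " ".join(leaves_without_nested(sub, label))
        for sub in tree.subtrees(lambda t: t.label() == label)
    ]

tree = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

result = split_by_label(tree, "S")
print(result)
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

Because the result is a list built in traversal order instead of a set, the outer clause always comes first. If the punctuation tokens are unwanted, they could be filtered by their part-of-speech labels (`,` and `.`) inside the same recursion, still without touching the joined strings.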
