Sequitur on multiple strings

2 answers

User

#1 · edited 2024-10-05 14:26:51

I may have a solution using nltk. I tried sksequitur, but without success. You could try combining the two. Here is what I have:

import nltk

# 'B' marks the beginning of a sentence ("BEGIN") and 'E' marks its end ("END")
Corpus=['B','a','n','v','a','n','E','B','a','n','v','E','B','a','n','E']

nbSentences=Corpus.count('B')  # Number of sentences: one 'B' per sentence

print('Nb. of sentences: ',nbSentences)

C='C -> '+'T '*nbSentences  # The corpus C is a sequence of nbSentences sentences T

core_grammar=  """
 T -> BEGIN S END
 S -> NP VP | NP
 NP -> A N
 VP -> V NP | V
 A -> 'a'
 N -> 'n'
 V -> 'v'
 BEGIN -> 'B'
 END -> 'E' 
"""

# Generate the grammar:
gramm_str=C+core_grammar
print('grammar string: \n',gramm_str)

# Parsing:
simple_grammar = nltk.CFG.fromstring(gramm_str)
parser = nltk.ChartParser(simple_grammar)
trees = list(parser.parse(Corpus))  # parse() returns a generator, so materialize it once

#print(trees[0]) # plain-text output
trees[0].pretty_print() # ASCII-art tree
#trees[0].draw() # draw in a window
#trees[0] # display the tree in a Jupyter notebook

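The result is:

As you can see, every sentence is processed and each one gets its own tree (no crossing between sentences). However, if you have millions of sentences, this could become a problem, since the whole corpus is parsed as a single token sequence.

Best regards, 圣菲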
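One way to keep the input small is to split the corpus at its markers and handle one sentence at a time. The sketch below only does the splitting, using the standard library; the helper name `split_sentences` is hypothetical. Parsing each sentence separately would additionally need a grammar whose start symbol is `T` rather than `C`, which is an assumption and is not shown here.

```python
from typing import Iterable, Iterator, List


def split_sentences(tokens: Iterable[str], begin: str = "B", end: str = "E") -> Iterator[List[str]]:
    """Yield each sentence, including its begin and end markers."""
    current: List[str] = []
    for tok in tokens:
        if tok == begin:
            current = [tok]              # start a new sentence
        elif current:                    # tokens outside a sentence are ignored
            current.append(tok)
            if tok == end:
                yield current            # sentence complete
                current = []


corpus = ['B', 'a', 'n', 'v', 'a', 'n', 'E', 'B', 'a', 'n', 'v', 'E', 'B', 'a', 'n', 'E']
sentences = list(split_sentences(corpus))
for sentence in sentences:
    print(sentence)
```

Because the function is a generator, the sentences can be consumed one by one without holding a corpus-wide rule such as `C -> T T T ...` in memory.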

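User

#2 · edited 2024-10-05 14:26:51

Here is a new version that uses sksequitur to generate the grammar and then uses nltk to parse the corpus with it. Note the use of the special symbol Mark() (see the sksequitur documentation): "Mark symbols cannot be part of a rule."

The code is as follows: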
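To see why a boundary symbol matters, recall that Sequitur builds rules from pairs of adjacent symbols (digrams) that repeat. The sketch below is a simplified illustration of that idea and not sksequitur's algorithm: it counts digrams inside each sentence only, so pairs that would straddle a break, such as "nB" or "Ba", are never counted. The function name `digram_counts` is hypothetical.

```python
from collections import Counter


def digram_counts(text: str, separator: str = "B") -> Counter:
    """Count adjacent symbol pairs without crossing a sentence break."""
    counts: Counter = Counter()
    for sentence in text.split(separator):
        for left, right in zip(sentence, sentence[1:]):
            counts[left + right] += 1
    return counts


corpus = "anvanBanvBanBv"
counts = digram_counts(corpus)
# Digrams that occur more than once are the candidates for becoming rules.
repeated = {digram: n for digram, n in counts.items() if n > 1}
print(repeated)
```

Counting over the raw string instead would also report "nB" and "Ba" as repeated, which is the kind of cross-sentence rule that a marker is meant to rule out.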

import nltk
from sksequitur import Parser, Grammar, Mark

Corpus='anvanBanvBanBv' # Note: "B" is for "BREAK"

# Number of sentences in the corpus (segments separated by "B")
nbSentences=len(Corpus.split('B'))
print('Nb. of sentences in corpus: ',nbSentences)

# Replace every "B" with a Mark() so that no rule can span a sentence break
corpus=[]
for c in Corpus:
    if c=="B":
        corpus.append([Mark()])
    else:
        corpus.append(c)
print("Corpus ready to feed:")
print(corpus)


# Feed the corpus to Sequitur, one item at a time
seq_parser=Parser()
for c in corpus:
    seq_parser.feed(c)
grammar=Grammar(seq_parser.tree)
print("grammar: ")
print(grammar)


# Create Corpus rule: C for nltk
C='C -> '+'0 '*nbSentences+'\n'  # The corpus C is made of nbSentences "Tokens" 
print("Corpus rule:")
print(C)


# Nb of rules in the grammar:
nbRules=len(grammar)
print('Nb of rules: ',nbRules)


# Identify "Atoms" in the grammar (e.g. "a","n","v")
# An atom is anything that is not a mark or a production.
# Productions are recognised by their numeric name, so this assumes
# that the corpus itself contains no digit characters.

atoms=set()
for i in range(nbRules):
    for a in grammar[i]:
        if not isinstance(a,Mark) and not str(a).isdigit():
            atoms.add(a)
atoms=sorted(atoms)


# Create "Atoms" rules for nltk:
Atoms=''
for atom in atoms:
    Atoms += atom+" -> '"+atom+"'\n"
# Add "B" as an atom
Atoms += "B -> 'B'"
print("Atoms rules: ")
print(Atoms)

# Create the core grammar.
# Rule 0 holds the whole corpus: split it at every Mark so that each
# sentence becomes its own alternative "0 -> ... B".
Core=''
for i in range(nbRules):
    rule=grammar[i]
    ruleName=str(i)
    Core += ruleName+' ->'
    for symbol in rule:
        if isinstance(symbol,Mark):
            Core += ' B\n'+ruleName+' ->'
        else:
            Core += ' '+str(symbol)
    Core += '\n'
print("Core:")
print(Core)


newGrammar=C+Core+Atoms
print("Grammar for nltk:")
print(newGrammar)

simple_grammar = nltk.CFG.fromstring(newGrammar)
parser = nltk.ChartParser(simple_grammar)
trees = list(parser.parse(Corpus))  # the string is tokenized character by character


#print(trees[0]) # plain-text output
#trees[0].pretty_print() # ASCII-art tree
#trees[0] # display the tree in a Jupyter notebook

trees[0].draw() # draw in a window

The result is:

Is this what you had in mind?

Best regards, 圣菲
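For readers without sksequitur installed, the conversion step above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `rules` dictionary is written by hand and is not necessarily what sksequitur produces for this corpus, `SEP` is a hypothetical stand-in for Mark(), and `rules_to_cfg` is a hypothetical helper name. Integers stand for rule names and strings for atoms.

```python
SEP = object()  # hypothetical boundary marker, standing in for Mark()


def rules_to_cfg(rules, sep=SEP, sep_name="B"):
    """Turn {rule_number: [symbols]} into grammar text in nltk's CFG syntax."""
    lines, terminals = [], set()
    for name, body in rules.items():
        alternative = []
        for sym in body:
            if sym is sep:
                # A boundary closes the current sentence: emit it as one alternative.
                lines.append(f"{name} -> {' '.join(alternative + [sep_name])}")
                alternative = []
            else:
                alternative.append(str(sym))
                if isinstance(sym, str):
                    terminals.add(sym)
        lines.append(f"{name} -> {' '.join(alternative)}")
    # The corpus rule C lists one "0" per sentence.
    n_sentences = sum(sym is sep for sym in rules[0]) + 1
    header = "C -> " + " ".join(["0"] * n_sentences)
    lexicon = [f"{t} -> '{t}'" for t in sorted(terminals)]
    lexicon.append(f"{sep_name} -> '{sep_name}'")
    return "\n".join([header] + lines + lexicon)


# Hand-written rules for the corpus 'anvanBanvBanBv' (illustrative only).
rules = {0: [2, 1, SEP, 2, SEP, 1, SEP, "v"], 1: ["a", "n"], 2: [1, "v"]}
cfg_text = rules_to_cfg(rules)
print(cfg_text)
```

The returned text has the same shape as the `newGrammar` string built above, so it could in principle be passed to `nltk.CFG.fromstring`; that step is left out so the sketch runs without nltk.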
