Combining nltk.RegexpParser grammars

Posted 2024-09-26 17:43:19

As a next step in learning NLP, I am trying to implement a simple heuristic that improves on the results of plain n-grams.

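According to the Stanford collocations PDF linked below, passing "the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be 'phrases'" produces better results than simply taking the most frequently occurring bigrams. Source: Collocations, pages 143-144: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf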

The table on page 144 lists 7 tag patterns. In order, the NLTK POS tag equivalents are:

JJ NN

NN NN

JJ JJ NN

JJ NN NN

NN JJ NN

NN NN NN

NN IN NN
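
To make the filtering idea concrete, here is a minimal sketch in plain Python. It assumes the seven patterns are matched exactly against bigrams and trigrams, and it uses a hand-tagged token list rather than real tagger output. The names `PATTERNS` and `candidate_phrases` are made up for this illustration, and plural or proper-noun tags such as NNS and NNP are deliberately not handled.

```python
# The seven tag patterns, written as tuples of NLTK-style POS tags.
PATTERNS = {
    ("JJ", "NN"), ("NN", "NN"),
    ("JJ", "JJ", "NN"), ("JJ", "NN", "NN"), ("NN", "JJ", "NN"),
    ("NN", "NN", "NN"), ("NN", "IN", "NN"),
}

def candidate_phrases(tagged_tokens):
    """Return every bigram and trigram whose tag sequence is in PATTERNS."""
    phrases = []
    for n in (2, 3):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word for word, _ in window))
    return phrases

# Hand-tagged example; the tags are assumptions, not output of nltk.pos_tag.
tagged = [("the", "DT"), ("cumulative", "JJ"), ("distribution", "NN"),
          ("function", "NN"), ("of", "IN"), ("x", "NN")]

phrases = candidate_phrases(tagged)
print(phrases)
```

In this sketch the filter keeps overlapping candidates (both the bigrams and the trigram that contains them); ranking them by frequency would be a separate step.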

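In the code below, when I apply each of the grammars independently, I get the expected results. However, when I try to combine the same grammars, I do not get the results I want.

In my code, you can see that I uncomment one sentence, uncomment one grammar, run it, and check the result.

I should be able to combine all the sentences, run them through the combined grammar (only 3 in the code below), and get the desired results.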

My question is, how do I correctly combine grammars?

I assumed that combining grammars works like an "OR": find this pattern, or that pattern, and so on.

Thanks in advance.

import nltk

# The following sentences are correctly grouped with <JJ>*<NN>+. 
# Should see: 'linear function', 'regression coefficient', 'Gaussian random variable' and 
# 'cumulative distribution function'
SampleSentence = "In mathematics, the term linear function refers to two distinct, although related, notions"
#SampleSentence = "The regression coefficient is the slope of the line of the regression equation."
#SampleSentence = "In probability theory, Gaussian random variable is a very common continuous probability distribution."
#SampleSentence = "In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x."

# The following sentences are correctly grouped with <NN.?>*<V.*>*<NN>
# Should see 'mean squared error' and 'class probability function'.
#SampleSentence = "In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the difference between the estimator and what is estimated."
#SampleSentence = "The class probability function is interesting"

# The sentence below is correctly grouped with <NN.?>*<IN>*<NN.?>*. 
# should see 'degrees of freedom'.
#SampleSentence = "In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary."

SampleSentence = SampleSentence.lower()

print("\nFull sentence: ", SampleSentence, "\n")

tokens = nltk.word_tokenize(SampleSentence)
textTokens = nltk.Text(tokens)    

# Determine the POS tags.
POStagList = nltk.pos_tag(textTokens)    

# The following grammars work well *independently*
grammar = "NP: {<JJ>*<NN>+}"
#grammar = "NP: {<NN.?>*<V.*>*<NN>}"    
#grammar = "NP: {<NN.?>*<IN>*<NN.?>*}"


# Merge several grammars above into a single one below. 
# Note that all 3 correct grammars above are included below. 

'''
grammar = """
            NP: 
                {<JJ>*<NN>+}
                {<NN.?>*<V.*>*<NN>}
                {<NN.?>*<IN>*<NN.?>*}
        """
'''

cp = nltk.RegexpParser(grammar)

result = cp.parse(POStagList)

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP Subtree:", subtree)


1 Answer

Anonymous user

#1 · Posted 2024-09-26 17:43:19

If my comment is what you were looking for, then here is the answer:

grammar = """
            NP: 
                {<JJ>*<NN.?>*<V.|IN>*<NN.?>*}"""
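
As background on why listing the three rules together does not behave like an "OR", here is a small sketch. As far as I understand RegexpParser, the chunk rules in one stage are applied one after another, and a later rule only sees tokens that an earlier rule left unchunked, so rule order matters. The hand-written tags and the `np_chunks` helper are assumptions made for this illustration, not tagger output or part of NLTK.

```python
import nltk

# Hand-tagged tokens (assumed tags, not produced by nltk.pos_tag).
tagged = [("the", "DT"), ("mean", "NN"), ("squared", "VBD"),
          ("error", "NN"), ("is", "VBZ"), ("small", "JJ")]

def np_chunks(grammar, tagged_tokens):
    """Hypothetical helper: return the text of every NP chunk found."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# Narrow rule first: it chunks "mean" and "error" separately, so the
# broader rule finds no unchunked nouns left to combine.
narrow_first = r"""
NP: {<JJ>*<NN>+}
    {<NN.?>*<V.*>*<NN>}
"""

# Broad rule first: it gets to claim the whole noun-verb-noun span
# before the narrow rule runs.
broad_first = r"""
NP: {<NN.?>*<V.*>*<NN>}
    {<JJ>*<NN>+}
"""

chunks_narrow_first = np_chunks(narrow_first, tagged)
chunks_broad_first = np_chunks(broad_first, tagged)
print(chunks_narrow_first)
print(chunks_broad_first)
```

With these tags, the first ordering splits the phrase into separate one-word chunks while the second keeps "mean squared error" together, which is one reason a single merged pattern like the one above can be easier to reason about than a list of rules.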
