
NLTK Stanford Segmenter (Java): how to set the classpath

I am trying to use the Stanford Segmenter from NLTK's tokenize package. However, I run into problems even with the basic test case. Running the following:

# -*- coding: utf-8 -*-
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
seg = StanfordSegmenter()
seg.default_config('zh')
sent = u'这是斯坦福中文分词器测试'
print(seg.segment(sent))

This produces the following error: Error

I even added

import os
javapath = "C:/Users/User/Folder/stanford-segmenter-2017-06-09/*"
os.environ['CLASSPATH'] = javapath

... at the top of my code, but that does not seem to help.

How can I get the segmenter to run properly?
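One thing worth ruling out, as a sketch only: this assumes the jar lookup wants explicit paths to `.jar` files in `CLASSPATH` rather than a `*` wildcard, which I have not verified for this NLTK version. `build_classpath` is a hypothetical helper, not part of NLTK.

```python
import glob
import os


def build_classpath(jar_dir):
    """Join every .jar file in jar_dir into one CLASSPATH string (hypothetical helper)."""
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    # os.pathsep is ';' on Windows and ':' elsewhere, matching Java's convention.
    return os.pathsep.join(jars)


if __name__ == "__main__":
    # Directory taken from the question; adjust it to your own install.
    jar_dir = "C:/Users/User/Folder/stanford-segmenter-2017-06-09"
    classpath = build_classpath(jar_dir)
    if classpath:
        os.environ["CLASSPATH"] = classpath
    print(classpath or "no .jar files found in " + jar_dir)
```

If the printed string is empty, the directory itself is wrong, which is a cheaper thing to discover than a Java stack trace.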


1 answer

  1. # Answer 1

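    Note: this solution only applies to:

    • NLTK v3.2.5 (v3.2.6 will have a simpler interface)
    • Stanford CoreNLP (version >= 2016-10-31)

    First, make sure Java 8 is properly installed. Once Stanford CoreNLP works on the command line, the Stanford CoreNLP API in NLTK v3.2.5 can be used as follows.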

    Note: you must start the CoreNLP server in a terminal before using the new CoreNLP API in NLTK.
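Since everything below depends on the server already running, a quick reachability check can save some confusion. This is a minimal sketch: `server_is_up` is a hypothetical helper, and it only confirms that something accepts TCP connections on the port, not that it is actually CoreNLP.

```python
import socket


def server_is_up(host="localhost", port=9000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    print("CoreNLP server reachable:", server_is_up())
```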

    English

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000
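If you would rather start the server from Python, the same command can be assembled as an argument list. This is a sketch under assumptions: `corenlp_server_command` is a hypothetical helper, and the flags are just the ones from the shell command above (with `-Xmx` for the heap size).

```python
def corenlp_server_command(port, annotators, properties=None, memory="4g"):
    """Build the argument list for the server launch command (hypothetical helper)."""
    cmd = ["java", "-Xmx" + memory, "-cp", "*",
           "edu.stanford.nlp.pipeline.StanfordCoreNLPServer"]
    if properties:
        # Only non-English setups pass a properties file.
        cmd += ["-serverProperties", properties]
    cmd += ["-preload", ",".join(annotators),
            "-status_port", str(port), "-port", str(port),
            "-timeout", "15000"]
    return cmd


if __name__ == "__main__":
    cmd = corenlp_server_command(
        9000, ["tokenize", "ssplit", "pos", "lemma", "parse", "depparse"])
    print(" ".join(cmd))
    # To actually launch it, run from the unzipped CoreNLP directory, e.g.:
    #   subprocess.Popen(cmd, cwd="stanford-corenlp-full-2016-10-31")
```

Passing a list without a shell hands the `*` to Java unexpanded, which is what the quoted `"*"` does in the terminal version.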
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> stpos, stner = CoreNLPPOSTagger(), CoreNLPNERTagger()
    >>> stpos.tag('What is the airspeed of an unladen swallow ?'.split())
    [(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]
    >>> stner.tag('Rami Eid is studying at Stony Brook University in NY'.split())
    [(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]
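To see what travels over the wire, here is a rough sketch of talking to the server's HTTP endpoint directly. It is not NLTK's actual code; `build_request` and `tagged_tokens` are hypothetical helpers, and the sample dictionary is hand-written in the shape of the server's JSON output rather than a real response.

```python
import json
import urllib.parse
import urllib.request


def build_request(text, url="http://localhost:9000",
                  annotators="tokenize,ssplit,pos"):
    """Build a POST request carrying the text plus a JSON 'properties' parameter."""
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return urllib.request.Request(url + "/?" + query, data=text.encode("utf-8"))


def tagged_tokens(response, field="pos"):
    """Flatten the parsed JSON into (word, tag) pairs; use field='ner' for entities."""
    return [(tok["word"], tok[field])
            for sent in response["sentences"]
            for tok in sent["tokens"]]


if __name__ == "__main__":
    # Hand-written sample mimicking the JSON shape, not real server output.
    sample = {"sentences": [{"tokens": [{"word": "What", "pos": "WP"},
                                        {"word": "is", "pos": "VBZ"}]}]}
    print(tagged_tokens(sample))
    # With a server running, a live call would look like:
    #   with urllib.request.urlopen(build_request("What is this ?")) as resp:
    #       print(tagged_tokens(json.load(resp)))
```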
    

    Chinese

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-chinese.properties \
    -preload tokenize,ssplit,pos,lemma,ner,parse \
    -status_port 9001  -port 9001 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> from nltk.tokenize.stanford import CoreNLPTokenizer
    >>> stpos, stner = CoreNLPPOSTagger('http://localhost:9001'), CoreNLPNERTagger('http://localhost:9001')
    >>> sttok = CoreNLPTokenizer('http://localhost:9001')
    
    >>> sttok.tokenize(u'我家没有电脑。')
    ['我家', '没有', '电脑', '。']
    
    # Without segmentation (input to `raw_string_parse()` is a list of single char strings)
    >>> stpos.tag(u'我家没有电脑。')
    [('我', 'PN'), ('家', 'NN'), ('没', 'AD'), ('有', 'VV'), ('电', 'NN'), ('脑', 'NN'), ('。', 'PU')]
    # With segmentation
    >>> stpos.tag(sttok.tokenize(u'我家没有电脑。'))
    [('我家', 'NN'), ('没有', 'VE'), ('电脑', 'NN'), ('。', 'PU')]
    
    # Without segmentation (input to `raw_string_parse()` is a list of single char strings)
    >>> stner.tag(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。')
    [('奥', 'GPE'), ('巴', 'GPE'), ('马', 'GPE'), ('与', 'O'), ('迈', 'O'), ('克', 'PERSON'), ('尔', 'PERSON'), ('·', 'O'), ('杰', 'O'), ('克', 'O'), ('逊', 'O'), ('一', 'NUMBER'), ('起', 'O'), ('去', 'O'), ('杂', 'O'), ('货', 'O'), ('店', 'O'), ('购', 'O'), ('物', 'O'), ('。', 'O')]
    # With segmentation
    >>> stner.tag(sttok.tokenize(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。'))
    [('奥巴马', 'PERSON'), ('与', 'O'), ('迈克尔·杰克逊', 'PERSON'), ('一起', 'O'), ('去', 'O'), ('杂货店', 'O'), ('购物', 'O'), ('。', 'O')]
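The per-character results above follow from how Python iterates over strings. A small illustration, assuming the tagger simply iterates over whatever it is given (which is consistent with the output shown, but `as_tagger_input` is only a hypothetical stand-in):

```python
def as_tagger_input(tokens_or_text):
    """Show the items a tagger sees when it iterates over its argument."""
    return list(tokens_or_text)


text = u'我家没有电脑。'
# A plain string yields one item per character.
print(as_tagger_input(text))
# A pre-segmented list yields one item per word.
print(as_tagger_input(['我家', '没有', '电脑', '。']))
```

So for Chinese, run the tokenizer first and pass its list to `tag()`, exactly as the "With segmentation" calls do.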
    

    German

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    wget http://nlp.stanford.edu/software/stanford-german-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-german.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9002  -port 9002 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> stpos, stner = CoreNLPPOSTagger('http://localhost:9002'), CoreNLPNERTagger('http://localhost:9002')
    
    >>> stpos.tag('Ich bin schwanger'.split())
    [('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
    
    >>> stner.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
    [('Donald', 'I-PER'), ('Trump', 'I-PER'), ('besuchte', 'O'), ('Angela', 'I-PER'), ('Merkel', 'I-PER'), ('in', 'O'), ('Berlin', 'I-LOC'), ('.', 'O')]
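The NER output is one tag per token, so multi-word names come back in pieces. A minimal sketch for merging them, with `group_entities` as a hypothetical helper; note that it would also merge two different entities of the same type if they happened to be adjacent.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens that share the same non-O tag."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities


tagged = [('Donald', 'I-PER'), ('Trump', 'I-PER'), ('besuchte', 'O'),
          ('Angela', 'I-PER'), ('Merkel', 'I-PER'), ('in', 'O'),
          ('Berlin', 'I-LOC'), ('.', 'O')]
print(group_entities(tagged))
# [('Donald Trump', 'I-PER'), ('Angela Merkel', 'I-PER'), ('Berlin', 'I-LOC')]
```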
    

    Spanish

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-spanish.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9003  -port 9003 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> stpos, stner = CoreNLPPOSTagger('http://localhost:9003'), CoreNLPNERTagger('http://localhost:9003')
    
    >>> stner.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [(u'Barack', u'PERS'), (u'Obama', u'PERS'), (u'sali\xf3', u'O'), (u'con', u'O'), (u'Michael', u'PERS'), (u'Jackson', u'PERS'), (u'.', u'O')]
    
    >>> stpos.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [(u'Barack', u'np00000'), (u'Obama', u'np00000'), (u'sali\xf3', u'vmis000'), (u'con', u'sp000'), (u'Michael', u'np00000'), (u'Jackson', u'np00000'), (u'.', u'fp')]
    

    French

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    wget http://nlp.stanford.edu/software/stanford-french-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-french.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9004  -port 9004 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger
    >>> stpos = CoreNLPPOSTagger('http://localhost:9004')
    >>> stpos.tag('Je suis enceinte'.split())
    [(u'Je', u'CLS'), (u'suis', u'V'), (u'enceinte', u'NC')]
    

    Arabic

    In the terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2016-10-31-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties
    
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-arabic.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9005  -port 9005 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger
    >>> from nltk.tokenize.stanford import CoreNLPTokenizer
    >>> sttok = CoreNLPTokenizer('http://localhost:9005')
    >>> stpos = CoreNLPPOSTagger('http://localhost:9005')
    >>> text = u'انا حامل'
    >>> stpos.tag(sttok.tokenize(text))
    [('انا', 'DET'), ('حامل', 'NC')]
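With one server per language, it can help to keep the ports in one place. A small sketch: the port numbers are the ones used in the sections above, while `CORENLP_PORTS` and `server_url` are hypothetical names.

```python
# Ports as used in the sections above; change them if your servers run elsewhere.
CORENLP_PORTS = {
    "english": 9000,
    "chinese": 9001,
    "german": 9002,
    "spanish": 9003,
    "french": 9004,
    "arabic": 9005,
}


def server_url(language, host="localhost"):
    """Return the URL to hand to a CoreNLP tagger or tokenizer for a language."""
    return "http://{}:{}".format(host, CORENLP_PORTS[language.lower()])


print(server_url("german"))
```

The result can be passed straight to the constructors used above, for example `CoreNLPPOSTagger(server_url('german'))`.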