Official Python interface
Detailed description of the stanford-corenlp Python project
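This package contains a Python interface for Stanford CoreNLP, including a reference implementation for talking to the Stanford CoreNLP server. The package also contains a base class for exposing a Python-based annotation provider (for example, your favorite neural NER system) to the CoreNLP pipeline via a lightweight service.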
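To use this package, first download the official Java CoreNLP release, unzip it, and define an environment variable $CORENLP_HOME that points to the unzipped directory.
You can also install this package from PyPI with pip install stanford-corenlp.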
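Before launching anything, it can help to confirm that $CORENLP_HOME is set as described above. The following is a minimal sketch using only the standard library. The helper name resolve_corenlp_home is hypothetical and not part of the stanford-corenlp package, and the check only confirms that the directory exists, not that it holds a valid CoreNLP release.

```python
import os


def resolve_corenlp_home(environ=None):
    """Return the directory named by $CORENLP_HOME, or raise if unusable.

    NOTE: resolve_corenlp_home is a hypothetical helper written for this
    example; it is not part of the stanford-corenlp package.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    environ = os.environ if environ is None else environ
    home = environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    if not os.path.isdir(home):
        raise RuntimeError("CORENLP_HOME is not a directory: %s" % home)
    return home
```

Calling resolve_corenlp_home() with no arguments reads the real process environment; passing a dictionary is only meant for testing.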
Command Line Usage
Probably the easiest way to use this package is through the annotate command-line utility:
    usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                    [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                    [-p PROPS [PROPS ...]]

    Annotate data

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT, --input INPUT
                            Input file to process; each line contains one document
                            (default: stdin)
      -o OUTPUT, --output OUTPUT
                            File to write annotations to (default: stdout)
      -f {json}, --format {json}
                            Output format
      -a ANNOTATORS [ANNOTATORS ...], --annotators ANNOTATORS [ANNOTATORS ...]
                            A list of annotators
      -s, --sentence-mode   Assume each line of input is a sentence.
      -v, --verbose-server  Server is made verbose
      -m MEMORY, --memory MEMORY
                            Memory to use for the server
      -p PROPS [PROPS ...], --props PROPS [PROPS ...]
                            Properties as a list of key=value pairs
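We recommend using annotate together with the wonderful jq command to process the output. For example, given a file with one sentence per line, the following command produces the equivalent space-separated tokens: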
cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt
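If jq is not available, the same post-processing can be sketched in Python. This is a minimal example under stated assumptions: it assumes annotate emits one JSON object per line with a top-level "tokens" list whose items carry "originalText", which is the shape implied by the jq filter above. The function name tokens_per_line and the sample input are made up for illustration.

```python
import json


def tokens_per_line(json_lines):
    """Yield one space-separated token string per JSON document.

    Assumes each non-empty line is a JSON object with a top-level
    "tokens" list whose items have an "originalText" field.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            # Skip blank lines between documents.
            continue
        doc = json.loads(line)
        yield " ".join(tok["originalText"] for tok in doc["tokens"])


# Hand-written sample input, not captured output from annotate.
sample = ['{"tokens": [{"originalText": "Hello"}, {"originalText": "world"}]}']
print(list(tokens_per_line(sample)))
```

Passing sys.stdin instead of the sample list would let the function sit at the end of the same shell pipeline.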
Annotation Server Usage
    import corenlp

    text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

    # We assume that you've downloaded Stanford CoreNLP and defined an environment
    # variable $CORENLP_HOME that points to the unzipped directory.
    # The code below will launch StanfordCoreNLPServer in the background
    # and communicate with the server to annotate the sentence.
    # All calls on client stay inside the with block.
    with corenlp.CoreNLPClient(
            annotators="tokenize ssplit pos lemma ner depparse".split()) as client:
        ann = client.annotate(text)

        # You can access annotations using ann.
        sentence = ann.sentence[0]

        # The corenlp.to_text function is a helper function that
        # reconstructs a sentence from tokens.
        assert corenlp.to_text(sentence) == text

        # You can access any property within a sentence.
        print(sentence.text)

        # Likewise for tokens
        token = sentence.token[0]
        print(token.lemma)

        # Use tokensregex patterns to find who wrote a sentence.
        pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
        matches = client.tokensregex(text, pattern)
        # sentences contains a list with matches for each sentence.
        assert len(matches["sentences"]) == 1
        # length tells you whether or not there are any matches in this sentence.
        assert matches["sentences"][0]["length"] == 1
        # You can access matches like most regex groups.
        # The list has a single entry, so the sentence index is 0.
        assert matches["sentences"][0]["0"]["text"] == "Chris wrote a simple sentence"
        assert matches["sentences"][0]["0"]["1"]["text"] == "Chris"

        # Use semgrex patterns to directly find who wrote what.
        pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
        matches = client.semgrex(text, pattern)
        # sentences contains a list with matches for each sentence.
        assert len(matches["sentences"]) == 1
        # length tells you whether or not there are any matches in this sentence.
        assert matches["sentences"][0]["length"] == 1
        # You can access matches like most regex groups.
        assert matches["sentences"][0]["0"]["text"] == "wrote"
        assert matches["sentences"][0]["0"]["$subject"]["text"] == "Chris"
        assert matches["sentences"][0]["0"]["$object"]["text"] == "sentence"
For more examples, see test_client.py and test_protobuf.py. Props to @Dan Zheng for the TokensRegex/Semgrex support.
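The match dictionaries returned in the example above can be walked generically. The sketch below is an illustration only: it assumes each entry of matches["sentences"] has a "length" count and stores its matches under the string keys "0", "1", and so on up to length - 1, which is inferred from the single-match example rather than from a documented schema. The helper name match_texts and the sample dictionary are hypothetical.

```python
def match_texts(matches):
    """Collect the text of every top-level match, sentence by sentence.

    Assumes each sentence entry has a "length" count and keeps its
    matches under the string keys "0" .. str(length - 1). This layout is
    inferred from the example, not from a documented schema.
    """
    texts = []
    for sentence in matches["sentences"]:
        for i in range(sentence["length"]):
            texts.append(sentence[str(i)]["text"])
    return texts


# Hand-written sample mirroring the tokensregex result described in the
# example; it is not captured output from a live server.
sample = {
    "sentences": [
        {
            "length": 1,
            "0": {
                "text": "Chris wrote a simple sentence",
                "1": {"text": "Chris"},
            },
        }
    ]
}
print(match_texts(sample))
```

Capture groups such as "1" or "$subject" sit inside each match and would need their own lookup; the helper deliberately returns only the full match text.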
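Annotation Service Usage
NOTE: The annotation service allows users to provide a custom annotator to be used by the CoreNLP pipeline. Unfortunately, it relies on experimental code internal to the Stanford CoreNLP project that is not yet available for public use.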
    import corenlp
    from .happyfuntokenizer import Tokenizer

    class HappyFunTokenizer(Tokenizer, corenlp.Annotator):
        def __init__(self, preserve_case=False):
            Tokenizer.__init__(self, preserve_case)
            corenlp.Annotator.__init__(self)

        @property
        def name(self):
            """
            Name of the annotator (used by CoreNLP)
            """
            return "happyfun"

        @property
        def requires(self):
            """
            Requires has to specify all the annotations required before we
            are called.
            """
            return []

        @property
        def provides(self):
            """
            The set of annotations guaranteed to be provided when we are done.
            NOTE: that these annotations are either fully qualified Java
            class names or refer to nested classes of
            edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
            """
            return ["TextAnnotation",
                    "TokensAnnotation",
                    "TokenBeginAnnotation",
                    "TokenEndAnnotation",
                    "CharacterOffsetBeginAnnotation",
                    "CharacterOffsetEndAnnotation",
                   ]

        def annotate(self, ann):
            """
            @ann: is a protobuf annotation object.
            Actually populate @ann with tokens.
            """
            buf, beg_idx, end_idx = ann.text.lower(), 0, 0
            for i, word in enumerate(self.tokenize(ann.text)):
                token = ann.sentencelessToken.add()
                # These are the bare minimum required for the TokenAnnotation
                token.word = word
                token.tokenBeginIndex = i
                token.tokenEndIndex = i + 1

                # Seek into the text until you can find this word.
                try:
                    # Try to update beginning index
                    beg_idx = buf.index(word, beg_idx)
                except ValueError:
                    # Give up -- keep the current position as a rough guess
                    pass
                # The end offset follows from the begin offset in both cases.
                end_idx = beg_idx + len(word)

                token.beginChar = beg_idx
                token.endChar = end_idx

                beg_idx, end_idx = end_idx, end_idx

    annotator = HappyFunTokenizer()
    # Calling .start() will launch the annotator as a service running on
    # port 8432 by default.
    annotator.start()

    # annotator.properties contains all the right properties for
    # Stanford CoreNLP to use this annotator.
    with corenlp.CoreNLPClient(properties=annotator.properties,
                               annotators="happyfun ssplit pos".split()) as client:
        ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

        tokens = [t.word for t in ann.sentence[0].token]
        print(tokens)
For more examples, see test_annotator.py.
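The character-offset bookkeeping inside annotate can be tried out without a running server. The sketch below isolates that step as a plain function; align_tokens is a hypothetical name, it works on ordinary strings rather than protobuf tokens, and it is a simplified illustration rather than the package's own implementation.

```python
def align_tokens(text, words):
    """Return a (begin, end) character span for each word, in order.

    Searches left to right, so repeated words map to successive
    occurrences. A word that cannot be found keeps the current position
    as a best-effort guess, so its span may not match the text.
    """
    buf = text.lower()
    spans = []
    pos = 0
    for word in words:
        try:
            # Look for the word at or after the end of the previous token.
            begin = buf.index(word.lower(), pos)
        except ValueError:
            begin = pos
        end = begin + len(word)
        spans.append((begin, end))
        pos = end
    return spans


print(align_tokens("The cat saw the cat", ["the", "cat", "saw", "the", "cat"]))
```

Searching from the end of the previous token is what keeps the second "the" and "cat" from being matched to their first occurrences.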