Official Python interface

Detailed description of the stanford-corenlp Python project



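This package contains a Python interface for Stanford CoreNLP, including a reference implementation of a client for the Stanford CoreNLP server. The package also contains a base class for exposing a Python-based annotation provider (for example, your favorite neural NER system) to the CoreNLP pipeline through a lightweight service.

To use this package, first download the official Java CoreNLP release, unzip it, and define an environment variable $CORENLP_HOME that points to the unzipped directory.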
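Before starting a client, it can help to confirm that $CORENLP_HOME is set and points at a real directory. The following is a minimal sketch of such a check; the helper name resolve_corenlp_home is hypothetical and is not part of the stanford-corenlp package.

```python
import os


def resolve_corenlp_home(env=None):
    """Return the directory named by $CORENLP_HOME, or raise RuntimeError.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to test. This helper is illustrative only and is not provided
    by the corenlp package.
    """
    env = os.environ if env is None else env
    home = env.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    if not os.path.isdir(home):
        raise RuntimeError("CORENLP_HOME is not a directory: " + home)
    return home
```

Calling resolve_corenlp_home() at the top of a script turns a missing or mistyped path into an early, readable error.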

You can also install this package from PyPI using pip install stanford-corenlp.

Command line usage

Probably the easiest way to use this package is through the annotate command-line utility:

usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                [-p PROPS [PROPS ...]]

Annotate data

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file to process; each line contains one document
                        (default: stdin)
  -o OUTPUT, --output OUTPUT
                        File to write annotations to (default: stdout)
  -f {json}, --format {json}
                        Output format
  -a ANNOTATORS [ANNOTATORS ...], --annotators ANNOTATORS [ANNOTATORS ...]
                        A list of annotators
  -s, --sentence-mode   Assume each line of input is a sentence.
  -v, --verbose-server  Server is made verbose
  -m MEMORY, --memory MEMORY
                        Memory to use for the server
  -p PROPS [PROPS ...], --props PROPS [PROPS ...]
                        Properties as a list of key=value pairs

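We recommend using annotate together with the wonderful jq command to process the output. For example, given a file with one sentence on each line, the following command produces the tokens for each sentence: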

cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt
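The same post-processing can be done in Python when jq is not available. This sketch assumes that each annotation is a JSON object with a top-level tokens list whose entries carry an originalText field, which is the shape the jq filter relies on; the function name and the sample string are hypothetical. Note that it joins the tokens with spaces, whereas the jq filter emits a JSON array.

```python
import json


def tokens_from_json_line(line):
    """Join the originalText of every token in one JSON-encoded annotation."""
    doc = json.loads(line)
    return " ".join(tok["originalText"] for tok in doc["tokens"])


# Hand-written sample in the assumed shape, not real server output.
sample = '{"tokens": [{"originalText": "Hello"}, {"originalText": "world"}, {"originalText": "."}]}'
print(tokens_from_json_line(sample))  # Hello world .
```

If annotate writes one JSON object per line, the function could be applied to each line read from sys.stdin.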

Annotation Server Usage

import corenlp

text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

# We assume that you've downloaded Stanford CoreNLP and defined an environment
# variable $CORENLP_HOME that points to the unzipped directory.
# The code below will launch StanfordCoreNLPServer in the background
# and communicate with the server to annotate the sentence.
# Calls that use the client are kept inside the with block.
with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma ner depparse".split()) as client:
    ann = client.annotate(text)

    # You can access annotations using ann.
    sentence = ann.sentence[0]

    # The corenlp.to_text function is a helper function that
    # reconstructs a sentence from tokens.
    assert corenlp.to_text(sentence) == text

    # You can access any property within a sentence.
    print(sentence.text)

    # Likewise for tokens
    token = sentence.token[0]
    print(token.lemma)

    # Use tokensregex patterns to find who wrote a sentence.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 1
    # length tells you whether or not there are any matches in this sentence.
    assert matches["sentences"][0]["length"] == 1
    # You can access matches like most regex groups.
    # The text has a single sentence, so its matches live at index 0.
    assert matches["sentences"][0]["0"]["text"] == "Chris wrote a simple sentence"
    assert matches["sentences"][0]["0"]["1"]["text"] == "Chris"

    # Use semgrex patterns to directly find who wrote what.
    pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
    matches = client.semgrex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 1
    # length tells you whether or not there are any matches in this sentence.
    assert matches["sentences"][0]["length"] == 1
    # You can access matches like most regex groups.
    assert matches["sentences"][0]["0"]["text"] == "wrote"
    assert matches["sentences"][0]["0"]["$subject"]["text"] == "Chris"
    assert matches["sentences"][0]["0"]["$object"]["text"] == "sentence"

For more examples, see test_client.py and test_protobuf.py. Props to @Dan Zheng for TokensRegex/Semgrex support.
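Match results come back as nested dictionaries, so a small helper can make named semgrex nodes easier to read. This is only a sketch. It assumes the layout the server example indexes into: a sentences list, a length entry that counts the matches in a sentence, numeric string keys per match, and $-prefixed keys for named nodes. The helper name is hypothetical, and the sample dictionary is hand-written rather than captured from a server.

```python
def named_nodes(matches):
    """Collect {name: text} for every match in every sentence."""
    results = []
    for sentence in matches["sentences"]:
        # Assumption: "length" is the number of matches, keyed "0", "1", ...
        for i in range(sentence["length"]):
            match = sentence[str(i)]
            results.append({
                key[1:]: value["text"]  # strip the leading "$"
                for key, value in match.items()
                if key.startswith("$")
            })
    return results


# Hand-written sample in the assumed shape, not real server output.
sample = {
    "sentences": [{
        "0": {
            "text": "wrote",
            "$subject": {"text": "Chris"},
            "$object": {"text": "sentence"},
        },
        "length": 1,
    }]
}
print(named_nodes(sample))  # [{'subject': 'Chris', 'object': 'sentence'}]
```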

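Annotation Service Usage

NOTE: The annotation service allows users to provide a custom annotator to be used by the CoreNLP pipeline. Unfortunately, it relies on experimental code internal to the Stanford CoreNLP project that is not yet available for public use.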

import corenlp
from .happyfuntokenizer import Tokenizer


class HappyFunTokenizer(Tokenizer, corenlp.Annotator):
    def __init__(self, preserve_case=False):
        Tokenizer.__init__(self, preserve_case)
        corenlp.Annotator.__init__(self)

    @property
    def name(self):
        """
        Name of the annotator (used by CoreNLP)
        """
        return "happyfun"

    @property
    def requires(self):
        """
        Requires has to specify all the annotations required before we
        are called.
        """
        return []

    @property
    def provides(self):
        """
        The set of annotations guaranteed to be provided when we are done.
        NOTE: that these annotations are either fully qualified Java
        class names or refer to nested classes of
        edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
        """
        return [
            "TextAnnotation",
            "TokensAnnotation",
            "TokenBeginAnnotation",
            "TokenEndAnnotation",
            "CharacterOffsetBeginAnnotation",
            "CharacterOffsetEndAnnotation",
        ]

    def annotate(self, ann):
        """
        @ann: is a protobuf annotation object.
        Actually populate @ann with tokens.
        """
        buf, beg_idx, end_idx = ann.text.lower(), 0, 0
        for i, word in enumerate(self.tokenize(ann.text)):
            token = ann.sentencelessToken.add()
            # These are the bare minimum required for the TokenAnnotation
            token.word = word
            token.tokenBeginIndex = i
            token.tokenEndIndex = i + 1

            # Seek into the text until you can find this word.
            try:
                # Try to update beginning index
                beg_idx = buf.index(word, beg_idx)
            except ValueError:
                # Give up -- keep the previous offset as a best guess
                pass
            # Compute the end offset whether or not the word was found.
            end_idx = beg_idx + len(word)

            token.beginChar = beg_idx
            token.endChar = end_idx

            beg_idx, end_idx = end_idx, end_idx


annotator = HappyFunTokenizer()
# Calling .start() will launch the annotator as a service running on
# port 8432 by default.
annotator.start()

# annotator.properties contains all the right properties for
# Stanford CoreNLP to use this annotator.
with corenlp.CoreNLPClient(properties=annotator.properties, annotators="happyfun ssplit pos".split()) as client:
    ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

    tokens = [t.word for t in ann.sentence[0].token]
    print(tokens)

For more examples, see test_annotator.py.
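The offset bookkeeping inside annotate can be tried out without a server. The sketch below reproduces only the character-alignment idea, scanning forward through the text for each token. The function name is hypothetical, it returns plain tuples rather than filling protobuf tokens, and it falls back to the current scan position when a token cannot be found verbatim.

```python
def align_tokens(text, words):
    """Return (word, begin_char, end_char) triples by scanning text left to right."""
    spans = []
    cursor = 0
    for word in words:
        try:
            # Search from the end of the previous token onwards.
            begin = text.index(word, cursor)
        except ValueError:
            # Token not found verbatim; fall back to the current position.
            begin = cursor
        end = begin + len(word)
        spans.append((word, begin, end))
        cursor = end
    return spans


print(align_tokens("this is a tweet", ["this", "is", "a", "tweet"]))
# [('this', 0, 4), ('is', 5, 7), ('a', 8, 9), ('tweet', 10, 15)]
```

Searching from the cursor rather than from the start keeps repeated substrings (such as the "is" inside "this") from being matched twice.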
