


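This package contains a Python interface for Stanford CoreNLP, including a reference implementation that talks to the Stanford CoreNLP server. The package also contains a base class for exposing a Python-based annotation provider (for example, your favorite neural NER system) to the CoreNLP pipeline through a lightweight service.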

To use this package, first download the official Java CoreNLP release, unzip it, and define an environment variable $CORENLP_HOME that points to the unzipped directory.
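Before starting the client, it can help to confirm that $CORENLP_HOME is set and points to a directory. The sketch below is one way to do that; the helper name find_corenlp_home is hypothetical and is not part of this package.

```python
import os


def find_corenlp_home(environ=None):
    """Return the directory named by $CORENLP_HOME, or raise a helpful error.

    Hypothetical helper for illustration only; not part of the package.
    """
    environ = os.environ if environ is None else environ
    home = environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    if not os.path.isdir(home):
        raise RuntimeError("CORENLP_HOME is not a directory: %s" % home)
    return home
```

Calling find_corenlp_home() with no arguments reads the real process environment; passing a dict makes the check easy to test.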

You can also install this package with pip install stanford-corenlp




usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                [-p PROPS [PROPS ...]]

Annotate data

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file to process; each line contains one document
                        (default: stdin)
  -o OUTPUT, --output OUTPUT
                        File to write annotations to (default: stdout)
  -f {json}, --format {json}
                        Output format
  -a ANNOTATORS [ANNOTATORS ...]
                        A list of annotators
  -s, --sentence-mode   Assume each line of input is a sentence.
  -v, --verbose-server  Server is made verbose
  -m MEMORY, --memory MEMORY
                        Memory to use for the server
  -p PROPS [PROPS ...], --props PROPS [PROPS ...]
                        Properties as a list of key=value pairs

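We recommend using annotate together with the wonderful jq command to process the output. For example, given a file with one sentence per line, the following command produces the equivalent space-separated tokens: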

cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt
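If jq is not available, the same extraction can be sketched in Python. This assumes each line of annotate output is a JSON object with a top-level "tokens" list whose entries carry an "originalText" field, which is the shape the jq filter above implies. The function name tokens_line is hypothetical.

```python
import json


def tokens_line(json_line):
    """Join the originalText of every token in one JSON document with spaces."""
    doc = json.loads(json_line)
    return " ".join(tok["originalText"] for tok in doc["tokens"])


if __name__ == "__main__":
    # A hand-written sample in the assumed shape, not real server output.
    sample = '{"tokens": [{"originalText": "Hello"}, {"originalText": "world"}]}'
    print(tokens_line(sample))
```

Unlike the jq filter, which emits a JSON array per document, this returns a single space-separated string per document.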


import corenlp

text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

# We assume that you've downloaded Stanford CoreNLP and defined an environment
# variable $CORENLP_HOME that points to the unzipped directory.
# The code below will launch StanfordCoreNLPServer in the background
# and communicate with the server to annotate the sentence.
with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma ner depparse".split()) as client:
    ann = client.annotate(text)

    # You can access annotations using ann.
    sentence = ann.sentence[0]

    # The corenlp.to_text function is a helper function that
    # reconstructs a sentence from tokens.
    assert corenlp.to_text(sentence) == text

    # You can access any property within a sentence.
    print(sentence.text)

    # Likewise for tokens
    token = sentence.token[0]
    print(token.lemma)

    # Use tokensregex patterns to find who wrote a sentence.
    # These calls need a live server, so they stay inside the with block.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 1
    # length tells you whether or not there are any matches in this sentence.
    assert matches["sentences"][0]["length"] == 1
    # You can access matches like most regex groups.
    assert matches["sentences"][0]["0"]["text"] == "Chris wrote a simple sentence"
    assert matches["sentences"][0]["0"]["1"]["text"] == "Chris"

    # Use semgrex patterns to directly find who wrote what.
    pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
    matches = client.semgrex(text, pattern)
    # sentences contains a list with matches for each sentence.
    assert len(matches["sentences"]) == 1
    # length tells you whether or not there are any matches in this sentence.
    assert matches["sentences"][0]["length"] == 1
    # You can access matches like most regex groups.
    assert matches["sentences"][0]["0"]["text"] == "wrote"
    assert matches["sentences"][0]["0"]["$subject"]["text"] == "Chris"
    assert matches["sentences"][0]["0"]["$object"]["text"] == "sentence"
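The match dictionaries returned by tokensregex and semgrex in the example above can be flattened with a small helper. The sketch below assumes that each entry of "sentences" stores its matches under the string keys "0", "1", and so on, up to "length"; only the key "0" is shown in the example, so the rest of that layout is an assumption. The function name match_texts is hypothetical.

```python
def match_texts(matches):
    """Return the matched text of every match, sentence by sentence."""
    texts = []
    for sentence in matches["sentences"]:
        # Assumed layout: "length" is the match count and matches are
        # stored under the string keys "0", "1", ..., str(length - 1).
        for i in range(sentence["length"]):
            texts.append(sentence[str(i)]["text"])
    return texts


# A hand-built dictionary in the shape used by the example above.
sample = {
    "sentences": [
        {
            "length": 1,
            "0": {"text": "Chris wrote a simple sentence", "1": {"text": "Chris"}},
        }
    ]
}
found = match_texts(sample)
```

Capture groups ("1" for tokensregex, "$subject" and "$object" for semgrex) remain available on each match dictionary if you need more than the full match text.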

For more examples, see test_client.py and test_protobuf.py. Props to @Dan Zheng for the TokensRegex/Semgrex support.


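NOTE: The annotation service allows users to provide a custom annotator to be used by the CoreNLP pipeline. Unfortunately, it relies on experimental code internal to the Stanford CoreNLP project that is not yet available for public use.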

import corenlp
# Tokenizer is assumed to be a local class that provides tokenize(text).
from happyfuntokenizer import Tokenizer


class HappyFunTokenizer(Tokenizer, corenlp.Annotator):
    def __init__(self):
        Tokenizer.__init__(self)
        corenlp.Annotator.__init__(self)

    @property
    def name(self):
        """
        Name of the annotator (used by CoreNLP)
        """
        return "happyfun"

    @property
    def requires(self):
        """
        Requires has to specify all the annotations required before we
        are called.
        """
        return []

    @property
    def provides(self):
        """
        The set of annotations guaranteed to be provided when we are done.
        NOTE: that these annotations are either fully qualified Java
        class names or refer to nested classes of
        edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
        """
        return ["TextAnnotation",
                "TokensAnnotation",
                "TokenBeginAnnotation",
                "TokenEndAnnotation",
                "CharacterOffsetBeginAnnotation",
                "CharacterOffsetEndAnnotation"]

    def annotate(self, ann):
        """
        @ann: is a protobuf annotation object.
        Actually populate @ann with tokens.
        """
        buf, beg_idx, end_idx = ann.text.lower(), 0, 0
        for i, word in enumerate(self.tokenize(ann.text)):
            token = ann.sentencelessToken.add()
            # These are the bare minimum required for the TokenAnnotation
            token.word = word
            token.tokenBeginIndex = i
            token.tokenEndIndex = i + 1

            # Seek into the text until you can find this word.
            try:
                # Try to update beginning index
                beg_idx = buf.index(word.lower(), beg_idx)
            except ValueError:
                # Give up: keep the current offset as a best guess.
                pass
            end_idx = beg_idx + len(word)

            token.beginChar = beg_idx
            token.endChar = end_idx

            beg_idx, end_idx = end_idx, end_idx


annotator = HappyFunTokenizer()
# Calling .start() will launch the annotator as a service running on
# port 8432 by default.
annotator.start()

# annotator.properties contains all the right properties for
# Stanford CoreNLP to use this annotator.
with corenlp.CoreNLPClient(properties=annotator.properties, annotators="happyfun ssplit pos".split()) as client:
    ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

    tokens = [t.word for t in ann.sentence[0].token]
    print(tokens)


