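A natural language processing toolkit that performs tokenization, sentence splitting, lemmatization, tagging, and parsing for more than 60 languages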

nlpcube: detailed Python project description



News

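[April 15, 2019] We are releasing version 1.1 models; see the full list of supported languages below. The 1.0 and 1.1 models are trained on the same UD 2.2 corpus, but the 1.1 models do not use vector embeddings, which reduces the disk space and time needed to use them. Accuracy is slightly higher for some languages and slightly lower for others. By default, NLP-Cube uses the latest (at this time, 1.1) models.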

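To use the older 1.0 models, specify the version in the load call: cube.load("en", 1.0) (en for English, or any other language code). This downloads the model if it is not already on disk and uses that specific model version. The same applies to any language/version combination you want to use.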

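If you already have NLP-Cube installed and want to use the newer 1.1 models, call either cube.load("en", 1.1) or cube.load("en", "latest") to download them automatically. After that, calling cube.load("en") without a version number automatically uses the latest version available on disk.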
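
The three call forms described above can be wrapped in a small helper. This is only a sketch: load_model and RecordingCube are hypothetical names introduced for illustration, and RecordingCube merely records the arguments it receives so the example runs without nlpcube installed and without downloading any model.

```python
def load_model(cube, language, version=None):
    """Load `language` on a Cube-like object, optionally pinning a model version.

    version=None      -> cube.load(language)            (newest version on disk)
    version=1.0       -> cube.load(language, 1.0)       (the older models)
    version="latest"  -> cube.load(language, "latest")  (fetch the newest models)
    """
    if version is None:
        cube.load(language)
    else:
        cube.load(language, version)
    return cube


class RecordingCube:
    """Hypothetical stand-in for cube.api.Cube: it only records load() calls."""

    def __init__(self):
        self.calls = []

    def load(self, *args):
        self.calls.append(args)


demo = RecordingCube()
load_model(demo, "en", 1.0)        # pin the older 1.0 models
load_model(demo, "en", "latest")   # ask for the newest models
load_model(demo, "en")             # use whatever is newest on disk
print(demo.calls)
```

With nlpcube installed, passing a Cube instance created via cube.api in place of RecordingCube should issue the same load calls against the real loader.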


NLP-Cube

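NLP-Cube is an open-source natural language processing framework that supports the languages included in the UD Treebanks (all available languages are listed below). Use NLP-Cube if you need: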

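  • Sentence segmentation
  • Tokenization
  • POS tagging (both language-independent (UPOS) and language-dependent (XPOS and ATTRs))
  • Lemmatization
  • Dependency parsing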

Example input: "This is a test.", output is:

1       This    this    PRON    DT      Number=Sing|PronType=Dem        4       nsubj   _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _
3       a       a       DET     DT      Definite=Ind|PronType=Art       4       det     _
4       test    test    NOUN    NN      Number=Sing     0       root    SpaceAfter=No
5       .       .       PUNCT   .       _       4       punct   SpaceAfter=No
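
Output in this shape is easy to post-process. The sketch below splits each row on whitespace and names the first eight columns after the token attributes listed in the API section; that column-to-name mapping is an assumption inferred from the sample values, and any remaining columns are kept unnamed under "extra". The helper names parse_rows and parse_attrs are hypothetical.

```python
SAMPLE_OUTPUT = """\
1 This this PRON DT Number=Sing|PronType=Dem 4 nsubj _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _
3 a a DET DT Definite=Ind|PronType=Art 4 det _
4 test test NOUN NN Number=Sing 0 root SpaceAfter=No
5 . . PUNCT . _ 4 punct SpaceAfter=No
"""

# Assumed names for the first eight columns (inferred from the sample).
FIELDS = ("index", "word", "lemma", "upos", "xpos", "attrs", "head", "label")


def parse_rows(text):
    """Turn whitespace-separated output rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            continue  # skip blank lines between sentences
        row = dict(zip(FIELDS, columns))
        row["index"] = int(row["index"])
        row["head"] = int(row["head"])
        row["extra"] = columns[len(FIELDS):]  # trailing columns, left unnamed
        rows.append(row)
    return rows


def parse_attrs(attrs):
    """Split 'Key=Value|Key=Value' into a dict; '_' means no attributes."""
    if attrs == "_":
        return {}
    return dict(pair.split("=", 1) for pair in attrs.split("|"))


rows = parse_rows(SAMPLE_OUTPUT)
root = next(row for row in rows if row["head"] == 0)
dependents = [row["word"] for row in rows if row["head"] == root["index"]]
print(root["word"], dependents)
print(parse_attrs(rows[0]["attrs"]))
```

A head value of 0 marks the root of the sentence, so in this sample every other token attaches directly to "test".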

If you just want to run it, here is how to set it up and use NLP-Cube in a few lines: Quick Start Tutorial.

For advanced users who want to create and train their own models, see the advanced tutorials in examples/, starting with how to locally install NLP-Cube.

Simple (pip) installation / update

Install (or update) NLP-Cube with the following command:

pip3 install -U nlpcube

API usage

To use NLP-Cube programmatically (in Python), follow this tutorial. In summary:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en")                 # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
sentences=cube(text)            # call with your own text (string) to obtain the annotations

The sentences object now contains the annotated text, one sentence at a time. To print the POS of the third word in the first sentence, run:

print(sentences[0][2].upos) # [0] is the first sentence and [2] is the third word

Each token object has the following attributes: index, word, lemma, upos, xpos, attrs, head, label, deps, space_after. For details about each attribute, see the standard CoNLL format.
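
As an illustration of working with these attributes, the sketch below walks a list of sentences and reads the ten attributes by name. To stay runnable without nlpcube or a model download, it uses SimpleNamespace stand-ins instead of real token objects; make_token, token_to_row, and words_with_upos are hypothetical helpers, and the deps and space_after values are placeholders because the text above does not show what they contain.

```python
from types import SimpleNamespace

ATTRIBUTES = ("index", "word", "lemma", "upos", "xpos",
              "attrs", "head", "label", "deps", "space_after")


def token_to_row(token):
    """Join the ten token attributes into one tab-separated row."""
    return "\t".join(str(getattr(token, name)) for name in ATTRIBUTES)


def words_with_upos(sentences):
    """Return, for each sentence, a list of (word, upos) pairs."""
    return [[(token.word, token.upos) for token in sentence]
            for sentence in sentences]


def make_token(index, word, lemma, upos, xpos, attrs, head, label):
    # Stand-in for a real token object; deps and space_after are placeholders.
    return SimpleNamespace(index=index, word=word, lemma=lemma, upos=upos,
                           xpos=xpos, attrs=attrs, head=head, label=label,
                           deps="_", space_after="_")


# One stand-in sentence mirroring the sample output shown earlier.
sentences = [[
    make_token(1, "This", "this", "PRON", "DT", "Number=Sing|PronType=Dem", 4, "nsubj"),
    make_token(2, "is", "be", "AUX", "VBZ",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", 4, "cop"),
    make_token(3, "a", "a", "DET", "DT", "Definite=Ind|PronType=Art", 4, "det"),
    make_token(4, "test", "test", "NOUN", "NN", "Number=Sing", 0, "root"),
    make_token(5, ".", ".", "PUNCT", ".", "_", 4, "punct"),
]]

print(sentences[0][2].upos)            # third word of the first sentence
print(words_with_upos(sentences)[0])
print(token_to_row(sentences[0][3]))
```

The same two helpers should work on the object returned by cube(text), provided its tokens expose the attribute names listed above.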

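Web server usage

To use NLP-Cube as a web service, you need to locally install NLP-Cube and then start the server.

For example, the following command starts the server and preloads the languages en, fr, and de: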

cd cube
python3 webserver.py --port 8080 --lang=en --lang=fr --lang=de

To test it, open the following link (copy the link address, since it points to a local address and port).

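Citing

If you use NLP-Cube in your research, we would be grateful if you cited the following paper:

Boroș, Tiberiu; Dumitrescu, Stefan Daniel; Burtica, Ruxandra. NLP-Cube: End-to-End Raw Text Processing With Neural Networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 171-179, Brussels, Belgium, October 2018. Association for Computational Linguistics.

Or, in BibTeX format: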

@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}

Languages and performance

Results are reported against the test files for each language (available in the UD 2.2 corpus) using the 2018 CoNLL evaluation script. Please see more info about what each metric represents here.

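Notes:

  • Version 1.1 models no longer need the large external vector embedding files. This makes loading the 1.1 models faster and less memory-intensive.
  • All reported results are end-to-end. (For example, we test tagging accuracy on our own segmented text, since this is the real use case; CoNLL results are mostly reported on "gold" or pre-segmented text, which leads to higher accuracy for the tagger, parser, etc.)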

Language             Model           Token   Sentence  UPOS    XPOS    AllTags  Lemmas  UAS     LAS
Afrikaans            af-1.0          99.97   99.65     97.28   93.0    91.53    96.42   87.61   83.96
Afrikaans            af-1.1          99.99   99.29     96.72   92.29   90.87    96.48   87.32   83.31
Ancient Greek        grc-1.0         100.0   18.13     94.92   95.32   84.17    86.59   72.44   67.73
Ancient Greek        grc-1.1         100.0   17.61     96.87   97.35   88.36    88.41   73.4    69.36
Arabic               ar-1.0          99.98   61.05     73.42   69.75   68.12    41.26   53.94   50.31
Arabic               ar-1.1          99.99   60.53     73.27   68.98   65.95    40.87   53.06   49.45
Armenian             hy-1.0          97.34   87.52     74.13   96.76   41.51    60.58   11.41   1.7
Basque               eu-1.0          99.97   99.83     94.93   99.97   87.24    90.75   85.49   81.35
Basque               eu-1.1          99.97   99.75     95.0    99.97   88.14    90.74   85.1    80.91
Bulgarian            bg-1.0          99.94   92.8      98.51   95.6    93.99    91.59   92.38   88.84
Bulgarian            bg-1.1          99.93   93.36     98.36   95.91   94.46    92.02   92.39   88.76
Buryat               bxr-1.0         83.26   31.52     38.08   83.26   16.74    16.05   14.44   6.5
Catalan              ca-1.0          99.98   99.27     98.17   98.23   96.63    97.83   92.33   89.95
Catalan              ca-1.1          99.99   99.51     98.2    98.22   96.72    97.8    92.14   89.6
Chinese              zh-1.0          93.03   99.1      88.22   88.15   86.91    92.74   73.43   69.52
Chinese              zh-1.1          92.34   99.1      86.75   86.66   85.35    92.05   71.0    67.04
Croatian             hr-1.0          99.92   95.56     97.66   99.92   89.49    93.85   90.61   85.77
Croatian             hr-1.1          99.95   95.84     97.56   99.95   89.49    94.01   89.95   84.97
Czech                cs-1.0          99.99   83.79     98.75   95.54   93.61    95.79   90.67   88.46
Czech                cs-1.1          99.99   84.19     98.54   95.33   94.09    95.7    90.72   88.52
Danish               da-1.0          99.85   91.79     96.79   99.85   94.29    96.53   85.93   83.05
Danish               da-1.1          99.82   92.64     96.52   99.82   94.39    96.21   85.09   81.83
Dutch                nl-1.0          99.89   90.75     95.49   93.84   91.73    95.72   89.48   86.1
Dutch                nl-1.1          99.91   90.89     95.62   93.92   92.58    95.87   89.76   86.4
English              en-1.0          99.25   72.8      95.34   94.83   92.48    95.62   84.7    81.93
English              en-1.1          99.2    70.94     94.4    93.93   91.04    95.18   83.3    80.32
Estonian             et-1.0          99.9    91.81     96.02   97.18   91.35    93.26   86.04   82.29
Estonian             et-1.1          99.91   91.92     96.8    97.92   93.17    93.9    86.13   82.91
Finnish              fi-1.0          99.7    88.73     95.45   96.44   90.29    83.69   87.18   83.89
Finnish              fi-1.1          99.65   89.23     96.22   97.07   91.8     84.02   87.83   84.96
French               fr-1.0          99.68   94.2      92.61   95.46   90.79    93.08   84.96   80.91
French               fr-1.1          99.67   95.31     92.51   95.45   90.8     93.0    83.88   80.16
Galician             gl-1.0          99.89   97.16     83.01   82.51   81.58    82.95   65.69   61.08
Galician             gl-1.1          99.91   97.28     82.6    82.12   80.96    82.71   62.65   58.2
German               de-1.0          99.7    81.19     91.38   94.26   80.37    75.8    79.6    74.35
German               de-1.1          99.77   81.99     90.47   93.82   79.79    75.46   79.3    73.87
Gothic               got-1.0         100.0   21.59     93.1    93.8    80.58    83.74   67.23   59.67
Greek                el-1.0          99.88   89.46     93.7    93.54   87.14    88.92   85.63   82.05
Greek                el-1.1          99.88   89.53     93.28   93.24   87.95    88.65   84.51   79.88
Hebrew               he-1.0          99.93   99.69     54.13   54.17   51.49    54.13   34.84   32.29
Hebrew               he-1.1          99.94   100.0     52.78   52.78   49.9     53.45   32.13   29.42
Hindi                hi-1.0          99.98   98.84     97.16   96.43   90.29    97.48   94.66   91.26
Hindi                hi-1.1          100.0   99.11     96.81   96.28   89.74    97.4    94.56   90.96
Hungarian            hu-1.0          99.8    94.18     94.52   99.8    86.22    91.07   81.57   75.95
Hungarian            hu-1.1          99.88   97.77     93.11   99.88   86.79    91.18   77.89   70.94
Indonesian           id-1.0          99.95   93.59     93.13   94.15   87.65    82.19   85.01   78.18
Indonesian           id-1.1          100.0   94.58     92.95   92.81   86.27    81.51   84.73   77.99
Irish                ga-1.0          99.56   95.38     90.95   90.07   74.1     87.51   76.32   64.74
Italian              it-1.0          99.89   98.14     86.86   86.67   84.97    87.03   78.3    74.59
Italian              it-1.1          99.92   99.07     86.58   86.4    84.53    86.75   76.38   72.35
Japanese             ja-1.0          92.73   94.92     90.05   92.73   90.02    91.75   80.47   77.97
Japanese             ja-1.1          92.42   94.92     90.28   92.42   90.28    91.66   79.94   77.79
Kazakh               kk-1.0          92.26   75.57     57.38   55.75   22.12    21.35   39.55   19.48
Korean               ko-1.0          99.87   93.9      94.66   86.92   83.81    38.7    85.52   81.39
Korean               ko-1.1          99.88   94.23     94.61   88.41   85.27    38.68   85.16   80.89
Kurmanji             kmr-1.0         89.92   88.86     53.66   52.52   25.96    53.94   12.06   5.53
Latin                la-1.0          99.97   92.5      97.95   93.75   91.76    96.9    89.2    86.29
Latin                la-1.1          99.99   92.75     98.22   94.03   92.16    97.18   89.19   86.58
Latvian              lv-1.0          99.66   96.35     93.43   82.52   79.99    89.47   83.04   77.98
North Sami           sme-1.0         99.75   98.79     86.07   87.38   71.34    80.9    66.54   56.93
Norwegian            no_bokmaal-1.0  99.92   90.93     84.24   99.92   73.68    71.68   78.24   70.83
Norwegian            no_bokmaal-1.1  99.92   90.32     84.69   99.92   74.84    71.47   77.71   70.63
Norwegian            no_nynorsk-1.0  99.96   91.08     97.33   99.96   93.87    85.82   90.33   88.02
Norwegian            no_nynorsk-1.1  99.96   92.18     97.47   99.96   94.75    86.07   90.23   87.98
Old Church Slavonic  cu-1.0          100.0   28.99     92.88   93.09   81.85    83.16   72.18   65.43
Persian              fa-1.0          100.0   97.91     96.34   96.17   95.51    89.4    88.35   85.08
Persian              fa-1.1          100.0   99.0      95.92   95.78   95.05    89.32   87.43   83.38
Portuguese           pt-1.0          99.69   87.88     85.02   88.39   81.35    86.23   76.38   72.99
Portuguese           pt-1.1          99.75   88.1      84.39   88.46   79.79    85.85   75.11   71.61
Romanian             ro-1.0          99.74   95.56     97.42   96.59   95.49    96.91   90.38   85.23
Romanian             ro-1.1          99.71   95.42     96.96   96.32   94.98    96.57   90.14   85.06
Russian              ru-1.0          99.71   98.79     98.4    99.71   95.55    93.89   92.7    90.97
Russian              ru-1.1          99.73   98.5      98.48   99.73   95.37    93.8    92.88   90.99
Serbian              sr-1.0          99.97   92.61     97.61   99.97   91.54    92.93   90.89   86.92
Serbian              sr-1.1          99.97   92.0      97.88   99.97   92.57    93.31   90.96   87.04
Slovak               sk-1.0          99.97   86.0      95.82   82.3    78.43    90.35   88.83   85.69
Slovak               sk-1.1          99.95   86.67     95.33   81.01   76.98    89.87   87.64   83.84
Slovenian            sl-1.0          99.91   97.51     97.85   92.52   91.27    96.35   91.4    89.38
Slovenian            sl-1.1          99.87   97.64     97.62   93.29   90.99    96.36   91.46   89.19
Spanish              es-1.0          99.98   98.32     98.0    98.0    96.62    98.05   90.53   88.27
Spanish              es-1.1          99.98   98.4      98.01   98.0    96.6     97.99   90.51   88.16
Swedish              sv-1.0          99.94   92.54     97.21   95.18   92.88    97.06   88.09   84.74
Swedish              sv-1.1          99.36   91.22     92.74   0.0     0.0      89.37   78.14   71.86
Turkish              tr-1.0          99.89   97.4      90.37   89.56   81.59    87.4    65.22   58.26
Turkish              tr-1.1          99.88   96.79     90.79   90.17   83.26    87.84   64.69   57.07
Ukrainian            uk-1.0          99.65   93.96     96.31   88.23   86.0     92.08   86.25   82.96
Ukrainian            uk-1.1          99.76   93.58     96.0    88.17   85.39    92.28   84.9    81.04
Upper Sorbian        hsb-1.0         98.59   69.15     59.61   98.59   37.96    22.33   11.11   3.35
Urdu                 ur-1.0          100.0   98.6      93.55   91.69   77.41    97.33   87.86   81.99
Urdu                 ur-1.1          100.0   98.6      92.85   91.02   77.18    97.2    87.12   80.83
Uyghur               ug-1.0          99.91   83.83     87.85   91.58   73.93    90.17   74.36   60.5
Uyghur               ug-1.1          99.7    84.18     88.07   90.38   75.28    92.28   75.16   62.13
Vietnamese           vi-1.0          87.2    92.88     78.35   76.43   76.18    81.47   51.59   45.49
Vietnamese           vi-1.1          86.87   92.51     76.72   74.57   72.27    81.31   50.29   43.76
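
Since accuracy moves in both directions between model versions, it can help to compare the two versions for the language you care about before pinning one. The sketch below does this for LAS using a few (UAS, LAS) pairs copied from the results above; the SCORES layout and the las_change helper are hypothetical and not part of NLP-Cube.

```python
# (language code, model version) -> (UAS, LAS), copied from the results table.
SCORES = {
    ("en", "1.0"): (84.7, 81.93),
    ("en", "1.1"): (83.3, 80.32),
    ("et", "1.0"): (86.04, 82.29),
    ("et", "1.1"): (86.13, 82.91),
    ("fi", "1.0"): (87.18, 83.89),
    ("fi", "1.1"): (87.83, 84.96),
}


def las_change(scores, language, old="1.0", new="1.1"):
    """Return LAS(new) - LAS(old) for one language, rounded to 2 decimals."""
    old_las = scores[(language, old)][1]
    new_las = scores[(language, new)][1]
    return round(new_las - old_las, 2)


def preferred_version(scores, language, old="1.0", new="1.1"):
    """Pick the version with the higher LAS; ties go to the newer one."""
    return new if las_change(scores, language, old, new) >= 0 else old


for language in ("en", "et", "fi"):
    print(language, las_change(SCORES, language), preferred_version(SCORES, language))
```

LAS is only one of the reported metrics; the 1.1 models also avoid the external embedding files, so a small drop in LAS may still be an acceptable trade for faster loading.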
