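Python legomena package - PyPI

Tool for exploring types, tokens, and n-legomena relationships in text.

Detailed description of the legomena Python project

Legomena

Tool for exploring types, tokens, and n-legomena relationships in text. Based on the Davis 2019 [1] research paper.

Installation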

pip install legomena

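Data Sources

This package can be driven by any data source, but the author has tested two: the Natural Language ToolKit and {a4}. The former is the gold standard for Python NLP applications, but its Gutenberg corpus contains a meager 18 books. The latter contains the complete Gutenberg corpus of more than 55,000 books, already tokenized and counted. NOTE: Where the two datasets overlap, they do not agree on exact type/token counts because their methodologies differ. This package takes type/token counts as raw data and is therefore methodology-agnostic.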

^{pr2}$

Basic Usage

A demo notebook can be found here. Unit tests can be found here.

# basic properties
corpus.tokens  # list of tokens
corpus.types  # list of types
corpus.fdist  # word frequency distribution dataframe
corpus.WFD  # alias for corpus.fdist
corpus.M  # number of tokens
corpus.N  # number of types
corpus.k  # n-legomena vector
corpus.k[n]  # n-legomena count (n=1 -> number of hapaxes)
corpus.hapax  # list of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis  # list of dis legomena, alias for corpus.nlegomena(2)
corpus.tris  # list of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis  # list of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis  # list of pentakis legomena, alias for corpus.nlegomena(5)

# advanced properties
corpus.options  # tuple of optional settings
corpus.resolution  # number of samples to take to calculate TTR curve
corpus.dimension  # n-legomena vector length to pre-compute (max 6)
corpus.seed  # random number seed for sampling TTR data
corpus.TTR  # type-token ratio dataframe

# basic functions
corpus.nlegomena(n:int)  # list of types occurring exactly n times
corpus.sample(m:int)  # samples m tokens from corpus *without replacement*
corpus.sample(x:float)  # samples proportion x of corpus *without replacement*
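
To make the quantities listed above concrete, here is a minimal standard-library sketch of how tokens, types, and n-legomena counts relate. It does not use the legomena package itself; the helper names `nlegomena_counts` and `nlegomena` are hypothetical and only mirror the meaning of `corpus.M`, `corpus.N`, `corpus.k`, and `corpus.nlegomena(n)` as described in the comments above. Leaving index 0 of the vector at zero is a convention of this sketch, chosen so that `k[n]` lines up with the n-legomena count.

```python
from collections import Counter


def nlegomena_counts(tokens, dimension=6):
    """Return (M, N, k) for a token list.

    M is the number of tokens, N the number of types, and k[n] the number
    of types occurring exactly n times, for 1 <= n < dimension.
    Illustrative helper only, not part of the legomena API.
    """
    fdist = Counter(tokens)      # word frequency distribution
    M = sum(fdist.values())      # number of tokens
    N = len(fdist)               # number of types
    k = [0] * dimension          # index 0 is left at zero in this sketch
    for count in fdist.values():
        if count < dimension:
            k[count] += 1
    return M, N, k


def nlegomena(tokens, n):
    """Return the sorted list of types occurring exactly n times."""
    fdist = Counter(tokens)
    return sorted(t for t, c in fdist.items() if c == n)


tokens = "the cat sat on the mat and the dog sat too".split()
M, N, k = nlegomena_counts(tokens)
hapaxes = nlegomena(tokens, 1)   # types seen exactly once
```

In this toy sentence, "the" occurs three times and "sat" twice, so both are excluded from `hapaxes`.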

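Type-Token Models

There are a variety of models in the literature that predict the number of types as a function of the number of tokens, the most famous being Heaps' Law. Below are some implementations, overlaid on the Corpus class.

# three models
model = HeapsModel()  # Heaps' Law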
model = InfSeriesModel(corpus)  # Infinite Series Model [1]
model = LogModel()  # Logarithmic Model [1]

# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)

# model parameters
model.params

# model predictions
predictions = model.predict(m_tokens)

# log model only
dim = corpus.dimension
predicted_k = model.predict_k(m_tokens, dim)
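
As a rough illustration of what fitting a type-token model involves, the sketch below builds a type-token curve by sampling without replacement and fits Heaps' Law, N = K * M^B, by ordinary least squares on the log-log data. It uses only the standard library. The function names `ttr_curve`, `fit_heaps`, and `predict_heaps` are hypothetical, and the log-log regression is an assumption made for this example, not necessarily how `HeapsModel` fits its parameters.

```python
import math
import random


def ttr_curve(tokens, resolution=20, seed=0):
    """Return (m_tokens, n_types) sampled at `resolution` corpus sizes.

    A prefix of a shuffled copy is a sample without replacement, so each
    point counts the distinct types among m randomly chosen tokens.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    m_tokens, n_types = [], []
    for i in range(1, resolution + 1):
        m = len(shuffled) * i // resolution
        if m == 0:
            continue  # corpus too small for this step
        m_tokens.append(m)
        n_types.append(len(set(shuffled[:m])))
    return m_tokens, n_types


def fit_heaps(m_tokens, n_types):
    """Fit log N = log K + B * log M by least squares; return (K, B)."""
    xs = [math.log(m) for m in m_tokens]
    ys = [math.log(n) for n in n_types]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    B = sxy / sxx
    K = math.exp(y_mean - B * x_mean)
    return K, B


def predict_heaps(K, B, m_tokens):
    """Predict the number of types for each token count."""
    return [K * m ** B for m in m_tokens]


# Synthetic data that follows Heaps' Law exactly, so the fit recovers K and B.
synthetic_m = [10, 100, 1000, 10000]
synthetic_n = [2 * m ** 0.5 for m in synthetic_m]
K, B = fit_heaps(synthetic_m, synthetic_n)
```

On real text the fitted exponent B typically falls below 1, reflecting that new types appear more slowly as the sample grows.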

Demo App

Check out the demo app to explore the type-token and n-legomena counts of a few Project Gutenberg books.

legomena 1.2.0
