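zensols.ngramdb Python package - PyPI

Create a SQLite database of n-grams.

zensols.ngramdb project description

Create a SQLite database from the Google n-grams data set

Creates a SQLite database from Google's one million n-gram data sets. This project downloads the corpus and creates a SQLite database file that contains its contents. It also provides a simple API for n-gram lookups.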

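Installation

To make life easier, install GNU Make. If you do not have it, you will need to follow the steps given in the makefile by hand.
Download the one million n-gram data sets: make download. This should take a few minutes on a good Internet connection.
Uncompress the downloaded files: make uncompress.
Create and load the SQLite database from the downloaded corpus: make load. Depending on processor speed, this should take about an hour, and it creates a file at data/eng-1gram.db that takes 18G on disk.
Install from the command line, either from source (make install) or from pip.

To use the program from the command line (rather than the API), create a file at ~/.ngramdbrc with the following contents: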

[default]

[ngram_db]
data_dir=${HOME}/path/to/eng-1gram.db
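To check such a file before running the tool, it can be read with Python's standard configparser. The following is only a minimal sketch and not how zensols.actioncli loads its configuration: the helper name read_data_dir and the choice to supply HOME as a parser default are assumptions made for illustration.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Sample contents mirroring the ~/.ngramdbrc file shown above.
SAMPLE = """\
[default]

[ngram_db]
data_dir=${HOME}/path/to/eng-1gram.db
"""


def read_data_dir(text: str, home: str) -> str:
    """Return the data_dir option with ${HOME} expanded (hypothetical helper)."""
    # ExtendedInterpolation resolves ${name} against the current section and
    # the parser defaults, so HOME is passed in as a default value here.
    parser = ConfigParser(defaults={'HOME': home},
                          interpolation=ExtendedInterpolation())
    parser.read_string(text)
    return parser['ngram_db']['data_dir']


if __name__ == '__main__':
    print(read_data_dir(SAMPLE, '/home/user'))
```

In real use the home argument would likely come from os.path.expanduser('~'), and the text from reading the file itself.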

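Data Size

As mentioned, the SQLite database file takes 18G on disk. This is because the occurrences span decades. In many cases the older n-grams are not needed and, depending on the size of the data, queries might take a while. The data can be reduced with SQL in any SQLite interface (e.g. sqlite3 on the macOS command line), for example by deleting all n-grams recorded from publications before 1990.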
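The pruning step could be scripted along the following lines. This is a sketch under stated assumptions, not the project's own SQL: the table name ngram and the columns grams, yr and match_count are taken from the Pandas query later in this document, the helper prune_before is hypothetical, and the toy in-memory table stands in for the real 18G file.

```python
import sqlite3


def prune_before(conn: sqlite3.Connection, year: int) -> int:
    """Delete rows recorded before ``year`` and return how many were removed."""
    cur = conn.execute('delete from ngram where yr < ?', (year,))
    removed = cur.rowcount
    conn.commit()
    # VACUUM rebuilds the database so freed pages are released;
    # it has to run outside of an open transaction, hence the commit above.
    conn.execute('vacuum')
    return removed


if __name__ == '__main__':
    # Toy data with an assumed schema; the real database is far larger.
    conn = sqlite3.connect(':memory:')
    conn.execute('create table ngram (grams text, yr int, match_count int)')
    conn.executemany('insert into ngram values (?, ?, ?)',
                     [('the', 1985, 10), ('the', 1990, 20), ('cat', 2005, 3)])
    print(prune_before(conn, 1990))
```

On the real file the vacuum step needs free disk space for a temporary copy, so it is worth checking available space first.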

Usage

This project can be used from the command line or as an API.

Command Line

From the command line:

% ngramdb query -g the -y 2005
631362690 0.56880%

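This gives the number of occurrences since the given year and the percentage of all words in the corpus that the unigram accounts for.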
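The two numbers can be thought of as a count and a share of the corpus. The sketch below computes both directly with sqlite3 on toy data. It is an illustration and not the tool's implementation: the function count_and_share is hypothetical, the schema is assumed from the Pandas query in this document, and restricting the denominator to the same years is also an assumption.

```python
import sqlite3


def count_and_share(conn, gram, since_year):
    """Return (occurrences of gram, percent of all tokens) from since_year on."""
    occ = conn.execute(
        'select coalesce(sum(match_count), 0) from ngram '
        'where grams = ? and yr >= ?', (gram, since_year)).fetchone()[0]
    total = conn.execute(
        'select coalesce(sum(match_count), 0) from ngram where yr >= ?',
        (since_year,)).fetchone()[0]
    # Guard against an empty selection to avoid dividing by zero.
    pct = 100 * occ / total if total else 0.0
    return occ, pct


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    conn.execute('create table ngram (grams text, yr int, match_count int)')
    conn.executemany('insert into ngram values (?, ?, ?)',
                     [('the', 2004, 50), ('the', 2005, 30), ('cat', 2005, 10),
                      ('the', 2006, 10), ('dog', 2006, 50)])
    occ, pct = count_and_share(conn, 'the', 2005)
    print(f'{occ} {pct:.5f}%')
```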

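API

As described in the Installation section, create the ~/.ngramdbrc configuration file. Also note that the API is set up to be easy to use from other Python projects that use the zensols.actioncli configuration API.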

from zensols.ngramdb import AppConfig, Query

conf = AppConfig.instance().app_config
query = Query(conf)
stash = query.stash
n_occurs = stash['The']
print(f'{n_occurs} {100 * n_occurs / len(stash):.5f}%')
#=> 631362690 0.56880%

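Data Analysis

Stash access works well for specific use cases where only part of the corpus is needed. However, the database was also created in a queryable format so that it can be used for data analysis. Below is an example of how to use Pandas directly on the created SQLite file to learn about the corpus: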

from zensols.actioncli.time import time
import pandas as pd
import sqlite3 as s

# "connect" to the SQLite database file
db = '/path/to/data/directory/eng-1gram.db'
conn = s.connect(db)

# create a dataframe with all entries on or after 1990
sql = 'select grams, match_count as cnt from ngram where yr >= 1990'
with time('{rc} rows read'):
    df = pd.read_sql_query(sql, conn)
    rc = len(df)
#=> 28150989 rows read finished in 60.8s

# create a data frame with a ngram text and number of match counts per row
with time('groupby of {rc} rows'):
    dfg = df.groupby(['grams'], as_index=False).agg({'cnt': 'sum'})
    rc = len(df)
#=> group of 28150989 rows finished in 7.6s

# get the number of counts on 'the'
dfg[dfg['grams'] == 'the']
#=>         grams        cnt
#=> 2819462   the  621594750

# all token occurrences
all_cnt = df['cnt'].sum()
all_cnt
#=> 13269089201

# calculate the population of a few words
for word in 'the The . cat dog phone iPhone'.split():
    occ = dfg[dfg['grams'] == word].cnt.item()
    pop = occ / all_cnt
    print(f'word \'{word}\' found {occ} times, which is {pop*100:.5f}% of the corpus')
#=> word 'the' found 621594750 times, which is 4.68453% of the corpus
#=> word 'The' found 77576794 times, which is 0.58464% of the corpus
#=> word '.' found 641792317 times, which is 4.83675% of the corpus
#=> word 'cat' found 247075 times, which is 0.00186% of the corpus
#=> word 'dog' found 453789 times, which is 0.00342% of the corpus
#=> word 'phone' found 522190 times, which is 0.00394% of the corpus
#=> word 'iPhone' found 178 times, which is 0.00000% of the corpus

with time('pickled data frame'):
    df.to_pickle('df.dat')
#=> pickled data frame finished in 8.0s

with time('write to csv'):
    df.to_csv('df.csv')
#=> write to csv finished in 44.3s

with time('read data from'):
    df = pd.read_pickle('df.dat')
#=> read data from finished in 2.4s
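When only the per-word totals are needed, the grouping could also be pushed into SQLite so that the raw rows never have to be loaded into memory. The following is a sketch of that alternative, not part of the project: the function counts_since is hypothetical, and it is shown on a toy in-memory table that uses the same table and column names as the query above.

```python
import sqlite3


def counts_since(conn, year):
    """Map each n-gram to its summed match count for rows with yr >= year."""
    sql = ('select grams, sum(match_count) as cnt from ngram '
           'where yr >= ? group by grams')
    # Each result row is a (grams, cnt) pair, which dict() accepts directly.
    return dict(conn.execute(sql, (year,)))


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    conn.execute('create table ngram (grams text, yr int, match_count int)')
    conn.executemany('insert into ngram values (?, ?, ?)',
                     [('the', 1985, 10), ('the', 1990, 20),
                      ('the', 2000, 10), ('cat', 2005, 3)])
    print(counts_since(conn, 1990))
```

The trade-off is that the year filter is fixed inside the query, whereas the data frame approach keeps the raw rows available for further slicing.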

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.ngramdb

Binaries are also available on pypi.

Changelog

An extensive changelog is available here.

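License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.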

zensols.ngramdb 0.0.2
