Literary Language Processing: corpora, models, and tools for the digital humanities
A detailed description of the llp Python project.
llp
Literary Language Processing (llp): corpora, models, and tools for the digital humanities.
Quick start
- Install:
pip install llp # install with pip in terminal
- Download an existing corpus…
llp status # show which corpora/data are available
llp download ECCO_TCP # download a corpus
…or import your own:
# use the "import" command:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         the metadata column holding each file's .txt filename
llp import -path_txt mycorpus/txts -path_metadata mycorpus/meta.xls -col_fn filename
…or start a new one:
llp create # then follow the interactive prompt
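For the `llp import` step above, the metadata file just needs one row per text plus a column (named `filename` here, matching `-col_fn`) pointing at each .txt file. A minimal sketch of writing such a file with the standard library; the other column names and values are purely illustrative:

```python
import csv

# Only the "filename" column is required (it is what -col_fn points at);
# any other columns (author, year, ...) become queryable corpus metadata.
rows = [
    {'filename': 'text001.txt', 'author': 'Burney, Frances', 'year': 1778},
    {'filename': 'text002.txt', 'author': 'Fielding, Henry', 'year': 1749},
]

with open('meta.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['filename', 'author', 'year'])
    writer.writeheader()
    writer.writerows(rows)
```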
- Then load the corpus in Python:
import llp                     # import llp as a python module
corpus = llp.load('ECCO_TCP')  # load the corpus by name or ID
…play with the convenient corpus objects…
df = corpus.metadata                   # get corpus metadata as a pandas dataframe
smpl = df.query('1740 < year < 1780')  # do a quick query on the metadata
texts = corpus.texts()                 # get a convenient Text object for each text
texts_smpl = corpus.texts(smpl.id)     # get Text objects for a specific list of IDs
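The `df.query(...)` call above is ordinary pandas, so it can be tried out on any toy metadata frame; a self-contained sketch with illustrative columns and values:

```python
import pandas as pd

# Toy metadata frame standing in for corpus.metadata
df = pd.DataFrame({
    'id':   ['t1', 't2', 't3'],
    'year': [1735, 1760, 1790],
})

smpl = df.query('1740 < year < 1780')  # same query syntax as above
ids = list(smpl.id)                    # IDs you could pass to corpus.texts(...)
```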
…and with Text objects:
for text in texts_smpl:         # loop over Text objects
    text_meta = text.meta       # get text metadata as a dictionary
    author = text.author        # get common metadata as attributes
    txt = text.txt              # get plain text as a string
    xml = text.xml              # get xml as a string
    tokens = text.tokens        # get list of words (incl. punctuation)
    words = text.words          # get list of words (excl. punctuation)
    counts = text.word_counts   # get word counts as a dictionary (from JSON if saved)
    ocracc = text.ocr_accuracy  # get an estimate of OCR accuracy
    spacy_obj = text.spacy      # get a spacy text object
    nltk_obj = text.nltk        # get an nltk text object
    blob_obj = text.blob        # get a textblob object
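The `tokens` vs `words` distinction above (with and without punctuation) can be sketched with a crude regex tokenizer; llp's actual tokenization rules may well differ:

```python
import re

txt = "A Young Lady's Entrance into the World; a novel."

# Crude tokenizer: word-like runs (incl. apostrophes), punctuation as its own token
tokens = re.findall(r"[\w']+|[^\w\s]", txt)           # incl. punctuation
words = [t for t in tokens if re.match(r"[\w']", t)]  # excl. punctuation
```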
Corpus magic
Each corpus object can generate data about itself:
corpus.save_metadata()    # save metadata from xml files (if possible)
corpus.save_plain_text()  # save plain text from xml (if possible)
corpus.save_mfw()         # save a list of all words in the corpus with their total counts
corpus.save_freqs()       # save word counts per text as JSON files
corpus.save_dtm()         # save a document-term matrix with the top N words
You can also run these steps from the terminal:
llp install my_corpus # this is equivalent to python above
llp install my_corpus -parallel 4 # but can access parallel processing with MPI/Slingshot
llp install my_corpus dtm # run a specific step
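What `save_freqs()` above stores for each text is essentially a word-count dictionary serialized to JSON; a minimal stdlib sketch of producing one (the sample text and filename are illustrative):

```python
import json
from collections import Counter

txt = "the cat sat on the mat"
counts = Counter(txt.split())       # word counts as a dictionary

with open('freqs.json', 'w') as f:  # one JSON file of counts per text
    json.dump(counts, f)
```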
Generating this data makes it more convenient to access things like:
mfw = corpus.mfw(n=10000)      # get the 10K most frequent words
dtm = corpus.freqs(words=mfw)  # get a document-term matrix as a pandas dataframe
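Once the document-term matrix is a pandas dataframe, ordinary pandas operations apply. A sketch of turning raw counts into relative frequencies, on a toy DTM with the assumed shape of one row per text and one column per word:

```python
import pandas as pd

# Toy document-term matrix: rows = texts, columns = words
dtm = pd.DataFrame(
    {'the': [10, 4], 'cat': [2, 0], 'dog': [0, 6]},
    index=['text1', 'text2'],
)

# Divide each row by its total count -> relative frequencies summing to 1
rel = dtm.div(dtm.sum(axis=1), axis=0)
```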
You can also build word2vec models:
w2v_model = corpus.word2vec()    # get an llp word2vec model object
w2v_model.model()                # run the modeling process
w2v_model.save()                 # save the model somewhere
gensim_model = w2v_model.gensim  # get the original gensim object
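What a trained model gives you is one vector per word, and the standard query is cosine similarity between those vectors. A self-contained sketch with toy numpy vectors; on the real `gensim_model`, the analogous gensim call would be along the lines of `gensim_model.wv.most_similar(...)` (depending on gensim version):

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model's output
vecs = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.85, 0.82, 0.15]),
    'cat':   np.array([0.1, 0.2, 0.95]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors over their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_kq = cosine(vecs['king'], vecs['queen'])  # near-parallel vectors
sim_kc = cosine(vecs['king'], vecs['cat'])    # near-orthogonal vectors
```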