Literary Language Processing: Corpora, Models, and Tools for the Digital Humanities

Detailed description of the llp Python project


llp

Literary Language Processing (llp): corpora, models, and tools for the digital humanities.

Quick start

  1. Install:
pip install llp                       # install with pip in terminal
    Download an existing corpus…
llp status                            # show which corpora/data are available
llp download ECCO_TCP                 # download a corpus

…or import your own:

# the "import" command takes a folder of txt files plus a metadata table
llp import -path_txt mycorpus/txts \
           -path_metadata mycorpus/meta.xls \
           -col_fn filename
# -path_txt        a folder of txt files (use -path_xml for xml)
# -path_metadata   a metadata csv/tsv/xls about those txt files
# -col_fn          the metadata column giving each text's .txt filename

…or start a new one:

llp create                            # then follow the interactive prompt
  2. Then you can load the corpus in Python:
import llp                            # import llp as a python module
corpus = llp.load('ECCO_TCP')         # load the corpus by name or ID

…play with the handy corpus objects…

df = corpus.metadata                  # get corpus metadata as a pandas dataframe
smpl = df.query('1740 < year < 1780') # do a quick query on the metadata
texts = corpus.texts()                # get a convenient Text object for each text
texts_smpl = corpus.texts(smpl.id)    # get Text objects for a specific list of IDs
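
Since corpus.metadata is an ordinary pandas dataframe, the usual pandas idioms apply to it. A minimal sketch, continuing from the df above and assuming the year column is numeric (as the query suggests):

by_decade = df.groupby(df.year // 10 * 10).size()   # number of texts per decade
print(by_decade.sort_index())                        # a quick chronological profile of the corpus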

…and Text objects:

for text in texts_smpl:               # loop over Text objects
    text_meta = text.meta             # get text metadata as dictionary
    author = text.author              # get common metadata as attributes
    txt = text.txt                    # get plain text as string
    xml = text.xml                    # get xml as string
    tokens = text.tokens              # get list of words (incl punct)
    words = text.words                # get list of words (excl punct)
    counts = text.word_counts         # get word counts as dictionary (from JSON if saved)
    ocracc = text.ocr_accuracy        # get estimate of ocr accuracy
    spacy_obj = text.spacy            # get a spacy text object
    nltk_obj = text.nltk              # get an nltk text object
    blob_obj = text.blob              # get a textblob object
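
Because text.words is just a Python list, the standard library is enough for quick aggregate counts. A minimal sketch, reusing the texts_smpl objects and the attribute names listed above:

from collections import Counter       # standard-library word counting

counts = Counter()                    # running tally across the sample
for text in texts_smpl:               # loop over the same Text objects
    counts.update(w.lower() for w in text.words)    # case-fold the punctuation-free word list
print(counts.most_common(10))         # the ten most frequent words in the sample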

Corpus magic

Each corpus object can generate data about itself:

corpus.save_metadata()                # save metadata from xml files (if possible)
corpus.save_plain_text()              # save plain text from xml (if possible)
corpus.save_mfw()                     # save list of all words in corpus and their total count
corpus.save_freqs()                   # save counts as JSON files
corpus.save_dtm()                     # save a document-term matrix with top N words

You can also run these commands from the terminal:

llp install my_corpus                 # this is equivalent to python above
llp install my_corpus -parallel 4     # but can access parallel processing with MPI/Slingshot
llp install my_corpus dtm             # run a specific step

Generating this data makes things like the following easier to access:

mfw = corpus.mfw(n=10000)             # get the 10K most frequent words
dtm = corpus.freqs(words=mfw)         # get a document-term matrix as a pandas dataframe
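
Since the document-term matrix comes back as a pandas dataframe, turning raw counts into relative frequencies takes one line of pandas. A minimal sketch, assuming texts are rows and words are columns:

rel_freqs = dtm.div(dtm.sum(axis=1), axis=0)   # divide each row by that text's total word count
print(rel_freqs.iloc[:5, :10])                 # peek at the first few texts and words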

You can also build word2vec models:

w2v_model = corpus.word2vec()         # get an llp word2vec model object
w2v_model.model()                     # run the modeling process
w2v_model.save()                      # save the model somewhere
gensim_model = w2v_model.gensim       # get the original gensim object
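
Assuming w2v_model.gensim returns a trained gensim Word2Vec model (an assumption, not something stated above), the standard gensim queries then apply; 'virtue' and 'vice' are just placeholder query words:

print(gensim_model.wv.most_similar('virtue', topn=10))   # nearest neighbors in the vector space
print(gensim_model.wv.similarity('virtue', 'vice'))      # cosine similarity between two words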
