Literary Language Processing: corpora, models, and tools for the digital humanities
A detailed description of the llp Python project.
llp
Literary Language Processing (llp): corpora, models, and tools for the digital humanities.
Quick start
- Install:
pip install llp # install with pip in terminal
- Download an existing corpus…
llp status # show which corpora/data are available
llp download ECCO_TCP # download a corpus
…or import your own:
# use the "import" command:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         the metadata column holding each file's .txt filename
llp import -path_txt mycorpus/txts -path_metadata mycorpus/meta.xls -col_fn filename
…or start a new one:
llp create # then follow the interactive prompt
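For the `llp import` step above, the metadata file just needs one row per text plus a column (named `filename` here, matching `-col_fn`) pointing at each .txt file. A minimal sketch of writing such a file with the standard library; the other column names and values are purely illustrative:

```python
import csv

# Only the "filename" column is required (it is what -col_fn points at);
# any other columns (author, year, ...) become queryable corpus metadata.
rows = [
    {'filename': 'text001.txt', 'author': 'Burney, Frances', 'year': 1778},
    {'filename': 'text002.txt', 'author': 'Fielding, Henry', 'year': 1749},
]

with open('meta.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['filename', 'author', 'year'])
    writer.writeheader()
    writer.writerows(rows)
```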
- Then load the corpus in Python:
import llp                     # import llp as a python module
corpus = llp.load('ECCO_TCP')  # load the corpus by name or ID
…play with the convenient corpus objects…
df = corpus.metadata                   # get corpus metadata as a pandas dataframe
smpl = df.query('1740 < year < 1780')  # do a quick query on the metadata
texts = corpus.texts()                 # get a convenient Text object for each text
texts_smpl = corpus.texts(smpl.id)     # get Text objects for a specific list of IDs
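The `df.query(...)` call above is ordinary pandas, so it can be tried out on any toy metadata frame; a self-contained sketch with illustrative columns and values:

```python
import pandas as pd

# Toy metadata frame standing in for corpus.metadata
df = pd.DataFrame({
    'id':   ['t1', 't2', 't3'],
    'year': [1735, 1760, 1790],
})

smpl = df.query('1740 < year < 1780')  # same query syntax as above
ids = list(smpl.id)                    # IDs you could pass to corpus.texts(...)
```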
…and with Text objects:
for text in texts_smpl:         # loop over Text objects
    text_meta = text.meta       # get text metadata as a dictionary
    author = text.author        # get common metadata as attributes
    txt = text.txt              # get plain text as a string
    xml = text.xml              # get xml as a string
    tokens = text.tokens        # get list of words (incl. punctuation)
    words = text.words          # get list of words (excl. punctuation)
    counts = text.word_counts   # get word counts as a dictionary (from JSON if saved)
    ocracc = text.ocr_accuracy  # get an estimate of OCR accuracy
    spacy_obj = text.spacy      # get a spacy text object
    nltk_obj = text.nltk        # get an nltk text object
    blob_obj = text.blob        # get a textblob object
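The `tokens` vs `words` distinction above (with and without punctuation) can be sketched with a crude regex tokenizer; llp's actual tokenization rules may well differ:

```python
import re

txt = "A Young Lady's Entrance into the World; a novel."

# Crude tokenizer: word-like runs (incl. apostrophes), punctuation as its own token
tokens = re.findall(r"[\w']+|[^\w\s]", txt)           # incl. punctuation
words = [t for t in tokens if re.match(r"[\w']", t)]  # excl. punctuation
```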
Corpus magic
Each corpus object can generate data about itself:
corpus.save_metadata()    # save metadata from xml files (if possible)
corpus.save_plain_text()  # save plain text from xml (if possible)
corpus.save_mfw()         # save a list of all words in the corpus with their total counts
corpus.save_freqs()       # save word counts per text as JSON files
corpus.save_dtm()         # save a document-term matrix with the top N words
You can also run these steps from the terminal:
llp install my_corpus # this is equivalent to python above
llp install my_corpus -parallel 4 # but can access parallel processing with MPI/Slingshot
llp install my_corpus dtm # run a specific step
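What `save_freqs()` above stores for each text is essentially a word-count dictionary serialized to JSON; a minimal stdlib sketch of producing one (the sample text and filename are illustrative):

```python
import json
from collections import Counter

txt = "the cat sat on the mat"
counts = Counter(txt.split())       # word counts as a dictionary

with open('freqs.json', 'w') as f:  # one JSON file of counts per text
    json.dump(counts, f)
```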
Generating this data makes it more convenient to access things like:
mfw = corpus.mfw(n=10000)      # get the 10K most frequent words
dtm = corpus.freqs(words=mfw)  # get a document-term matrix as a pandas dataframe
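Once the document-term matrix is a pandas dataframe, ordinary pandas operations apply. A sketch of turning raw counts into relative frequencies, on a toy DTM with the assumed shape of one row per text and one column per word:

```python
import pandas as pd

# Toy document-term matrix: rows = texts, columns = words
dtm = pd.DataFrame(
    {'the': [10, 4], 'cat': [2, 0], 'dog': [0, 6]},
    index=['text1', 'text2'],
)

# Divide each row by its total count -> relative frequencies summing to 1
rel = dtm.div(dtm.sum(axis=1), axis=0)
```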
You can also build word2vec models:
w2v_model = corpus.word2vec()    # get an llp word2vec model object
w2v_model.model()                # run the modeling process
w2v_model.save()                 # save the model somewhere
gensim_model = w2v_model.gensim  # get the original gensim object
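What a trained model gives you is one vector per word, and the standard query is cosine similarity between those vectors. A self-contained sketch with toy numpy vectors; on the real `gensim_model`, the analogous gensim call would be along the lines of `gensim_model.wv.most_similar(...)` (depending on gensim version):

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model's output
vecs = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.85, 0.82, 0.15]),
    'cat':   np.array([0.1, 0.2, 0.95]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors over their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_kq = cosine(vecs['king'], vecs['queen'])  # near-parallel vectors
sim_kc = cosine(vecs['king'], vecs['cat'])    # near-orthogonal vectors
```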