qcrit Python package - PyPI

Quantitative Criticism Lab

qcrit project description

Utilities for the Quantitative Criticism Lab https://www.qcrit.org

Installation

With pip:

pip install qcrit

With pipenv:

pipenv install qcrit

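About

The qcrit package contains utilities that help with processing and analyzing literature.

Feature extraction

A feature is a result of processing literature. Examples of features include the count of definite articles, the mean sentence length, or the fraction of interrogative sentences. The word "feature" can also refer to the Python function that computes such a value.

To compute features, you must 1) traverse each text in the corpus, 2) parse the text into tokens, 3) write logic to calculate the feature, and 4) output the results to the console or to a file. In addition, this will run slowly unless you 5) cache the tokenized text for features that use the same tokens.

With the textual_feature decorator, steps (1), (2), (4), and (5) are abstracted away: you only need to implement (3), the logic to calculate each feature.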
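To make the value of step (5) concrete, here is a minimal sketch in plain Python of what doing all five steps by hand might look like. It does not use qcrit and is not how qcrit is implemented; the names tokenize_words, num_words, and extract, the in-memory corpus, and the regex tokenizer are all assumptions made for illustration.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def tokenize_words(text):
    """Hypothetical tokenizer (step 2), cached per text (step 5)."""
    return tuple(w for w in re.split(r'\W+', text.lower()) if w)


def count_definite_article(text):
    # Step 3: feature logic built on the cached tokens.
    return tokenize_words(text).count('the')


def num_words(text):
    # A second feature that reuses the same tokens.
    return len(tokenize_words(text))


def extract(corpus, features):
    """Steps 1 and 4: visit every text and collect each feature's value."""
    return {
        name: {feature.__name__: feature(text) for feature in features}
        for name, text in corpus.items()
    }


# Hypothetical in-memory corpus standing in for files on disk.
corpus = {'a.txt': 'The cat saw the dog.', 'b.txt': 'A bird sang.'}
results = extract(corpus, [count_definite_article, num_words])
print(results)
```

Each text is tokenized once even though two features consume the tokens, which is the saving that step (5) describes.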

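Once you have written a feature as a Python function, label it with the textual_feature decorator. Your feature must take exactly one argument, which is assumed to be the parsed text of a file.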

from qcrit.textual_feature import textual_feature

@textual_feature()
def count_definite_article(text):
    return text.count('the')

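The textual_feature decorator takes an argument that represents the type of tokenization.

There are four supported tokenization types: 'sentences', 'words', 'sentence_words', and None. This tells the function in what format it will receive the 'text' parameter.

If None, the function receives the text parameter as a string.
If 'sentences', the function receives the text parameter as a list of sentences, each of which is a string.
If 'words', the function receives the text parameter as a list of words.
If 'sentence_words', the function receives the text parameter as a list of sentences, each of which is a list of words.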
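The four shapes listed above can be illustrated with a small standalone sketch. The splitters below are naive, hypothetical stand-ins written for this example, not qcrit's tokenizers, and the terminal_punctuation parameter here belongs to this sketch only.

```python
import re


def split_sentences(text, terminal_punctuation=('.', '?')):
    """Naive splitter: break on whitespace that follows terminal punctuation."""
    marks = re.escape(''.join(terminal_punctuation))
    return [s for s in re.split('(?<=[' + marks + r'])\s+', text.strip()) if s]


def split_words(text):
    """Naive splitter: keep runs of word characters, drop punctuation."""
    return re.findall(r'\w+', text)


raw = 'Who is there? It is the cat.'

# What a feature function would receive under each tokenization type.
shapes = {
    None: raw,
    'sentences': split_sentences(raw),
    'words': split_words(raw),
    'sentence_words': [split_words(s) for s in split_sentences(raw)],
}

for tokenize_type, value in shapes.items():
    print(repr(tokenize_type), '->', value)
```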

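from functools import reduce
from qcrit.textual_feature import textual_feature

@textual_feature(tokenize_type='sentences')
def mean_sentence_len(text):
    # Sum the lengths of all sentences, starting the running total at 0
    sen_len = reduce(lambda cur_len, cur_sen: cur_len + len(cur_sen), text, 0)
    num_sentences = len(text)
    return sen_len / num_sentences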
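Two more feature bodies, written as plain functions so they can be run without qcrit installed. This is only a sketch: the function names are hypothetical, and registering them would presumably mean adding the textual_feature decorator with the tokenization type noted in each comment.

```python
def fraction_interrogative(sentences):
    """Share of sentences ending in '?'.

    Expects the 'sentences' shape: a list of strings.
    Returns 0.0 for an empty text instead of dividing by zero.
    """
    if not sentences:
        return 0.0
    return sum(s.rstrip().endswith('?') for s in sentences) / len(sentences)


def mean_word_len(words):
    """Mean number of characters per word.

    Expects the 'words' shape: a list of strings.
    """
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


print(fraction_interrogative(['Who is there?', 'It is the cat.']))
print(mean_word_len(['It', 'is', 'the', 'cat']))
```

Guarding against empty input is a design choice of this sketch; a mean that divides by the number of sentences fails on a file that yields no sentences.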

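Use qcrit.extract_features.main to run all the functions labeled with the decorator and output the results to a file.

corpus_dir - the directory to search for files containing texts; all sub-directories are traversed as well

file_extension_to_parse_function - a map from the file extension (e.g. 'txt', 'tess') of the texts you would like to parse to a function that directs how to parse them

output_file - the file to write the results to, created so that it can be analyzed during the machine learning phase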
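As a rough sketch of how an extension-to-parser mapping can drive a recursive corpus walk, the standalone example below builds a temporary corpus and parses only the files whose extension has a registered parser. The names parse_txt and parse_corpus are hypothetical, and it is an assumption of this sketch that a parse function takes a file path and returns the text as a single string.

```python
import os
import tempfile


def parse_txt(file_name):
    """Hypothetical parser: return the whole file as one string."""
    with open(file_name, encoding='utf-8') as f:
        return f.read()


def parse_corpus(corpus_dir, file_extension_to_parse_function):
    """Walk corpus_dir recursively and parse files with a known extension."""
    parsed = {}
    for root, _dirs, files in os.walk(corpus_dir):
        for name in sorted(files):
            extension = os.path.splitext(name)[1].lstrip('.')
            parser = file_extension_to_parse_function.get(extension)
            if parser is not None:
                path = os.path.join(root, name)
                parsed[os.path.relpath(path, corpus_dir)] = parser(path)
    return parsed


with tempfile.TemporaryDirectory() as corpus_dir:
    os.makedirs(os.path.join(corpus_dir, 'epic'))
    with open(os.path.join(corpus_dir, 'epic', 'iliad.txt'), 'w', encoding='utf-8') as f:
        f.write('Sing, goddess.')
    with open(os.path.join(corpus_dir, 'notes.md'), 'w', encoding='utf-8') as f:
        f.write('ignored')
    parsed = parse_corpus(corpus_dir, {'txt': parse_txt})

print(parsed)
```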

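For sentence tokenization to work correctly, setup_tokenizers() must be called with the terminal punctuation of the language being analyzed. Make sure this is done before any features are declared.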

from qcrit.extract_features import main, parse_tess
from qcrit.textual_feature import setup_tokenizers

setup_tokenizers(terminal_punctuation=('.', '?'))

from somewhere_else import count_definite_article, mean_sentence_len

main(
    corpus_dir='demo',
    file_extension_to_parse_function={'tess': parse_tess},
    output_file='output.pickle'
)

Output:

Extracting features from .tess files in demo/
100%|██████████████████████████████████████████|4/4 [00:00<00:00,  8.67it/s]
Feature mining complete. Attempting to write feature results to "output.pickle"...
Success!


Feature mining elapsed time: 1.4919 seconds

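Analysis

Use the @model_analyzer() decorator to label functions that analyze machine learning models.

Invoke analyze_models.main('output.pickle', 'classifications.csv') to run all functions labeled with the @model_analyzer() decorator. To run only one function, pass the name of that function as the third parameter to analyze_models.main().

output.pickle: Now that the features have been extracted and written to output.pickle, we can run machine learning models on them.

classifications.csv: The file classifications.csv contains, for every file in the corpus, the file name in the first column and its classification (prose or verse) in the second column.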
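The two-column layout described above can be read with the standard csv module. The sketch below is illustrative only: the sample rows are made up for this example, and it assumes no header row and literal 'prose' / 'verse' labels, which may not match the exact file format that qcrit expects.

```python
import csv
import io

# Illustrative rows only; a real classifications.csv may differ in detail.
sample = io.StringIO(
    'demo/aristotle.poetics.tess,prose\n'
    'demo/euripides.heracles.tess,verse\n'
)

# Column 1 is the file name, column 2 is its classification.
labels = {file_name: label for file_name, label in csv.reader(sample)}

# Align labels with the order in which features were extracted.
file_names = ['demo/euripides.heracles.tess', 'demo/aristotle.poetics.tess']
target = [labels[name] for name in file_names]
print(target)
```

Keying the labels by file name means the order of rows in the CSV does not have to match the order of the extracted feature rows.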

import qcrit.analyze_models
from qcrit.model_analyzer import model_analyzer
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

@model_analyzer()
def feature_rankings(data, target, file_names, feature_names, labels_key):
    print('-' * 40 + '\nRandom Forest Classifier feature rankings\n')
    features_train, features_test, labels_train, _ = train_test_split(
        data, target, test_size=0.5, random_state=0
    )
    clf = ensemble.RandomForestClassifier(random_state=0, n_estimators=10)
    clf.fit(features_train, labels_train)
    clf.predict(features_test)

    # Display features in order of importance
    print('Feature importances:')
    for tup in sorted(zip(feature_names, clf.feature_importances_), key=lambda s: -s[1]):
        print('\t%f: %s' % (tup[1], tup[0]))

@model_analyzer()
def classifier_accuracy(data, target, file_names, feature_names, labels_key):
    print('-' * 40 + '\nRandom Forest Classifier accuracy\n')
    features_train, features_test, labels_train, labels_test = train_test_split(
        data, target, test_size=0.5, random_state=0
    )
    clf = ensemble.RandomForestClassifier(random_state=0, n_estimators=10)
    clf.fit(features_train, labels_train)
    results = clf.predict(features_test)

    print('Stats:')
    print('\tNumber correct: ' + str(accuracy_score(labels_test, results, normalize=False)) + ' / ' + str(len(results)))
    print('\tPercentage correct: ' + str(accuracy_score(labels_test, results) * 100) + '%')

@model_analyzer()
def misclassified_texts(data, target, file_names, feature_names, labels_key):
    print('-' * 40 + '\nRandom Forest Classifier misclassified texts\n')
    features_train, features_test, labels_train, labels_test, idx_train, idx_test = train_test_split(
        data, target, range(len(target)), test_size=0.5, random_state=0
    )
    print('Train texts:\n\t' + '\n\t'.join(file_names[i] for i in idx_train) + '\n')
    print('Test texts:\n\t' + '\n\t'.join(file_names[i] for i in idx_test) + '\n')
    clf = ensemble.RandomForestClassifier(random_state=0, n_estimators=10)
    clf.fit(features_train, labels_train)
    results = clf.predict(features_test)

    print('Misclassifications:')
    for i, _ in enumerate(results):
        if results[i] != labels_test[i]:
            print('\t' + file_names[idx_test[i]])

qcrit.analyze_models.main('output.pickle', 'classifications.csv')

Output:

----------------------------------------
Random Forest Classifier feature rankings

Feature importances:
	0.400000: num_conjunctions
	0.400000: num_interrogatives
	0.200000: mean_sentence_length


Elapsed time: 0.0122 seconds

----------------------------------------
Random Forest Classifier accuracy

Stats:
	Number correct: 1 / 2
	Percentage correct: 50.0%


Elapsed time: 0.0085 seconds

----------------------------------------
Random Forest Classifier misclassified texts

Train texts:
	demo/aristotle.poetics.tess
	demo/aristophanes.ecclesiazusae.tess

Test texts:
	demo/euripides.heracles.tess
	demo/plato.respublica.part.1.tess

Misclassifications:
	demo/plato.respublica.part.1.tess


Elapsed time: 0.0082 seconds

Development

To activate the virtual environment, make sure pipenv is installed, then run the following commands:

pipenv shell
pipenv install --dev

Demo

python demo/demo.py

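Submission

The following commands submit the package to the Python Package Index. It may be necessary to increment the version number in setup.py and to delete any previously generated dist/ and build/ directories.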

python setup.py bdist_wheel sdist
twine upload dist/*

qcrit 0.0.12
