Pandas DataFrame integration for spaCy
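DframCy
DframCy is a lightweight utility module that integrates Pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides a clean API for converting spaCy's linguistic annotations and Matcher and PhraseMatcher results into Pandas DataFrames, and it supports training and evaluating NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.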
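Getting Started
DframCy is easy to install. It needs only the following:
Requirements
- Python 3.5 or later
- Pandas
- spaCy >= 2.2.0
You also need to download a spaCy language model: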
python -m spacy download en_core_web_sm
For more information, see: Models & Languages
Installation:
This package can be installed from PyPI by running:
pip install dframcy
To build from source:
git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install
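Usage
Linguistic Annotations
Get linguistic annotations in a DataFrame. For the available annotations (DataFrame column names), see spaCy's Token API documentation.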
import spacy
from dframcy import DframCy

nlp = spacy.load("en_core_web_sm")
dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)

# can also pass column names (spaCy's linguistic annotation attributes)
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])

# for a separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)

# custom attributes can also be included
from spacy.tokens import Token

fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")
annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
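To make the idea of the columns argument concrete, here is a minimal sketch of how per-token attributes can be turned into table rows. It is not DframCy's implementation: FakeToken and tokens_to_rows are hypothetical names, and FakeToken is only a stand-in for a spaCy Token so the snippet runs without spaCy or a language model.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (not part of spaCy or DframCy)."""
    text: str
    pos_: str
    is_punct: bool


def tokens_to_rows(tokens, columns):
    """Build one dict per token, keyed by the requested attribute names."""
    return [{name: getattr(token, name) for name in columns} for token in tokens]


tokens = [
    FakeToken("Hello", "INTJ", False),
    FakeToken(",", "PUNCT", True),
    FakeToken("world", "NOUN", False),
]

# Select only some attributes, in the spirit of to_dataframe(doc, columns=[...]).
rows = tokens_to_rows(tokens, columns=["text", "is_punct"])
print(rows)
```

A list of dicts like rows can be passed directly to pandas.DataFrame(rows) to get one column per attribute name.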
Rule-Based Matching
# Token-based Matching
import spacy
from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher

nlp = spacy.load("en_core_web_sm")
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", None, pattern)
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)

# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.get_nlp().make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", None, *patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                 u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)
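spaCy's Matcher reports each match as a (match_id, start, end) tuple of token indices, with end exclusive. The sketch below shows one way such tuples could be flattened into table rows. It is an illustration only: matches_to_rows, the column names, the integer match ID and the hand-written token list are all assumptions, not DframCy's actual output format.

```python
def matches_to_rows(tokens, matches, names):
    """Turn (match_id, start, end) tuples into one dict per match."""
    rows = []
    for match_id, start, end in matches:
        rows.append({
            "match_id": match_id,
            "name": names[match_id],          # human-readable pattern name
            "start": start,                   # index of the first matched token
            "end": end,                       # index one past the last matched token
            "span_text": " ".join(tokens[start:end]),
        })
    return rows


# Hand-written tokens for "Hello, world! Hello world!" (no spaCy needed here).
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]

# Hypothetical match: spaCy would use a hash value as the ID; 1 is a placeholder.
matches = [(1, 0, 3)]
names = {1: "HelloWorld"}

rows = matches_to_rows(tokens, matches, names)
print(rows)
```

Only the first "Hello, world" is listed as a match because the pattern requires a punctuation token between the two words, which the second "Hello world" lacks.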
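Command Line Interface
DframCy supports command-line arguments for converting plain text files into linguistically annotated text in CSV/JSON format, and for training and evaluating language models from training data in CSV/XLS format (see: Training data example). The CLI arguments for training and evaluation are exactly the same as spaCy's CLI; the only difference is the format of the training data.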
# convert
dframcy convert -i plain_text.txt -o annotations.csv -t csv
# train
dframcy train -l en -o spacy_models -t train.csv -d test.csv
# evaluate
dframcy evaluate -m spacy_model/ -d test.csv
# train text classifier
dframcy textcat -o spacy_model/ -t data/textcat_training.csv -d data/textcat_training.csv