Natural Language Structuring Library

Detailed description of the nlstruct Python project


Main features

Pandas-based preprocessing

Most input or inferred data can be represented as dataframes of features, ids and span indices.

The library therefore leverages pandas' advanced frame indexing to combine features, making preprocessing fast and explicit.
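The idea can be illustrated with a minimal, library-independent sketch in plain pandas (the column names here are hypothetical toy data, not nlstruct's actual schema): mention spans stored in one dataframe are joined against a dataframe of documents, and vectorized indexing recovers the mention text.

```python
import pandas as pd

# Hypothetical toy data: documents and mention spans, keyed by doc_id
docs = pd.DataFrame({"doc_id": [1, 2], "text": ["skin tumour found", "colon cancer risk"]})
mentions = pd.DataFrame({"doc_id": [1, 2], "begin": [0, 0], "end": [11, 12]})

# A single merge aligns every span with its document's text...
merged = mentions.merge(docs, on="doc_id")
# ...and the spans then slice out the mention surface forms
merged["mention_text"] = [t[b:e] for t, b, e in zip(merged["text"], merged["begin"], merged["end"])]
print(merged["mention_text"].tolist())  # ['skin tumour', 'colon cancer']
```

Keeping everything as flat, id-keyed dataframes is what makes this kind of preprocessing both fast and easy to inspect.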

Easy nested / relational batching

In structured problems, and especially with text data, features can be highly interrelated.

This library introduces a flexible yet efficient batching structure that allows switching between numpy, scipy and torch matrices, and makes it easy to slice relational data.
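A minimal sketch of the kind of switch described above, using plain scipy and torch rather than the library's own API: ragged token ids are held in one sparse matrix, sliced row-wise while still sparse, and only then densified into a torch tensor.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix

# Hypothetical ragged token ids for 3 sentences, stored as one padded sparse matrix
rows = np.array([0, 0, 0, 1, 1, 2])
cols = np.array([0, 1, 2, 0, 1, 0])
data = np.array([5, 9, 2, 7, 3, 8])
tokens = csr_matrix((data, (rows, cols)), shape=(3, 3))

# Row slicing stays sparse and cheap; densify only the rows you need
batch = torch.as_tensor(tokens[[0, 2]].toarray())
print(batch.shape)  # torch.Size([2, 3])
```

Deferring densification until after slicing is what keeps batching cheap when most of a padded matrix is never touched.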

Caching

Every procedure can be conveniently cached through an explicit, flexible and efficient caching mechanism. A smart parameter-hashing function handles numpy, pandas and torch (cuda/cpu) data structures and models seamlessly, and returns hashes that are unique across machines.

This caching mechanism is useful for checkpointing models, resuming training from a given epoch, and instantly loading frequently used preprocessed data. Logs can also be saved and replayed during cache loading.

Other features

  • Relative, shareable paths
  • Text splitting and transformation, recording the transformations so they can be applied to spans or inverted at prediction time
  • Multiple data loaders for easy access to NLP datasets and better reproducibility
  • A training helper that seamlessly resumes training from the last checkpoint, if any
  • Random seed helpers to ensure reproducibility (covering the built-in, numpy and torch random generators)
  • Colored, pretty tables for monitoring training
  • Multiple tokenizers (spacy, transformers, regex)
  • brat/conll exporters
  • Automatic mention span <-> token tag conversion for linear tagging CRFs
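The last feature above (span <-> tag conversion for linear tagging CRFs) can be illustrated with a generic BIO scheme; these helpers are a hypothetical sketch, not nlstruct's actual tagging scheme:

```python
def spans_to_bio(n_tokens, spans):
    """Convert (begin, end) token spans to BIO tags (toy sketch)."""
    tags = ["O"] * n_tokens
    for begin, end in spans:
        tags[begin] = "B"
        for i in range(begin + 1, end):
            tags[i] = "I"
    return tags

def bio_to_spans(tags):
    """Recover (begin, end) token spans from BIO tags."""
    spans, begin = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and begin is not None:
            spans.append((begin, i))
            begin = None
        if tag == "B":
            begin = i
    return spans

tags = spans_to_bio(6, [(1, 3), (4, 5)])
print(tags)                # ['O', 'B', 'I', 'O', 'B', 'O']
print(bio_to_spans(tags))  # [(1, 3), (4, 5)]
```

Round-tripping spans through tags like this is what lets a linear CRF, which predicts one tag per token, be trained from and evaluated against span annotations.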

Example

Below is an example of custom preprocessing on the NCBI disease dataset.

We split the documents into sentences, apply substitutions to the text and shift the mention spans accordingly. Finally, we build a batching structure to iterate over documents and automatically query the related sentences, tokens and mentions.

>>> from nlstruct.dataloaders import load_ncbi_disease
>>> from nlstruct.utils import normalize_vocabularies, encode_ids, df_to_csr, assign_sorted_id
>>> from nlstruct.text import huggingface_tokenize, split_into_spans, regex_sentencize, apply_substitutions, apply_deltas, partition_spans
>>> from nlstruct.collections import Batcher
>>> from transformers import AutoTokenizer

>>> ncbi = load_ncbi_disease()
>>> docs, mentions, fragments = ncbi[["docs", "mentions", "fragments"]]
>>> mentions = mentions.merge(fragments)
>>> print(ncbi)
Dataset(
  (docs):       792 * ('doc_id', 'text', 'split')
  (mentions):  6881 * ('doc_id', 'mention_id', 'category')
  (labels):    7059 * ('label_id', 'doc_id', 'mention_id', 'label')
  (fragments): 6881 * ('doc_id', 'mention_id', 'begin', 'end', 'fragment_id')
)

>>> sentences = regex_sentencize(docs, reg_split='((?<=\.)[\n ](?:[A-Z]))')
>>> [mentions] = partition_spans([mentions], sentences, overlap_policy=False)[0]
>>> print(sentences.head(5))
   sentence_idx  begin  end                            text    doc_id  split sentence_id
0             0      0   77  A common human skin tumour...  10192393  train         0/0
1             1     78  141  WNT signalling orchestrate...  10192393  train         0/1
2             2    142  312  In response to this stimul...  10192393  train         0/2
3             3    313  477  One of the target genes fo...  10192393  train         0/3
4             4    478  742  Most colon cancers arise f...  10192393  train         0/4
>>> print(mentions.head(5))
     doc_id sentence_id  mention_id fragment_id         category  begin  end
0  10192393         0/0  10192393-0  10192393-0     DiseaseClass     15   26
1  10192393         0/3  10192393-1  10192393-1     DiseaseClass    130  136
2  10192393         0/4  10192393-2  10192393-2     DiseaseClass      5   18
3  10192393         0/4  10192393-3  10192393-3  SpecificDisease     61   87
4  10192393         0/4  10192393-4  10192393-4  SpecificDisease     89   92

>>> # Get substitution `deltas` to shift train mentions and restore true char positions on predictions
>>> sentences, deltas = apply_substitutions(sentences, [r"sk(.)n"], [r"sk\1\1\1\1\1\1n"], doc_cols=("doc_id", "sentence_id"), apply_unidecode=True)
>>> mentions = apply_deltas(mentions, deltas, on="sentence_id")
>>> print(sentences.head(5))
                             text  sentence_idx    doc_id  split sentence_id
0  A common human skiiiiiin t...             0  10192393  train         0/0
1  WNT signalling orchestrate...             1  10192393  train         0/1
2  In response to this stimul...             2  10192393  train         0/2
3  One of the target genes fo...             3  10192393  train         0/3
4  Most colon cancers arise f...             4  10192393  train         0/4
>>> print(mentions.head(5))
     doc_id sentence_id  mention_id fragment_id         category  begin    end
0  10192393         0/0  10192393-0  10192393-0     DiseaseClass   15.0   31.0  <-- notice that the end has moved due to the substitution
1  10192393         0/3  10192393-1  10192393-1     DiseaseClass  130.0  136.0
2  10192393         0/4  10192393-2  10192393-2     DiseaseClass    5.0   18.0
3  10192393         0/4  10192393-3  10192393-3  SpecificDisease   61.0   87.0
4  10192393         0/4  10192393-4  10192393-4  SpecificDisease   89.0   92.0

>>> tokens = huggingface_tokenize(sentences, AutoTokenizer.from_pretrained('camembert-base'))
>>> # Express mentions as token spans instead of char spans
>>> mentions = split_into_spans(mentions, tokens, pos_col="token_idx")
>>> mentions = assign_sorted_id(mentions, "mention_idx", groupby=["doc_id", "sentence_id"], sort_on="begin")
>>> print(tokens.head(5))
   id  token_id  token_idx token  begin  end  sentence_idx    doc_id  split sentence_id
0   0         0          0   <s>      0    0             0  10192393  train         0/0
1   0         1          1     A      0    1             0  10192393  train         0/0
2   0         2          2  comm      2    6             0  10192393  train         0/0
3   0         3          3    on      6    8             0  10192393  train         0/0
4   0         4          4            9    9             0  10192393  train         0/0
>>> print(mentions.head(5))
     doc_id sentence_id  mention_id fragment_id         category  begin  end  mention_idx
0  10094559       234/1  10094559-1  10094559-1  SpecificDisease      1    7            0
1   3258663       548/1   3258663-2   3258663-2  SpecificDisease      1   10            0
2  10633128       465/1  10633128-1  10633128-1  SpecificDisease      1    7            0
3   8252631       268/7  8252631-11  8252631-11         Modifier      1    4            0
4   7437512       417/0   7437512-0   7437512-0  SpecificDisease      1    9            0

>>> # Encode objects / strings etc. that are not ids as pandas categories
>>> [sentences, tokens, mentions], vocabularies = normalize_vocabularies([sentences, tokens, mentions], train_vocabularies={"text": False})
>>> # Encode doc/sentence/mention ids as integers
>>> unique_mention_ids = encode_ids([mentions], ("doc_id", "mention_id"), inplace=True)
>>> unique_sentence_ids = encode_ids([sentences, mentions, tokens], ("doc_id", "sentence_id"), inplace=True)
>>> unique_doc_ids = encode_ids([docs, sentences, mentions, tokens], "doc_id", inplace=True)

>>> # Create the batcher collection
>>> batcher = Batcher({
...     "doc": {
...         "doc_id": docs["doc_id"],
...         "sentence_id": df_to_csr(sentences["doc_id"], sentences["sentence_idx"], sentences["sentence_id"]),
...         "sentence_mask": df_to_csr(sentences["doc_id"], sentences["sentence_idx"]),
...     },
...     "sentence": {
...         "sentence_id": sentences["sentence_id"],
...         "token": df_to_csr(tokens["sentence_id"], tokens["token_idx"], tokens["token"].cat.codes),
...         "token_mask": df_to_csr(tokens["sentence_id"], tokens["token_idx"]),
...         "mention_id": df_to_csr(mentions["sentence_id"], mentions["mention_idx"], mentions["mention_id"]),
...         "mention_mask": df_to_csr(mentions["sentence_id"], mentions["mention_idx"]),
...     },
...     "mention": {
...         "mention_id": mentions["mention_id"],
...         "begin": mentions["begin"],
...         "end": mentions["end"],
...         "category": mentions["category"].cat.codes,
...     },
... }, masks={"sentence": {"mention_id": "mention_mask", "token": "token_mask"}, "doc": {"sentence_id": "sentence_mask"}})
>>> print(batcher)
Batcher(
  [doc]:
    (doc_id): ndarray[int64](792,)
    (sentence_id): csr_matrix[int64](792, 44)
    (sentence_mask): csr_matrix[bool](792, 44)
  [sentence]:
    (sentence_id): ndarray[int64](6957,)
    (token): csr_matrix[int16](6957, 211)
    (token_mask): csr_matrix[bool](6957, 211)
    (mention_id): csr_matrix[int64](6957, 13)
    (mention_mask): csr_matrix[bool](6957, 13)
  [mention]:
    (mention_id): ndarray[int64](6881,)
    (begin): ndarray[int64](6881,)
    (end): ndarray[int64](6881,)
    (category): ndarray[int8](6881,)
)

>>> # Query some documents and convert them to torch
>>> batch = batcher["doc"][[3, 4, 5]].densify(torch.device('cpu'))
>>> print(batch)
Batcher(
  [doc]:
    (doc_id): tensor[torch.int64](3,)
    (sentence_id): tensor[torch.int64](3, 9)
    (sentence_mask): tensor[torch.bool](3, 9)
    (@sentence_id): tensor[torch.int64](3, 9)    <-- indices relative to the batch have been created
    (@sentence_mask): tensor[torch.bool](3, 9)
  [sentence]:
    (sentence_id): tensor[torch.int64](22,)
    (token): tensor[torch.int64](22, 74)
    (token_mask): tensor[torch.bool](22, 74)     <-- the token tensor has been resized to remove excess pad tokens
    (mention_id): tensor[torch.int64](22, 3)
    (mention_mask): tensor[torch.bool](22, 3)
    (@mention_id): tensor[torch.int64](22, 3)
    (@mention_mask): tensor[torch.bool](22, 3)
  [mention]:
    (mention_id): tensor[torch.int64](22,)
    (begin): tensor[torch.int64](22,)
    (end): tensor[torch.int64](22,)
    (category): tensor[torch.int64](22,)
)

>>> # Easily access tensors in the batch
>>> print(batch["sentence", "token"].shape)
torch.Size([22, 74])

Installation

This project is still under development and subject to change.

