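BPE Summarizer
This summarizer attempts to leverage Byte Pair Encoding (BPE) tokenization and the Bart vocabulary to filter text by semantic meaningfulness.
BPE text representation is a subword-level approach to tokenization that aims to reuse parts of words efficiently while preserving semantic value.
The algorithm is based on the frequency of n-gram pairs. More frequent pairs are represented by larger tokens.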
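To make the merge idea concrete, here is a minimal character-level sketch of how BPE merges can be learned from pair frequencies. It is a toy illustration only: the function name `learn_bpe_merges` is hypothetical, and the Bart tokenizer works on bytes with a pretrained merge table rather than learning merges like this at run time.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from an iterable of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # The most frequent pair becomes one larger symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe_merges(
    ["low", "low", "lower", "lowest", "new", "newer"], num_merges=2
)
```

After two merges the frequent word "low" is a single symbol, while rarer words are still split into several pieces.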
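This project explores the assumption that token size correlates strongly with semantic meaningfulness. The summarization approach aims to surface the most meaningful sentences by comparing token values and keeping the sentences of the original text that contain meaningful tokens within a specified percentile.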
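The filtering step described above can be sketched as follows. This is not the package's implementation: `filter_sentences`, `nearest_rank_percentile` and `toy_encode` are hypothetical names, the sentence splitter is deliberately naive, and the toy encoder uses word length as a stand-in for real BPE token values. Treating "within the percentile" as "at least one token value at or above the document-level threshold" is also an assumption.

```python
import math
import re


def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile of a non-empty list, percentile in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def filter_sentences(document, encode, percentile=99.0):
    """Keep sentences with at least one token value at or above the threshold."""
    # Naive split on sentence-ending punctuation; illustrative only.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    encoded = [encode(s) for s in sentences]
    all_values = [v for tokens in encoded for v in tokens]
    if not all_values:
        return ""
    # One threshold is computed over every token value in the document.
    threshold = nearest_rank_percentile(all_values, percentile)
    kept = [
        s for s, tokens in zip(sentences, encoded)
        if any(v >= threshold for v in tokens)
    ]
    return " ".join(kept)


def toy_encode(sentence):
    """Hypothetical stand-in for a BPE tokenizer: word length as token value."""
    return [len(word) for word in re.findall(r"\w+", sentence)]


document = "The cat sat. Combinatory grammars derive dependencies. It is so."
summary = filter_sentences(document, toy_encode, percentile=90)
```

With a real BPE tokenizer, `encode` would return the tokenizer's integer values for each sentence instead of word lengths.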
Installation
pip install bpe-summarizer
Usage
Parameters
Parameter | Definition | Default | Type |
---|---|---|---|
… | A text blob with sentences delineated by punctuation | … | … |
… | Sentences that include tokens in the top kth percentile will remain after summarization | … | … |
… | A huggingface … | … | … |
… | If … | … | … |
… | When … | … | … |
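- Note: intra_sentence_percentile is ignored if its value is lower than the percentile score of the mean token value; in that case the percentile score of the mean is used instead.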
Examples
Human Summary
Building Deep Dependency Structures Using A Wide-Coverage CCG Parser
This paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.
The parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.
A set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank.
The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies.
We provide examples showing how heads can fill dependency slots during a derivation, and how long-range dependencies can be recovered through unification of co-indexed head variables.
We define predicate argument structure for CCG in terms of the dependencies that hold between words with lexical functor categories and their arguments.
BPE Summary
Building Deep Dependency Structures Using A Wide-Coverage CCG Parser
This paper describes a wide-coverage statistical parser that uses Combinatory Categorial Grammar (CCG) to derive dependency structures.
The parser differs from most existing wide-coverage treebank parsers in capturing the long-range dependencies inherent in constructions such as coordination, extraction, raising and control, as well as the standard local predicate-argument dependencies.
A set of dependency structures used for training and testing the parser is obtained from a treebank of CCG normal-form derivations, which have been derived (semi-) automatically from the Penn Treebank. The parser correctly recovers over 80% of labelled dependencies, and around 90% of unlabelled dependencies. However, the dependencies are typically derived from a context-free phrase structure.
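Evaluation
To evaluate the quality of the summaries, we use a semantic similarity metric to compare auto-summarized examples with human summaries from the scisummnet dataset. Texts are represented using sentence-level embeddings. Figure 1 compares the results of the BPE Summarizer with widely used summarization techniques. Over 100 samples it performed competitively and completed summarization in one hundredth of a second, compared with 55 seconds.
<small>Figure 1. Evaluation against widely used summarizers</small>
<small>*Performance evaluation was done on CPU, and the competing technique was applied after being split out to use only its summarization component.</small>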
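The comparison step can be illustrated with a small sketch. The text above does not name the exact metric, so cosine similarity between embedding vectors is an assumption here, and `cosine_similarity` is a hypothetical helper. The short vectors stand in for real sentence-level embeddings, which would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for embeddings of a human and an automatic summary.
human_embedding = [0.2, 0.7, 0.1]
auto_embedding = [0.25, 0.65, 0.05]
score = cosine_similarity(human_embedding, auto_embedding)
```

A score close to 1 means the two summaries point in nearly the same direction in embedding space; averaging such scores over a sample set gives one number per summarizer.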
References:
- Language Models are Unsupervised Multitask Learners, Radford et al.
- Huggingface/GPT Tokenizer
- GPT-2/Encoder
- Comparing Transformers and Tokenizers, Németh
- Huggingface Bart Summarization Pipeline