Python subword-nmt package - PyPI

Unsupervised word segmentation for neural machine translation and text generation

Project description for subword-nmt

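Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on neural machine translation with subword units (see below for the reference).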

Installation

Install via pip (from PyPI):

pip install subword-nmt

Install via pip (from GitHub):

pip install https://github.com/rsennrich/subword-nmt/archive/master.zip

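Alternatively, clone this repository; the scripts are executable stand-alone.

Usage instructions

Check the individual files for usage instructions.

To apply byte pair encoding (BPE) to word segmentation, invoke these commands: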

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
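To make the idea behind learn-bpe concrete, here is a minimal Python sketch of BPE merge learning on a toy vocabulary. It is an illustration only and not the package's implementation: the `</w>` end-of-word marker, the function names, and the toy word counts are assumptions chosen for this example, and ties between equally frequent pairs are resolved by first occurrence, which may differ from the real tool.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the given symbol pair into one symbol."""
    # The lookarounds make sure only whole, space-delimited symbols match.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(vocab, num_operations):
    """Greedily learn up to num_operations merges from a word-frequency dict."""
    merges = []
    for _ in range(num_operations):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first one on ties
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Hypothetical toy data: words are pre-split into characters plus "</w>".
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each learned merge corresponds to one of the {num_operations} requested with -s; applying the merges in the same order to new text is what produces a consistent segmentation.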

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}
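The following Python sketch illustrates what character n-gram segmentation of rare words could look like. It is a simplified illustration rather than the package's code: it assumes that words outside the shortlist are cut into consecutive, non-overlapping chunks of n characters and that every chunk except the last is marked with "@@". The function name and the example shortlist are hypothetical.

```python
def segment_char_ngrams(line, shortlist, n, separator="@@"):
    """Split words that are not in the shortlist into character n-grams."""
    out = []
    for word in line.split():
        if word in shortlist:
            # Frequent words are kept whole.
            out.append(word)
            continue
        chunks = [word[i:i + n] for i in range(0, len(word), n)]
        # Mark every chunk except the last as word-internal.
        out.extend(chunk + separator for chunk in chunks[:-1])
        out.append(chunks[-1])
    return " ".join(out)


# Hypothetical shortlist standing in for the most frequent words of a vocabulary.
shortlist = {"the", "plays"}
print(segment_char_ngrams("the xylophone plays", shortlist, 3))
# the xyl@@ oph@@ one plays
```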

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'
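The same replacement can be done in Python when the output is post-processed inside a script. This sketch reuses the regular expression from the sed command unchanged; the function name is an assumption made for the example.

```python
import re

# Same pattern as the sed command: a marker followed by a space,
# or a marker (optionally followed by a space) at the end of the line.
_BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def restore_segmentation(line):
    """Remove "@@" subword markers and rejoin the pieces into words."""
    return _BPE_MARKER.sub("", line)


print(restore_segmentation("new@@ est low@@ er"))  # newest lower
```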

If you cloned the repository and did not install the package, you can also run the individual commands as scripts:

./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

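Best practice advice for byte pair encoding in NMT

We found that for languages that share an alphabet, learning BPE on the concatenation of the (two or more) involved languages increases the consistency of segmentation and reduces the problem of inserting or deleting characters when copying or transliterating names.

However, this introduces undesirable edge cases: a word may be segmented in a way that has only been observed in the other language, and the resulting symbols are then unknown at test time. To prevent this, apply_bpe.py accepts a --vocabulary and a --vocabulary-threshold option so that the script only produces symbols that also appear in the vocabulary (with at least some frequency).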
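The effect of a vocabulary threshold can be sketched in a few lines of Python. This sketch assumes a vocabulary listing with one "symbol count" pair per line, and it only shows which symbols would pass the filter; it does not reproduce how the tool re-segments symbols that fail it. The function name and the sample counts are hypothetical.

```python
def read_vocabulary(lines, threshold=None):
    """Return the set of symbols whose frequency is at least the threshold."""
    vocabulary = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed line format: "<symbol> <count>"
        symbol, freq = line.rsplit(" ", 1)
        if threshold is None or int(freq) >= threshold:
            vocabulary.add(symbol)
    return vocabulary


# Hypothetical vocabulary entries for one language.
sample = ["the 1200", "new@@ 75", "est 60", "wid@@ 3"]
allowed = read_vocabulary(sample, threshold=50)

# Symbols outside the allowed set are the ones a filter would need to split further.
rare = [s for s in "wid@@ est".split() if s not in allowed]
print(sorted(allowed))  # ['est', 'new@@', 'the']
print(rare)             # ['wid@@']
```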

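To use this functionality, we recommend the following recipe (assuming L1 and L2 are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get the resulting vocabulary for each language: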

cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2

More conveniently, you can do the same with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2
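The per-language vocabulary produced in this step is essentially a frequency count of the symbols in the BPE-segmented training text. A minimal Python sketch of such a count follows; the "symbol count" output format and the ordering (most frequent first, ties broken alphabetically) are assumptions made for this example, as are the function names and sample lines.

```python
from collections import Counter


def count_symbols(lines):
    """Count how often each whitespace-separated symbol occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_vocab(counts):
    """Render counts as "symbol count" lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{symbol} {freq}" for symbol, freq in ordered]


# Hypothetical BPE-segmented training lines for one language.
segmented_lines = ["new@@ est low", "low@@ er low"]
print(format_vocab(count_symbols(segmented_lines)))
# ['low 2', 'er 1', 'est 1', 'low@@ 1', 'new@@ 1']
```

Counting each language separately, as the recipe does, is what allows the later filtering step to reject symbols that were only ever seen in the other language.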

Re-apply byte pair encoding with the vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2

As a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

[You may want to take the union of all vocabularies to support multilingual systems.]

For test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1

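Publications

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.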

Acknowledgments

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union's Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

Changelog

v0.3.6:

Fix to subword-bpe command encoding

v0.3.5:

Fix to subword-bpe command under Python 2
Wider support of the --total-symbols argument

v0.3.4:

segment_tokens method to improve library usability (https://github.com/rsennrich/subword-nmt/pull/52)
Support regex glossaries (https://github.com/rsennrich/subword-nmt/pull/56)
Allow Unicode separators (https://github.com/rsennrich/subword-nmt/pull/57)
New option --total-symbols in learn-bpe (commit 61ad8)
Fix documentation (best practices) (https://github.com/rsennrich/subword-nmt/pull/60)

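v0.3:

Library is now installable via pip
Fix occasional problems with UTF-8 whitespace and newlines in learn_bpe and apply_bpe:
- do not silently convert UTF-8 newline characters into "\n"
- do not silently convert UTF-8 whitespace characters into " "
- UTF-8 whitespace and newline characters are now considered part of a word and are segmented by BPE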

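v0.2:

Different, more consistent handling of the end-of-word token (commit a749a7) (https://github.com/rsennrich/subword-nmt/issues/19)
Allow passing a vocabulary and a frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
Made learn_bpe.py deterministic (commit 4c54e)
Various changes to make the handling of UTF more consistent between Python versions
New command line arguments for apply_bpe.py:
- '--glossaries' to prevent given strings from being affected by BPE
- '--merges' to apply a subset of the learned BPE operations
New command line arguments for learn_bpe.py:
- '--dict-input': interpret the input as a frequency dictionary (as created by get_vocab.py) rather than as a raw text file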

v0.1:

Consistent cross-version Unicode handling
All scripts are now deterministic

