How to load a torchtext dataset for machine translation in PyTorch?

Published 2024-10-01 07:11:07


I'm new to torchtext and have been learning the basics with the Multi30k dataset. Now I'd like to move on to other datasets, such as IWSLT2017. I read the documentation, which shows how to load the data.

This is how I load the Multi30k dataset:

# creating the fields

SRC = data.Field(
    tokenize=tokenize_de,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>"
)
TRG = data.Field(
    tokenize=tokenize_en,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>"
)

### Splitting the sets
train_data, valid_data, test_data = datasets.Multi30k.splits(
    exts=('.de', '.en'),
    fields=(SRC, TRG)
)

When I run this:

print(vars(train_data.examples[0]))

I get:

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}

My question is: how can I load IWSLT2017 so that calling print(vars(train_data.examples[0])) gives a similar result?

Here is what I tried:

from torchtext.datasets import IWSLT2017
train_iter, valid_iter, test_iter = IWSLT2017(
    root='.data', split=('train', 'valid', 'test'), language_pair=('it', 'en')
)
src_sentence, tgt_sentence = next(train_iter)

It returns a tuple like this:

('Sono impressionato da questa conferenza, e voglio ringraziare tutti voi per i tanti, lusinghieri commenti, anche perché... Ne ho bisogno!!!\n',
 'I have been blown away by this conference, and I want to thank all of you for the many nice comments\n')

My question is: how do I get from this step to output like the following?

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}

Any help would be appreciated.


1 answer

Answer #1 · Posted 2024-10-01 07:11:07

For this you can use spaCy's processing pipeline for tokenization. An example looks like this:

import spacy
from torchtext.datasets import IWSLT2017

train_iter, valid_iter, test_iter = IWSLT2017(
    root='.data', split=('train', 'valid', 'test'), language_pair=('it', 'en')
)

src_sentence, tgt_sentence = next(train_iter)
print(src_sentence, tgt_sentence)

# Tokenize the Italian source sentence
nlp = spacy.load("it_core_news_sm")
for doc in nlp.pipe([src_sentence]):
    print([tok.text for tok in doc])

# Tokenize the English target sentence
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe([tgt_sentence]):
    print([tok.text for tok in doc])

Output for the first example sentence:

Grazie mille, Chris. E’ veramente un grande onore venire su questo palco due volte. Vi sono estremamente grato.
Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.

Output of the tokenized sentences:

['Grazie', 'mille', ',', 'Chris', '.', 'E', '’', 'veramente', 'un', 'grande', 'onore', 'venire', 'su', 'questo', 'palco', 'due', 'volte', '.', 'Vi', 'sono', 'estremamente', 'grato', '.', '\n']
['Thank', 'you', 'so', 'much', ',', 'Chris', '.', 'And', 'it', "'s", 'truly', 'a', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', ';', 'I', "'m", 'extremely', 'grateful', '.', '\n']
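To get from the raw (source, target) tuples to Multi30k-style example dicts, you can wrap the tokenization and lowercasing into a small helper and map it over the iterator. A minimal sketch; the `to_example` helper is my own, and the whitespace-split default is a stand-in for the spaCy tokenizers above:

```python
def to_example(pair, src_tokenize=str.split, trg_tokenize=str.split):
    """Turn a raw (src, trg) sentence pair into the dict format that
    vars(train_data.examples[0]) shows for the legacy Multi30k dataset."""
    src_sentence, trg_sentence = pair
    return {
        'src': [tok.lower() for tok in src_tokenize(src_sentence.strip())],
        'trg': [tok.lower() for tok in trg_tokenize(trg_sentence.strip())],
    }

# Demo on a raw tuple like the one in the question; with the real
# dataset you would write: examples = map(to_example, train_iter)
pair = ('Sono impressionato da questa conferenza.\n',
        'I have been blown away by this conference.\n')
print(to_example(pair))
```

Pass the spaCy tokenizers in place of `str.split` (e.g. a function that returns `[tok.text for tok in nlp(text)]`) to reproduce the punctuation-splitting behavior shown above.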
