
Published 2024-09-27 23:03:30


from datasets import load_dataset  # Hugging Face datasets
from transformers import BertTokenizer  # Hugging Face transformers

def tokenized_dataset(dataset):
    """Tokenize each document in the train, test and validation splits.

    Args:
        dataset (DatasetDict): dataset to tokenize (train, test, val splits)

    Returns:
        dict: for each split, a tuple (tokenized articles, tokenized abstracts)
    """

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]

    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}

if __name__ == "__main__":
    # load_data is not defined in this snippet; presumably the asker's own helper
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)



[['eleven politicians from 7 parties made comments in letter to a newspaper .',
  "said dpp alison saunders had ` damaged public confidence ' in justice .",
  'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
  'the cps has pursued at least 19 suspected paedophiles with dementia .'],
 ['an increasing number of surveys claim to reveal what makes us happiest .',
  'but are these generic lists really of any use to us ?',
  'janet street-porter makes her own list - of things making her unhappy !'],
 ["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
  "` missoula : rape and the justice system in a college town ' was released april 21 .",
  "three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
  'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
  'players .',
  "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
  'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
  'cause .',
  'mr krakauer wrote book after realizing close friend was a rape victim .'],
 ['tesco announced a record annual loss of £ 6.38 billion yesterday .',
  'drop in sales , one-off costs and pensions blamed for financial loss .',
  'supermarket giant now under pressure to close 200 stores nationwide .',
  'here , retail industry veterans , plus mail writers , identify what went wrong .'],
 ['snp leader said alex salmond did not field questions over his family .',
  "said she was not ` moaning ' but also attacked criticism of women 's looks .",
  'she made the remarks in latest programme profiling the main party leaders .',
  'ms sturgeon also revealed her tv habits and recent image makeover .',
  'she said she relaxed by eating steak and chips on a saturday night .']]

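So in the dictionary the keys are plain strings, but every value is a list of strings. Instead of storing a list of strings as each value, I would like to store a list of function objects. That should make the dictionary lighter. How can I do that?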
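One possible reading of the question is that the values should be computed on demand rather than stored up front. A minimal sketch under that assumption follows. `LazyTokenized`, `lazy_tokenized_dataset` and `whitespace_encode` are hypothetical names that do not appear in the original code, and `whitespace_encode` is only a stand-in so the example runs without downloading a model; the `encode` lambda from the snippet above could be passed instead.

```python
from typing import Callable, List, Sequence


class LazyTokenized:
    """Sequence-like view that tokenizes a document only when it is accessed."""

    def __init__(self, documents: Sequence[str], encode: Callable[[str], object]):
        self._documents = documents  # keeps a reference to the raw strings, no copy
        self._encode = encode

    def __len__(self) -> int:
        return len(self._documents)

    def __getitem__(self, index: int):
        # Tokenization happens here, one document per access.
        return self._encode(self._documents[index])


def whitespace_encode(document: str) -> List[str]:
    # Hypothetical stand-in for the BERT tokenizer call (assumption).
    return document.split()


def lazy_tokenized_dataset(dataset, encode=whitespace_encode):
    # Assumes each split is indexable by the column names "article" and "abstract".
    return {
        split: (
            LazyTokenized(columns["article"], encode),
            LazyTokenized(columns["abstract"], encode),
        )
        for split, columns in dataset.items()
    }


# Toy data standing in for the real dataset (assumption).
toy = {
    "train": {"article": ["a b c", "d e"], "abstract": ["a b", "d"]},
    "test": {"article": ["f g"], "abstract": ["f"]},
    "val": {"article": ["h"], "abstract": ["h"]},
}
lazy = lazy_tokenized_dataset(toy)
first_train_article = lazy["train"][0][0]
```

The trade-off is that each access tokenizes again, so this saves memory at the cost of repeated work. Whether it helps depends on how often the same document is read.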



