transformers datasets

2024/10/24 23:24:17 Source: https://blog.csdn.net/qq_41685627/article/details/139899877

☆ Problem description

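Natural language processing projects often need to load and process many different datasets. The datasets library simplifies this by offering one consistent way to load, split, inspect, and process data. This solution gives examples of each step: loading online datasets, splitting a dataset, selecting and filtering rows, mapping a function over the data, and saving the result.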

★ Solution

from datasets import load_dataset

# load online datasets
datasets = load_dataset("madao33/new-title-chinese")
datasets
#>>>DatasetDict({
#>>>    train: Dataset({
#>>>        features: ['title', 'content'],
#>>>        num_rows: 5850
#>>>    })
#>>>    validation: Dataset({
#>>>        features: ['title', 'content'],
#>>>        num_rows: 1679
#>>>    })
#>>>})

# load a task in the datasets
boolq_dataset = load_dataset("super_glue", "boolq")

# load according to datasets partitioning
dataset = load_dataset("madao33/new-title-chinese", split="train")

# load slices of the dataset
dataset = load_dataset("madao33/new-title-chinese", split="train[10:100]")
# or
dataset = load_dataset("madao33/new-title-chinese", split="train[:50%]")

# load datasets as a list
dataset = load_dataset("madao33/new-title-chinese", split=["train[:50%]", "train[50%:]"])
#>>>[Dataset({
#>>>     features: ['title', 'content'],
#>>>     num_rows: 2925
#>>> }),
#>>> Dataset({
#>>>     features: ['title', 'content'],
#>>>     num_rows: 2925
#>>> })]

# View a piece of train data
datasets["train"][0]

# View some pieces of train data
datasets["train"][:2]

# View some pieces of train title data
datasets["train"]["title"][:5]

# view cols of train data
datasets["train"].column_names

# dataset split
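dataset = datasets["train"]
dataset.train_test_split(test_size=0.1)

# stratified split: stratify_by_column must name a ClassLabel column; the news dataset
# above only has 'title' and 'content', so the BoolQ "label" column is used here instead
boolq_dataset["train"].train_test_split(test_size=0.1, stratify_by_column="label")

# data select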
datasets["train"].select([0, 1])

# data filter
filter_dataset = datasets["train"].filter(lambda example: "中国" in example["title"])

# data mapping
def add_prefix(example):
    example["title"] = 'Prefix: ' + example["title"]
    return example

prefix_dataset = datasets.map(add_prefix)
prefix_dataset["train"][:10]["title"]

# data save
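# processed_datasets is not defined earlier in this snippet; the mapped dataset stands in for it here
processed_datasets = prefix_dataset
processed_datasets.save_to_disk("./processed_data")

# data load
from datasets import load_from_disk
processed_datasets = load_from_disk("./processed_data")

# load datasets from csv
dataset = load_dataset("csv", data_files="./ChnSentiCorp_htl_all.csv", split="train")

# Other data loading methods
# from a pandas DataFrame
import pandas as pd
from datasets import Dataset
data = pd.read_csv("./ChnSentiCorp_htl_all.csv")
dataset = Dataset.from_pandas(data)

# from a JSON file, reading the records stored under its top-level "data" field
load_dataset("json", data_files="./cmrc2018_trial.json", field="data")

# from a local loading script (recent datasets releases may require trust_remote_code=True, or may not support scripts at all)
dataset = load_dataset("./load_script.py", split="train")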
