Print only the data from a form: what is it? (2024)

Question

Documentaries (though they often include photos or video portions that can be considered primary sources).

Main contents

When is a Primary Source a Secondary Source?
Dataset Types
Map-style datasets
Iterable-style datasets
Data Loading Order and Sampler
Loading Batched and Non-Batched Data
Automatic batching (default)
Disable automatic batching
Working with collate_fn
Single- and Multi-process Data Loading
Single-process data loading (default)
Multi-process data loading
Memory Pinning

When is a Primary Source a Secondary Source?

Whether something is a primary or secondary source often depends upon the topic and its use.

A biology textbook would be considered a secondary source in the field of biology, since it describes and interprets the science but makes no original contribution to it.

On the other hand, if the topic is science education and the history of textbooks, textbooks could be used as primary sources to look at how they have changed over time.

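At the heart of the PyTorch data loading utility is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for

map-style and iterable-style datasets,
customizing data loading order,
automatic batching,
single- and multi-process data loading,
automatic memory pinning.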

These options are configured by the constructor arguments of a DataLoader, which has the signature:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

The sections below describe in detail the effects and usage of these options.
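As a quick orientation, here is a minimal sketch of constructing and iterating a DataLoader. The toy tensors and the variable names are assumptions made for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (illustrative): 10 samples with 3 features each, one integer label per sample.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# batch_size and shuffle are two of the constructor arguments in the signature.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# The last batch is smaller because drop_last defaults to False.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches followed by one batch of 2 samples.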

Dataset Types

The most important argument of the DataLoader constructor is dataset, which indicates a dataset object to load data from. PyTorch supports two different types of datasets:

map-style datasets,
iterable-style datasets.

Map-style datasets

A map-style dataset is one that implements the __getitem__() and __len__() protocols, and represents a map from (possibly non-integral) indices/keys to data samples.

For example, such a dataset, when accessed with dataset[idx], could read the idx-th image and its corresponding label from a folder on the disk.

See Dataset for more details.
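A minimal sketch of a map-style dataset follows. The class name SquaresDataset and its contents are hypothetical; the only requirement illustrated is the pair of __getitem__() and __len__() methods.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: index i maps to the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples available through integer indexing.
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return torch.tensor(idx), torch.tensor(idx ** 2)


dataset = SquaresDataset(5)
x, y = dataset[3]
print(len(dataset), x.item(), y.item())
```

A real dataset would typically replace the arithmetic in __getitem__() with a file read or a database lookup keyed by idx.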

Iterable-style datasets

An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.

For example, such a dataset, when called with iter(dataset), could return a stream of data read from a database, a remote server, or even logs generated in real time.

See IterableDataset for more details.
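A minimal sketch of an iterable-style dataset follows. The class name LineStream is hypothetical, and an in-memory list stands in for a real stream such as a log file or a network source.

```python
from torch.utils.data import IterableDataset


class LineStream(IterableDataset):
    """Hypothetical iterable-style dataset that streams items from any iterable source."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # Samples are produced sequentially; there is no notion of an index.
        for line in self.lines:
            yield line.strip()


stream = LineStream(["alpha\n", "beta\n", "gamma\n"])
items = list(iter(stream))
print(items)
```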

Data Loading Order and Sampler

For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable. This allows easier implementations of chunk-reading and dynamic batch size (e.g., by yielding a batched sample each time).

The rest of this section concerns the case with map-style datasets. torch.utils.data.Sampler classes are used to specify the sequence of indices/keys used in data loading. They represent iterable objects over the indices to datasets. E.g., in the common case with stochastic gradient descent (SGD), a Sampler could randomly permute a list of indices and yield each one at a time, or yield a small number of them for mini-batch SGD.

A sequential or shuffled sampler will be automatically constructed based on the shuffle argument to a DataLoader. Alternatively, users may use the sampler argument to specify a custom Sampler object that at each time yields the next index/key to fetch.

A custom Sampler that yields a list of batch indices at a time can be passed as the batch_sampler argument. Automatic batching can also be enabled via the batch_size and drop_last arguments. See the next section for more details on this.

Note

Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.
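The sampler options can be sketched as follows for a map-style dataset. ReverseSampler is a hypothetical custom sampler written for this illustration; SequentialSampler and BatchSampler are the built-in classes.

```python
from torch.utils.data import BatchSampler, DataLoader, Sampler, SequentialSampler


class ReverseSampler(Sampler):
    """Hypothetical custom sampler that yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


data = list(range(5))  # a plain list already behaves like a map-style dataset

sequential = list(SequentialSampler(data))
reversed_order = list(ReverseSampler(data))

# A batch sampler yields a list of indices at a time.
batches = list(BatchSampler(SequentialSampler(data), batch_size=2, drop_last=False))

# Passing the custom sampler to a DataLoader controls the loading order.
loader_batches = [b.tolist() for b in DataLoader(data, sampler=ReverseSampler(data), batch_size=2)]
print(sequential, reversed_order, batches, loader_batches)
```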

Loading Batched and Non-Batched Data

DataLoader supports automatically collating individual fetched data samples into batches via the arguments batch_size, drop_last, batch_sampler, and collate_fn (which has a default function).

Automatic batching (default)

This is the most common case, and corresponds to fetching a minibatch of data and collating them into batched samples, i.e., containing Tensors with one dimension being the batch dimension (usually the first).

When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The batch_size and drop_last arguments are used to specify how the data loader obtains batches of dataset keys. For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time.

Note

The batch_size and drop_last arguments essentially are used to construct a batch_sampler from sampler. For map-style datasets, the sampler is either provided by the user or constructed based on the shuffle argument. For iterable-style datasets, the sampler is a dummy infinite one. See the Data Loading Order and Sampler section for more details on samplers.

After fetching a list of samples using the indices from sampler, the function passed as the collate_fn argument is used to collate lists of samples into batches.

In this case, loading from a map-style dataset is roughly equivalent to:

for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])

and loading from an iterable-style dataset is roughly equivalent to:

dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])
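To make the map-style equivalence concrete, the sketch below compares a hand-written batching loop with what a DataLoader yields. The helper manual_batches and the toy dataset are assumptions for illustration; this is not how the library is implemented internally.

```python
import torch
from torch.utils.data import (BatchSampler, DataLoader, SequentialSampler,
                              TensorDataset, default_collate)

dataset = TensorDataset(torch.arange(10, dtype=torch.float32).view(5, 2),
                        torch.arange(5))


def manual_batches(dataset, batch_sampler, collate_fn):
    """Hand-written version of the map-style batching loop (illustrative helper)."""
    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])


batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=2, drop_last=False)
manual = list(manual_batches(dataset, batch_sampler, default_collate))
automatic = list(DataLoader(dataset, batch_size=2))

# Compare inputs and targets batch by batch.
same = len(manual) == len(automatic) and all(
    torch.equal(m[0], a[0]) and torch.equal(m[1], a[1])
    for m, a in zip(manual, automatic)
)
print(len(automatic), same)
```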

A custom collate_fn can be used to customize collation, e.g., padding sequential data to the max length of a batch. See the Working with collate_fn section for more about collate_fn.

Disable automatic batching

In certain cases, users may want to handle batching manually in dataset code, or simply load individual samples. For example, it could be cheaper to directly load batched data (e.g., bulk reads from a database or reading continuous chunks of memory), or the batch size is data dependent, or the program is designed to work on individual samples. Under these scenarios, it’s likely better to not use automatic batching (where collate_fn is used to collate the samples), but let the data loader directly return each member of the dataset object.

When both batch_size and batch_sampler are None (the default value for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is processed with the function passed as the collate_fn argument.

When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch Tensors, and keeps everything else untouched.

In this case, loading from a map-style dataset is roughly equivalent to:

for index in sampler:
    yield collate_fn(dataset[index])

and loading from an iterable-style dataset is roughly equivalent to:

for data in iter(dataset):
    yield collate_fn(data)
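The sketch below illustrates disabled automatic batching with a dataset that already returns whole chunks. ChunkDataset is a hypothetical class standing in for a bulk read from a database or file.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ChunkDataset(Dataset):
    """Hypothetical dataset whose items are already batches (chunks of rows)."""

    def __init__(self, table, chunk_size):
        self.table = table
        self.chunk_size = chunk_size

    def __len__(self):
        # Number of chunks, rounding up so the final partial chunk is included.
        return (len(self.table) + self.chunk_size - 1) // self.chunk_size

    def __getitem__(self, idx):
        start = idx * self.chunk_size
        return self.table[start:start + self.chunk_size]


table = torch.arange(30, dtype=torch.float32).view(10, 3)

# batch_size=None disables automatic batching: each chunk is yielded as-is.
loader = DataLoader(ChunkDataset(table, chunk_size=4), batch_size=None)
chunk_shapes = [tuple(chunk.shape) for chunk in loader]
print(chunk_shapes)
```

No extra batch dimension is prepended here, because the loader passes each item through individually instead of collating several items together.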

See the Working with collate_fn section for more about collate_fn.

Working with collate_fn

The use of collate_fn is slightly different when automatic batching is enabled or disabled.

When automatic batching is disabled, collate_fn is called with each individual data sample, and the output is yielded from the data loader iterator. In this case, the default collate_fn simply converts NumPy arrays into PyTorch tensors.

When automatic batching is enabled, collate_fn is called with a list of data samples at each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator. The rest of this section describes the behavior of the default collate_fn (default_collate()).

For instance, if each data sample consists of a 3-channel image and an integral class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple of a batched image tensor and a batched class label Tensor. In particular, the default collate_fn has the following properties:

It always prepends a new dimension as the batch dimension.
It automatically converts NumPy arrays and Python numerical values into PyTorch Tensors.
It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values cannot be converted into Tensors). The same holds for lists, tuples, namedtuples, etc.
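The properties listed above can be observed by calling default_collate directly on a small list of samples. The dictionary keys and tensor shapes below are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import default_collate

samples = [
    {"image": torch.zeros(3, 4, 4), "label": 0, "name": "a"},
    {"image": torch.ones(3, 4, 4), "label": 1, "name": "b"},
]

batch = default_collate(samples)

# Tensors gain a leading batch dimension, Python numbers become a Tensor,
# and strings, which cannot be converted, stay in a list.
print(tuple(batch["image"].shape), batch["label"], batch["name"])
```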

Users may use a customized collate_fn to achieve custom batching, e.g., collating along a dimension other than the first, padding sequences of various lengths, or adding support for custom data types.
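As one possible custom collate_fn, the sketch below pads variable-length sequences to the longest sequence in each batch. The function name pad_collate and the choice to return the original lengths alongside the padded tensor are assumptions of this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Pad variable-length 1-D sequences to the longest one in the batch."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)

padded, lengths = next(iter(loader))
print(padded)
print(lengths)
```

Returning the lengths lets downstream code mask out the padding positions if needed.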

If you run into a situation where the outputs of DataLoader have dimensions or types that differ from your expectation, you may want to check your collate_fn.

Single- and Multi-process Data Loading

A DataLoader uses single-process data loading by default.

Within a Python process, the Global Interpreter Lock (GIL) prevents truly parallelizing Python code across threads. To avoid blocking computation code with data loading, PyTorch provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.

Single-process data loading (default)

In this mode, data fetching is done in the same process in which a DataLoader is initialized. Therefore, data loading may block computing. However, this mode may be preferred when the resources used for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely in memory. Additionally, single-process loading often shows more readable error traces and thus is useful for debugging.

Multi-process data loading

Setting the argument num_workers to a positive integer will turn on multi-process data loading with the specified number of loader worker processes.

Warning

After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is number of workers * size of parent process). The simplest workaround is to replace Python objects with non-refcounted representations such as Pandas, NumPy or PyArrow objects. Check out the related PyTorch GitHub issue for more details on why this occurs and example code for how to work around these problems.

In this mode, each time an iterator of a DataLoader is created (e.g., when you call enumerate(dataloader)), num_workers worker processes are created. At this point, the dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize and fetch data. This means that dataset access, together with its internal IO and transforms (including collate_fn), runs in the worker process.

torch.utils.data.get_worker_info() returns various useful information in a worker process (including the worker id, dataset replica, initial seed, etc.), and returns None in the main process. Users may use this function in dataset code and/or worker_init_fn to individually configure each dataset replica, and to determine whether the code is running in a worker process. For example, this can be particularly helpful in sharding the dataset.
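One way to shard an iterable-style dataset with get_worker_info() is sketched below. The helper shard_bounds and the class RangeStream are hypothetical names for this illustration, and the demonstration at the end runs in the main process only, where get_worker_info() returns None.

```python
import math

from torch.utils.data import IterableDataset, get_worker_info


def shard_bounds(start, end, worker_id, num_workers):
    """Split [start, end) into contiguous, near-equal shards and return one of them."""
    per_worker = math.ceil((end - start) / num_workers)
    shard_start = start + worker_id * per_worker
    return shard_start, min(shard_start + per_worker, end)


class RangeStream(IterableDataset):
    """Hypothetical iterable-style dataset over the integers in [start, end)."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Main process (single-process loading): iterate over everything.
            lo, hi = self.start, self.end
        else:
            # Worker process: iterate only over this worker's shard.
            lo, hi = shard_bounds(self.start, self.end, info.id, info.num_workers)
        return iter(range(lo, hi))


single_process = list(RangeStream(0, 7))
shards = [shard_bounds(0, 7, worker_id, 3) for worker_id in range(3)]
print(single_process, shards)
```

With three workers, each replica would cover one of the printed shards, so no sample is produced twice.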

For map-style datasets, the main process generates the indices using sampler and sends them to the workers. So any shuffle randomization is done in the main process, which guides loading by assigning indices to load.

For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will often result in duplicated data. Using torch.utils.data.get_worker_info() and/or worker_init_fn, users may configure each replica independently. (See the IterableDataset documentation for how to achieve this.) For similar reasons, in multi-process loading, the drop_last argument drops the last non-full batch of each worker’s iterable-style dataset replica.

Workers are shut down once the end of the iteration is reached, or when the iterator becomes garbage collected.

Warning

It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing (see CUDA in multiprocessing). Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.

Platform-specific behaviors

Since workers rely on Python multiprocessing, worker launch behavior is different on Windows compared to Unix.

On Unix, fork() is the default multiprocessing start method. Using fork(), child workers typically can access the dataset and Python argument functions directly through the cloned address space.
On Windows or MacOS, spawn() is the default multiprocessing start method. Using spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.

This separate serialization means that you should take two steps to ensure you are compatible with Windows while using multi-process data loading:

Wrap most of your main script’s code within an if __name__ == '__main__': block, to make sure it doesn’t run again (most likely generating an error) when each worker process is launched. You can place your dataset and DataLoader instance creation logic here, as it doesn’t need to be re-executed in workers.
Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as top-level definitions, outside of the __main__ check. This ensures that they are available in worker processes. (This is needed since functions are pickled as references only, not bytecode.)
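The two steps above can be sketched as a script layout like the following. ToyDataset, stack_collate and main are hypothetical names; the point is only where each piece is defined relative to the __main__ guard.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Top-level dataset definition so worker processes can import it."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([idx, idx + 1], dtype=torch.float32)


def stack_collate(batch):
    """Top-level collate_fn, so it can be pickled by reference under spawn."""
    return torch.stack(batch, dim=0)


def main(num_workers=2):
    # Dataset and DataLoader creation happens only when main() is called,
    # which the guard below restricts to the launching process.
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=num_workers,
                        collate_fn=stack_collate)
    return [tuple(batch.shape) for batch in loader]


if __name__ == "__main__":
    print(main())
```

Keeping the loader construction inside main() means a spawned worker that re-imports the script sees only the definitions and never re-runs the loading loop.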

Randomness in multi-process data loading

By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG (thereby consuming an RNG state mandatorily) or a specified generator. However, seeds for other libraries may be duplicated upon initializing workers, causing each worker to return identical random numbers. (See the relevant section in the FAQ.)

In worker_init_fn, you may access the PyTorch seed set for each worker with either torch.utils.data.get_worker_info().seed or torch.initial_seed(), and use it to seed other libraries before data loading.
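A worker_init_fn along these lines might look like the sketch below. The function name seed_worker is an assumption, and only Python's random module is seeded here; other libraries could be seeded the same way. The loader is constructed but not iterated, so no worker processes are started in this snippet.

```python
import random

import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id):
    """worker_init_fn sketch: seed Python's `random` from the PyTorch seed."""
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)


# A dedicated generator makes base_seed, and hence the worker seeds, reproducible.
g = torch.Generator()
g.manual_seed(0)

# Iterating this loader would start the workers, each calling seed_worker once.
loader = DataLoader(list(range(8)), batch_size=2, num_workers=2,
                    worker_init_fn=seed_worker, generator=g)

# Demonstration in the main process: the same PyTorch seed gives the same `random` stream.
torch.manual_seed(123)
seed_worker(0)
first = random.random()
torch.manual_seed(123)
seed_worker(0)
second = random.random()
print(first == second)
```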

Memory Pinning

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. See the CUDA notes on using pinned memory buffers for more details on when and how to use pinned memory generally.

For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled GPUs.
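A typical usage pattern is sketched below, under the assumption that pinning is only requested when a CUDA device is available. On a CPU-only machine the same code runs without pinning.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))

# Only request pinning when a CUDA device is present.
loader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

moved = []
for inputs, targets in loader:
    # non_blocking=True lets the copy from pinned host memory overlap with computation.
    inputs = inputs.to(device, non_blocking=use_cuda)
    targets = targets.to(device, non_blocking=use_cuda)
    moved.append((inputs.device.type, tuple(inputs.shape)))

print(moved)
```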

The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).

See the example below.

Example:

import torch
from torch.utils.data import DataLoader, TensorDataset

class SimpleCustomBatch:
    def __init__(self, data):
        # data is a list of (input, target) pairs; regroup into inputs and targets
        transposed_data = list(zip(*data))
        self.inp = torch.stack(transposed_data[0], 0)
        self.tgt = torch.stack(transposed_data[1], 0)

    # custom memory pinning method on custom type
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())

class torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='')

Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.

The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

See the torch.utils.data documentation page for more details.

Parameters

dataset (Dataset) – dataset from which to load the data.
batch_size (int, optional) – how many samples per batch to load (default: 1).
shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

worker_init_fn (Callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

multiprocessing_context (str or multiprocessing.context.BaseContext, optional) – If None, the default multiprocessing context of your operating system will be used. (default: None)

generator (torch.Generator, optional) – If not None, this RNG will be used by RandomSampler to generate random indexes and by multiprocessing to generate base_seed for workers. (default: None)

prefetch_factor (, optional, keyword-only arg) – Number of batches loaded in advance by each worker. for indices in batch_sampler:
```
yield collate_fn([dataset[i] for i in indices])  
```
70 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is class SimpleCustomBatch:
```
def init(self, data):  
    transposed_data = list(zip(data))  
    self.inp = torch.stack(transposed_data[0], 0)  
    self.tgt = torch.stack(transposed_data[1], 0)  
# custom memory pinning method on custom type  
def pin_memory(self):  
    self.inp = self.inp.pin_memory()  
    self.tgt = self.tgt.pin_memory()  
    return self  
```
def collate_wrapper(batch):
return SimpleCustomBatch(batch)
inps = torch.arange(10 5, dtype=torch.float32).view(10, 5) tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5) dataset = TensorDataset(inps, tgts) loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
```
                pin_memory=True)  
```
for batch_ndx, sample in enumerate(loader):
```
print(sample.inp.is_pinned())  
print(sample.tgt.is_pinned())  
```
0. Otherwise, if value of for indices in batch_sampler:
```
yield collate_fn([dataset[i] for i in indices])  
```
72 default is for indices in batch_sampler:
```
yield collate_fn([dataset[i] for i in indices])  
```
70).
persistent_workers (, optional) – If for indices in batch_sampler:
```
yield collate_fn([dataset[i] for i in indices])  
```
44, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: for indices in batch_sampler:
```
yield collate_fn([dataset[i] for i in indices])  
```
45)

pin_memory_device (str, optional) – The device to `pin_memory` to if `pin_memory` is `True`.
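As a rough illustration of the `generator` argument described above, the sketch below seeds a dedicated `torch.Generator` so that shuffling is reproducible. The helper name `epoch_order` and the toy dataset are assumptions made for this example only. It uses the default `num_workers=0`, so `prefetch_factor` and `persistent_workers` are left at their defaults.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): eight samples holding the values 0..7.
dataset = TensorDataset(torch.arange(8))


def epoch_order(seed):
    """Return the batches of one shuffled pass, using a seeded generator."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
    # Each batch is a one-element list because TensorDataset yields 1-tuples.
    return [batch[0].tolist() for batch in loader]


first = epoch_order(0)
second = epoch_order(0)
print(first == second)  # same seed, same shuffle order
```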

Warning

If the `spawn` start method is used, `worker_init_fn` cannot be an unpicklable object, e.g., a lambda function. See the PyTorch multiprocessing best practices notes for more details related to multiprocessing in PyTorch.

Warning

`len(dataloader)` heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on `len(dataset) / batch_size`, with proper rounding depending on `drop_last`, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user code in correctly handling multi-process loading to avoid duplicate data.

However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when `drop_last` is set. Unfortunately, PyTorch cannot detect such cases in general.
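The rounding described in this warning can be checked with a small sketch. `CountingDataset` is a hypothetical iterable-style dataset that also defines `__len__`, which the length estimate relies on.

```python
from torch.utils.data import DataLoader, IterableDataset


class CountingDataset(IterableDataset):
    """Hypothetical iterable-style dataset yielding 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))

    def __len__(self):
        # Needed so that len(dataloader) can produce an estimate.
        return self.n


ds = CountingDataset(10)
keep_last = len(DataLoader(ds, batch_size=3))                  # rounds up
drop_last = len(DataLoader(ds, batch_size=3, drop_last=True))  # rounds down
print(keep_last, drop_last)
```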

See Dataset Types for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.

class torch.utils.data.Dataset(*args, **kwds)

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite `__getitem__()`, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite `__len__()`, which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement `__getitems__()`, to speed up loading of batched samples. This method accepts a list of indices of the samples in a batch and returns a list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
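A minimal sketch of a map-style dataset, assuming a made-up `SquaresDataset` whose sample for key `idx` is simply `idx * idx`:

```python
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: sample idx is idx squared."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Used by the default sampler to know which indices exist.
        return self.n

    def __getitem__(self, idx):
        return idx * idx


ds = SquaresDataset(5)
# The default collate function turns each list of ints into a tensor.
batches = [batch.tolist() for batch in DataLoader(ds, batch_size=2)]
print(batches)
```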

class torch.utils.data.IterableDataset(*args, **kwds)

An iterable Dataset.

All datasets that represent an iterable of data samples should subclass it. Such form of datasets is particularly useful when data come from a stream.

All subclasses should overwrite `__iter__()`, which would return an iterator of samples in this dataset.

When a subclass is used with DataLoader, each item in the dataset will be yielded from the DataLoader iterator. When `num_workers > 0`, each worker process will have a different copy of the dataset object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. get_worker_info(), when called in a worker process, returns information about the worker. It can be used in either the dataset's `__iter__()` method or the DataLoader's `worker_init_fn` option to modify each copy's behavior.

Example 1: splitting workload across all workers in `__iter__()`:
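The code for this example did not survive on this page, so what follows is a hedged reconstruction rather than the original listing. The `shard_range` helper is an assumption introduced here to keep the split logic testable; the runnable part uses single-process loading only.

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info


def shard_range(start, end, worker_id, num_workers):
    """Return the [lo, hi) slice of [start, end) owned by one worker."""
    per_worker = int(math.ceil((end - start) / num_workers))
    lo = start + worker_id * per_worker
    return lo, min(lo + per_worker, end)


class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        assert end > start, "this example only works with end > start"
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: return the full range.
            lo, hi = self.start, self.end
        else:
            # In a worker process: return only this worker's share.
            lo, hi = shard_range(self.start, self.end, info.id, info.num_workers)
        return iter(range(lo, hi))


# Should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
single = [int(x) for x in DataLoader(ds, num_workers=0)]
print(single)
```

With `num_workers=2`, the same split has worker 0 fetch [3, 4] and worker 1 fetch [5, 6].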

Example 2: splitting workload across all workers using `worker_init_fn`:
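Again a hedged reconstruction, since the original listing is missing here. The dataset itself is unaware of workers; a `worker_init_fn` narrows each worker's copy. Only single-process loading is executed below, and the multi-process call is shown as a comment.

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        assert end > start, "this example only works with end > start"
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))


# Define a worker_init_fn that configures each dataset copy differently
def worker_init_fn(worker_id):
    info = get_worker_info()
    dataset = info.dataset  # the dataset copy in this worker process
    overall_start, overall_end = dataset.start, dataset.end
    per_worker = int(math.ceil((overall_end - overall_start) / info.num_workers))
    dataset.start = overall_start + info.id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)


# Should give the same set of data as range(3, 7), i.e., [3, 4, 5, 6].
ds = MyIterableDataset(start=3, end=7)

# Single-process loading
single = [int(x) for x in DataLoader(ds, num_workers=0)]
print(single)

# Multi-process loading with the custom worker_init_fn (not run here):
# DataLoader(ds, num_workers=2, worker_init_fn=worker_init_fn)
```

Loading directly with several workers and no `worker_init_fn` yields duplicate data, because every worker iterates over the full range.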

class torch.utils.data.TensorDataset(*tensors)

Dataset wrapping tensors.

Each sample will be retrieved by indexing tensors along the first dimension.

Parameters

*tensors (Tensor) – tensors that have the same size of the first dimension.

class torch.utils.data.StackDataset(*args, **kwargs)

Dataset as a stacking of multiple datasets.

This class is useful to assemble different parts of complex input data, given as datasets.

Example

Parameters

*args (Dataset) – Datasets for stacking returned as tuple.
**kwargs (Dataset) – Datasets for stacking returned as dict.

class torch.utils.data.ConcatDataset(datasets)

Dataset as a concatenation of multiple datasets.

This class is useful to assemble different existing datasets.

Parameters

datasets (sequence) – List of datasets to be concatenated
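A brief sketch of concatenation, using two made-up `TensorDataset` inputs: indices run through the first dataset and then continue into the second.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

a = TensorDataset(torch.arange(3))       # samples 0, 1, 2
b = TensorDataset(torch.arange(10, 12))  # samples 10, 11

combined = ConcatDataset([a, b])
values = [int(combined[i][0]) for i in range(len(combined))]
print(values)
```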

class torch.utils.data.ChainDataset(datasets)

Dataset for chaining multiple IterableDataset s.

This class is useful to assemble different existing dataset streams. The chaining operation is done on-the-fly, so concatenating large-scale datasets with this class will be efficient.

Parameters

datasets (iterable of IterableDataset) – datasets to be chained together

class torch.utils.data.Subset(dataset, indices)

Subset of a dataset at specified indices.

Parameters

dataset (Dataset) – The whole Dataset
indices (sequence) – Indices in the whole set selected for subset

torch.utils.data._utils.collate.collate(batch, *, collate_fn_map=None)

General collate function that handles collection type of element within each batch.

The function also opens function registry to deal with specific element types. default_collate_fn_map provides default collate functions for tensors, numpy arrays, numbers and strings.

Parameters

batch – a single batch to be collated
collate_fn_map (Optional[Dict[Union[Type, Tuple[Type, ...]], Callable]]) – Optional dictionary mapping from element type to the corresponding collate function. If the element type isn't present in this dictionary, this function will go through each key of the dictionary in the insertion order to invoke the corresponding collate function if the element type is a subclass of the key.

Examples
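The example below is pieced together from code fragments that appear elsewhere on this page; the final call on two small tensors is an addition made here to show the result.

```python
import torch
from torch.utils.data._utils.collate import collate, default_collate_fn_map


def collate_tensor_fn(batch, *, collate_fn_map):
    # Extend this function to handle batch of tensors
    return torch.stack(batch, 0)


def custom_collate(batch):
    collate_map = {torch.Tensor: collate_tensor_fn}
    return collate(batch, collate_fn_map=collate_map)


# Extend default_collate by in-place modifying default_collate_fn_map
default_collate_fn_map.update({torch.Tensor: collate_tensor_fn})

out = custom_collate([torch.tensor([1, 2]), torch.tensor([3, 4])])
print(out.shape)
```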

Note

Each collate function requires a positional argument for batch and a keyword argument for the dictionary of collate functions as collate_fn_map.

torch.utils.data.default_collate(batch)

Take in a batch of data and put the elements within the batch into a tensor with an additional outer dimension - batch size.

The exact output type can be a torch.Tensor, a Sequence of torch.Tensor, a Collection of torch.Tensor, or left unchanged, depending on the input type. This is used as the default function for collation when batch_size or batch_sampler is defined in DataLoader.

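Here is the general input type (based on the type of the element within the batch) to output type mapping:

torch.Tensor -> torch.Tensor (with an added outer dimension batch size)
NumPy arrays -> torch.Tensor
float -> torch.Tensor
int -> torch.Tensor
str -> str (unchanged)
bytes -> bytes (unchanged)
Mapping[K, V_i] -> Mapping[K, default_collate([V_1, V_2, …])]
NamedTuple[V1_i, V2_i, …] -> NamedTuple[default_collate([V1_1, V1_2, …]), default_collate([V2_1, V2_2, …]), …]
Sequence[V1_i, V2_i, …] -> Sequence[default_collate([V1_1, V1_2, …]), default_collate([V2_1, V2_2, …]), …]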

Parameters

batch – a single batch to be collated

Examples
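A short sketch of how `default_collate` treats a few common element types; the sample inputs are chosen for illustration.

```python
from torch.utils.data import default_collate

# A batch of ints becomes a 1-D tensor.
ints = default_collate([0, 1, 2, 3])

# A batch of strings is left as a list of strings.
strs = default_collate(["a", "b", "c"])

# A batch of dicts is collated key by key.
dicts = default_collate([{"A": 0, "B": 1}, {"A": 100, "B": 100}])

# A batch of tuples is transposed, then each position is collated.
pairs = default_collate([(0, 1), (2, 3)])

print(ints, strs, dicts, pairs)
```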

torch.utils.data.default_convert(data)

Convert each NumPy array element into a torch.Tensor.

If the input is a Sequence, Collection, or Mapping, it tries to convert each element inside to a torch.Tensor. If the input is not a NumPy array, it is left unchanged. This is used as the default function for collation when both batch_sampler and batch_size are NOT defined in DataLoader.

The general input type to output type mapping is similar to that of default_collate(). See the description there for more details.

Parameters

data – a single data point to be converted

Examples
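A minimal sketch of `default_convert`, assuming NumPy is installed: NumPy arrays are converted, while other values pass through unchanged.

```python
import numpy as np
import torch
from torch.utils.data import default_convert

# A plain int is not a NumPy array, so it is returned as-is.
same_int = default_convert(0)

# A NumPy array is converted to a tensor.
arr = default_convert(np.array([0, 1]))

# Containers are converted element by element.
seq = default_convert([np.array([0, 1]), np.array([2, 3])])

print(same_int, arr, seq)
```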

torch.utils.data.get_worker_info()

Returns the information about the current iterator worker process.

When called in a worker, this returns an object guaranteed to have the following attributes:

`id`: the current worker id.
`num_workers`: the total number of workers.
`seed`: the random seed set for the current worker. This value is determined by main process RNG and the worker id. See DataLoader's documentation for more details.
`dataset`: the copy of the dataset object in this process. Note that this will be a different object in a different process than the one in the main process.

When called in the main process, this returns `None`.

Note

When used in a `worker_init_fn` passed over to DataLoader, this method can be useful to set up each worker process differently, for instance, using `worker_id` to configure the `dataset` object to only read a specific fraction of a sharded dataset, or use `seed` to seed other libraries used in dataset code.

Return type

Optional[WorkerInfo]

torch.utils.data.random_split(dataset, lengths, generator=<torch._C.Generator object>)

Randomly split a dataset into non-overlapping new datasets of given lengths.

If a list of fractions that sum up to 1 is given, the lengths will be computed automatically as floor(frac * len(dataset)) for each fraction provided.

After computing the lengths, if there are any remainders, 1 count will be distributed in round-robin fashion to the lengths until there are no remainders left.

Optionally fix the generator for reproducible results, e.g.:

Example
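A sketch of both call styles, with a fixed generator for reproducibility. The seed value and dataset sizes are arbitrary choices for illustration, and passing fractions assumes a PyTorch version that supports them.

```python
import torch
from torch.utils.data import random_split

# Absolute lengths: split ten items into parts of 3 and 7.
generator1 = torch.Generator().manual_seed(42)
parts = random_split(range(10), [3, 7], generator=generator1)

# Fractions that sum to 1: lengths are derived from len(dataset).
generator2 = torch.Generator().manual_seed(42)
frac_parts = random_split(range(30), [0.3, 0.3, 0.4], generator=generator2)

print([len(p) for p in parts], [len(p) for p in frac_parts])
```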

Parameters

dataset (Dataset) – Dataset to be split
lengths (sequence) – lengths or fractions of splits to be produced
generator (Generator) – Generator used for the random permutation.

Return type

List[Subset[T]]

class torch.utils.data.Sampler(data_source=None)

Base class for all Samplers.

Every Sampler subclass has to provide an `__iter__()` method, providing a way to iterate over indices or lists of indices (batches) of dataset elements, and a `__len__()` method that returns the length of the returned iterators.

Parameters

data_source (Dataset) – This argument is not used and will be removed in 2.2.0. You may still have a custom implementation that utilizes it.

Example
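The original example is missing here, so the sketch below substitutes a simpler one: a hypothetical `ReverseSampler` that yields indices from last to first.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler yielding indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


data = [10, 20, 30, 40]
order = list(ReverseSampler(data))
batches = [
    batch.tolist()
    for batch in DataLoader(data, sampler=ReverseSampler(data), batch_size=2)
]
print(order, batches)
```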

Note

The `__len__()` method isn't strictly required by DataLoader, but is expected in any calculation involving the length of a DataLoader.

class torch.utils.data.SequentialSampler(data_source)

Samples elements sequentially, always in the same order.

Parameters

data_source (Dataset) – dataset to sample from

class torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None, generator=None)

Samples elements randomly. If without replacement, then sample from a shuffled dataset.

If with replacement, then user can specify `num_samples` to draw.

Parameters

data_source (Dataset) – dataset to sample from
replacement (bool) – samples are drawn on-demand with replacement if `True`, default=`False`
num_samples (int) – number of samples to draw, default=`len(dataset)`.
generator (Generator) – Generator used in sampling.

class torch.utils.data.SubsetRandomSampler(indices, generator=None)

Samples elements randomly from a given list of indices, without replacement.

Parameters

indices (sequence) – a sequence of indices
generator (Generator) – Generator used in sampling.

class torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True, generator=None)

Samples elements from `[0,..,len(weights)-1]` with given probabilities (weights).

Parameters

weights (sequence) – a sequence of weights, not necessarily summing up to one
num_samples (int) – number of samples to draw
replacement (bool) – if `True`, samples are drawn with replacement. If not, they are drawn without replacement, which means that when a sample index is drawn for a row, it cannot be drawn again for that row.
generator (Generator) – Generator used in sampling.

Example
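A sketch with made-up weights. The drawn indices depend on the random state, so no specific output is claimed; the only guarantees shown are the number of draws and, without replacement, their uniqueness.

```python
import torch
from torch.utils.data import WeightedRandomSampler

g = torch.Generator().manual_seed(0)
weights = [0.1, 0.9, 0.4, 0.7, 3.0, 0.6]

# With replacement, the same index may appear more than once.
with_repl = list(WeightedRandomSampler(weights, 5, replacement=True, generator=g))

# Without replacement, every drawn index is distinct.
without_repl = list(WeightedRandomSampler(weights, 5, replacement=False, generator=g))

print(with_repl, without_repl)
```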

class torch.utils.data.BatchSampler(sampler, batch_size, drop_last)

Wraps another sampler to yield a mini-batch of indices.

Parameters

sampler (Sampler or Iterable) – Base sampler. Can be any iterable object
batch_size (int) – Size of mini-batch.
drop_last (bool) – If `True`, the sampler will drop the last batch if its size would be less than `batch_size`

Example
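A short sketch showing the effect of `drop_last` when ten indices are grouped into batches of three:

```python
from torch.utils.data import BatchSampler, SequentialSampler

# Keep the final, smaller batch.
keep = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))

# Drop the final batch because it has fewer than batch_size indices.
drop = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))

print(keep)
print(drop)
```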

class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)

Sampler that restricts data loading to a subset of the dataset.

It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a `DistributedSampler` instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.

Note

Dataset is assumed to be of constant size and that any instance of it always returns the same elements in the same order.

Parameters

dataset – Dataset used for sampling.
num_replicas (int, optional) – Number of processes participating in distributed training. By default, `world_size` is retrieved from the current distributed group.
rank (int, optional) – Rank of the current process within `num_replicas`. By default, `rank` is retrieved from the current distributed group.
shuffle (bool, optional) – If `True` (default), sampler will shuffle the indices.
seed (int, optional) – random seed used to shuffle the sampler if `shuffle=True`. This number should be identical across all processes in the distributed group. Default: `0`.
drop_last (bool, optional) – if `True`, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. If `False`, the sampler will add extra indices to make the data evenly divisible across the replicas. Default: `False`.

Warning

In distributed mode, calling the `set_epoch()` method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will always be used.
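A sketch of the per-epoch `set_epoch()` call. To stay runnable without an initialized process group, it passes `num_replicas` and `rank` explicitly; in real distributed training these would normally come from the current distributed group. The toy dataset is an assumption for illustration.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

data = list(range(10))

# Pretend to be rank 0 of 2 replicas (hypothetical, no process group needed).
sampler = DistributedSampler(data, num_replicas=2, rank=0, shuffle=True, seed=0)
loader = DataLoader(data, batch_size=5, sampler=sampler)

epochs = []
for epoch in range(2):
    # Must be called before creating the iterator for this epoch.
    sampler.set_epoch(epoch)
    epochs.append([int(x) for batch in loader for x in batch])

print(epochs)
```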
