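The simplest way to create a dataloader in `timm` is to call the `create_loader` function in `timm.data.loader`. It expects a `dataset` object, an `input_size` parameter and a `batch_size`. All other arguments have default values to make it easy to use. Let's take a quick look at an example of creating a dataloader with `timm`.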

Example usage

!tree ../../imagenette2-320/ -d

../../imagenette2-320/
├── train
│   ├── n01440764
│   ├── n02102040
│   ├── n02979186
│   ├── n03000684
│   ├── n03028079
│   ├── n03394916
│   ├── n03417042
│   ├── n03425413
│   ├── n03445777
│   └── n03888257
└── val
    ├── n01440764
    ├── n02102040
    ├── n02979186
    ├── n03000684
    ├── n03028079
    ├── n03394916
    ├── n03417042
    ├── n03425413
    ├── n03445777
    └── n03888257

22 directories

from timm.data.dataset import ImageDataset

dataset = ImageDataset('../../imagenette2-320/')
dataset[0]

(<PIL.Image.Image image mode=RGB size=426x320 at 0x7F8379C26190>, 0)

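Great, we have created our dataset. `ImageDataset` in `timm` is very similar to `torchvision.datasets.ImageFolder`, with some nice added features. Let's visualize the first image in our dataset. As expected, it is an image of a tench! ;)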

Note: By default, the dataset created above is for the train folder, so we can refer to it as the training dataset.

from matplotlib import pyplot as plt

# visualize image
plt.imshow(dataset[0][0])

<matplotlib.image.AxesImage at 0x7f83702a7bd0>

Now let's create our DataLoader.

from timm.data.loader import create_loader

try:
    # the default prefetcher only works if a GPU is present on the machine
    train_loader = create_loader(dataset, (3, 224, 224), 4)
except Exception:
    # fall back to a regular DataLoader on CPU-only machines
    train_loader = create_loader(dataset, (3, 224, 224), 4, use_prefetcher=False)

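At this point you might ask: why is there a `try-except` block above? What is the difference between the first `train_loader` and the second? What is the `use_prefetcher` argument, and what does it do?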

Prefetch loader

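Internally, `timm` has a class called `PrefetchLoader`. By default, this prefetch loader is used to create our dataloader. However, it only works on GPU-enabled machines. Since my machine has a GPU, my `train_loader` is an instance of the `PrefetchLoader` class.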

train_loader

<timm.data.loader.PrefetchLoader at 0x7f836fd8c9d0>

Note: If you run this notebook on a CPU-only machine, `train_loader` will instead be an instance of `torch.utils.data.DataLoader`.

Now let's see what this `PrefetchLoader` does. All the interesting parts happen inside the `__iter__` method of this class.

def __iter__(self):
    stream = torch.cuda.Stream()
    first = True

    for next_input, next_target in self.loader:
        with torch.cuda.stream(stream):
            next_input = next_input.cuda(non_blocking=True)
            next_target = next_target.cuda(non_blocking=True)
            if self.fp16:
                next_input = next_input.half().sub_(self.mean).div_(self.std)
            else:
                next_input = next_input.float().sub_(self.mean).div_(self.std)
            if self.random_erasing is not None:
                next_input = self.random_erasing(next_input)

        if not first:
            yield input, target
        else:
            first = False

        torch.cuda.current_stream().wait_stream(stream)
        input = next_input
        target = next_target

    yield input, target

Let's try to understand what is actually happening. To understand the `__iter__` method of `PrefetchLoader`, we only need to understand CUDA streams (`torch.cuda.Stream`).

From the PyTorch documentation:

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own “default” stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used.


When the “current stream” is the default stream, PyTorch automatically performs necessary synchronization when data is moved around. However, when using non-default streams, it is the user’s responsibility to ensure proper synchronization.

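Simply put, a "stream" is a sequence of commands that run in order on a CUDA device, and a single device can have more than one stream. Commands within one stream execute in the order they were issued, but separate streams are not synchronized with each other. So it is possible that command-1 is running on one stream while command-3 is running on another.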

But why does this matter? Can "streams" be used to make our dataloader faster?

Of course! That is the whole point! This is how Ross describes the core idea behind `PrefetchLoader`:

"The prefetching with async cuda transfer helps a little to reduce likelihood of the batch transfer to GPU stalling by (hopefully) initiating it sooner and giving it more flexibility to operate in its own cuda stream concurrently with other ops."

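Basically, the "move to CUDA" step is performed in a separate "stream" of its own rather than in the default stream. This means the step can run asynchronously while other operations are in progress on the CPU or on the default "stream". This helps speed things up a little, because the data is already available on `CUDA` by the time the model needs it.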
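As a minimal sketch of this idea, assuming PyTorch is installed, the helper below copies a batch to the GPU inside a side stream. The name `to_device_async` is hypothetical and not part of `timm`; on a CPU-only machine it simply returns the batch unchanged.

```python
import torch


def to_device_async(batch: torch.Tensor) -> torch.Tensor:
    """Copy `batch` to the GPU on a side stream (hypothetical helper, not part of timm)."""
    if not torch.cuda.is_available():
        # CPU-only machine: there are no CUDA streams, so return the batch as-is.
        return batch

    stream = torch.cuda.Stream()  # a non-default stream on the current device
    with torch.cuda.stream(stream):
        # Pinned host memory lets the copy run asynchronously with non_blocking=True.
        gpu_batch = batch.pin_memory().cuda(non_blocking=True)

    # Make the default stream wait for the copy before the tensor is used there.
    torch.cuda.current_stream().wait_stream(stream)
    return gpu_batch


batch = torch.arange(6, dtype=torch.float32).reshape(2, 3)
result = to_device_async(batch)
print(result.device, tuple(result.shape))
```

The `wait_stream` call plays the same role as in `PrefetchLoader`: work queued on the default stream afterwards will not start until the copy on the side stream has finished.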

And this is what happens inside the `__iter__` method.

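A single "stream" is created once with `torch.cuda.Stream()`. For the first batch, we iterate over the loader just as we normally would with `torch.utils.data.DataLoader`, transfer and normalize the batch inside that stream, and store it as `input` and `target` without yielding anything yet.

For every batch after that, the `CUDA` transfer of `next_input` and `next_target` is first queued asynchronously inside `with torch.cuda.stream(stream):`, and then the previously prepared `input` and `target` are yielded. While the training loop works on that batch, the next one is already on its way to the GPU. Once the loop ends, the last batch is yielded.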

So every time we iterate over the dataloader, we actually get back a prefetched `input` and `target`, which is where the name `PrefetchLoader` comes from.
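The control flow of `__iter__` can be tried out without a GPU. The sketch below is a plain-Python imitation, not `timm` code: `prefetch_iter` and `prepare` are hypothetical names, and `prepare` stands in for the asynchronous transfer and normalization. Unlike the excerpt above, it also guards against an empty loader.

```python
def prefetch_iter(loader, prepare):
    """Yield prepared items one step behind the loader (imitates PrefetchLoader.__iter__)."""
    first = True
    current = None
    for raw in loader:
        # Stand-in for the async .cuda() copy and normalization of the next batch.
        next_item = prepare(raw)

        if not first:
            # Hand out the batch prepared during the previous iteration.
            yield current
        else:
            first = False

        current = next_item

    if not first:
        # The last prepared batch has not been handed out yet.
        yield current


events = []


def prepare(x):
    events.append(f"prepare {x}")
    return x * 10


out = []
for item in prefetch_iter([1, 2, 3], prepare):
    events.append(f"consume {item}")
    out.append(item)

print(events)
```

The recorded events show that batch 2 is prepared before batch 1 is consumed, which is the one-batch lookahead that gives the loader its name.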