AWS Announces Amazon S3 Plugin For PyTorch
AWS recently announced the Amazon S3 plugin for PyTorch, an open-source library for streaming data from Amazon Simple Storage Service (Amazon S3) into the PyTorch deep learning framework.
With this feature available in PyTorch Deep Learning Containers, one can use data from S3 buckets directly with the PyTorch dataset and data loader APIs, without first downloading it to local storage.
The plugin can also transfer data from Amazon S3 in parallel when higher throughput is needed, so you do not have to manage thread safety or multiple connections to Amazon S3 yourself. You can also stream data from .zip or .tar archives and shuffle the dataset within or across shards as required.
The Amazon S3 plugin for PyTorch offers the following benefits:
Support for both map-style and iterable-style dataset interfaces – PyTorch supports two types of datasets, and the Amazon S3 plugin for PyTorch lets you use either interface based on your needs:
- Map-style dataset – Represents a map from indexes or keys to data samples. It provides random access capabilities.
- Iterable-style dataset – Represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable and where the batch size depends on the fetched data.
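To make the difference between the two interfaces concrete, here is a minimal sketch in plain Python. It does not use the plugin or PyTorch itself: the class names and the in-memory FAKE_BUCKET dictionary are hypothetical stand-ins, and the only assumption is the general protocol (a map-style dataset exposes __getitem__ and __len__, an iterable-style dataset exposes __iter__).

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory stand-in for a bucket: object key -> raw bytes.
FAKE_BUCKET: Dict[str, bytes] = {
    "train/a.bin": b"alpha",
    "train/b.bin": b"beta",
    "train/c.bin": b"gamma",
}


class MapStyleBlobs:
    """Map-style sketch: random access by index via __getitem__ and __len__."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store
        self._keys: List[str] = sorted(store)

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        key = self._keys[index]
        return key, self._store[key]


class IterableStyleBlobs:
    """Iterable-style sketch: samples are only reachable by streaming through them."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def __iter__(self) -> Iterator[Tuple[str, bytes]]:
        for key in sorted(self._store):
            yield key, self._store[key]


map_ds = MapStyleBlobs(FAKE_BUCKET)
iter_ds = IterableStyleBlobs(FAKE_BUCKET)

second_item = map_ds[1]                       # jump straight to one sample
streamed_keys = [key for key, _ in iter_ds]   # visit samples in stream order
```

In real training code these classes would typically subclass PyTorch's Dataset and IterableDataset so that a data loader can consume them; that step is left out here to keep the sketch dependency-free.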
Support for various data formats – Training data can be in a variety of different formats, such as CSV, Parquet, and JPEG. This plugin is file-format agnostic and presents objects in Amazon S3 as a binary buffer (blob). Thus, you can apply any additional transformations to the data received from Amazon S3.
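Because the plugin hands back raw bytes, decoding is left to the caller. As an illustration only, the sketch below turns a CSV blob into a list of row dictionaries using the Python standard library. The function name csv_blob_to_rows and the sample bytes are hypothetical, and the blob is hard-coded rather than fetched from Amazon S3.

```python
import csv
import io
from typing import Dict, List


def csv_blob_to_rows(blob: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Decode a binary CSV blob and parse it into one dict per data row."""
    text = blob.decode(encoding)
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Hypothetical bytes, standing in for an object body received from storage.
sample_blob = b"label,score\ncat,0.9\ndog,0.7\n"
rows = csv_blob_to_rows(sample_blob)
```

The same pattern applies to other formats: wrap the bytes in a file-like object such as io.BytesIO and pass it to whichever decoder matches the data.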
Support for shuffling – In deep learning, you may need to shuffle data across and within shards to reduce variance. This plugin provides a way to shuffle data in-memory within shards using ShuffleDataset or across shards by providing the input parameter shuffle_urls while extending S3IterableDataset.
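The two levels of shuffling can be sketched in plain Python as follows. This is not the plugin's implementation: shuffle_shard_order and buffered_shuffle are hypothetical helpers, the shard URLs are made up, and the buffered approach is just one common way to shuffle a stream in memory without loading all of it at once.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_shard_order(urls: List[str], rng: random.Random) -> List[str]:
    """Across-shard shuffling: return a copy of the shard list in random order."""
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled


def buffered_shuffle(
    samples: Iterable[T], buffer_size: int, rng: random.Random
) -> Iterator[T]:
    """Within-shard shuffling: keep a bounded buffer and emit random members."""
    buffer: List[T] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end, then pop it in O(1).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical shard URLs, for illustration only.
shard_urls = [
    "s3://example-bucket/shard-0.tar",
    "s3://example-bucket/shard-1.tar",
    "s3://example-bucket/shard-2.tar",
]
shard_order = shuffle_shard_order(shard_urls, rng)
shuffled_samples = list(buffered_shuffle(range(10), buffer_size=4, rng=rng))
```

A larger buffer gives a more thorough shuffle at the cost of memory; a buffer as large as the shard is equivalent to a full in-memory shuffle.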
The configuration, the library and more detailed information are available from AWS.
A few days earlier, Amazon Web Services announced the general availability of Amazon FSx for NetApp ONTAP, a new storage service that allows customers to launch and run complete, fully managed NetApp ONTAP file systems in the cloud for the first time.