Navigating AI evolution with AWS Inferentia and Trainium

May 2024

With so many large language models in the news these days, do you ever wonder how they evolved and what impact they have? How are they changing our lives, and what do they need in order to deliver accurate, succinct results?

In the past decade, the field of Artificial Intelligence and Machine Learning (AI/ML) has undergone an extraordinary evolution, pushing the boundaries of what was once thought possible. Applications that have emerged during this period, such as speech recognition, recommender systems, image generation, and search optimization, showcase the remarkable impact of AI/ML on many aspects of our lives.

Additionally, Generative AI plays a crucial role in supporting organizational goals such as customer experience, revenue growth, cost optimization, and business continuity. Generative language models trace their roots to Word2Vec and GloVe, which established the use of word embeddings. The mid-2010s saw advances with recurrent architectures such as RNNs and LSTMs, improving contextual language understanding. The transformative shift, however, came with Transformer-based models like BERT and GPT. BERT introduced bidirectional context in 2018, while the GPT family, exemplified by GPT-3 with 175 billion parameters, showcased unprecedented capabilities in text generation, translation, and code completion. Newer models like LLaMA and BLOOM continue to push the boundaries of generative AI for diverse applications.

Factors contributing to AI/ML growth

Several catalysts underpin the rapid growth of GenAI and ML. Advancements in hardware, including GPUs and specialized AI accelerators, have substantially enhanced the speed and efficiency of training complex ML models. Continuous innovation in algorithms, epitomized by breakthroughs like deep learning and transformer architectures, has propelled the capabilities of AI/ML systems.

Furthermore, the widespread adoption of AI/ML across healthcare, finance, and manufacturing industries has fueled its expansive growth. Recognizing AI’s potential to enhance efficiency, enable data-driven decisions, and unlock new business opportunities, organizations have significantly increased investments in the field.

For a seamless AI/ML journey, the foundational pillars of technical robustness, security, and cost-effectiveness play a pivotal role. Addressing these aspects keeps systems resilient and underscores the importance of continuous monitoring, strong security measures, and cost control.

In this landscape of innovation and progress, hardware accelerators like AWS Inferentia and AWS Trainium emerge as critical components. In this blog, we delve into the intricacies and contributions of these AI accelerators crafted to navigate the complexities of large models in the ever-evolving realm of AI/ML.

AWS Inferentia: Empowering deep learning inference

For deep learning inference, AWS offers the Inferentia accelerators, designed to deliver high performance while keeping costs in check. In the spotlight is Inferentia2, a second generation that brings noteworthy enhancements to the table.

Performance:

Inferentia2 delivers up to 4x higher throughput and up to 10x lower latency compared to its predecessor, Inferentia1. The design is optimized for deploying intricate models, catering specifically to large language models (LLMs) and vision transformers.

What sets Inf2 instances apart is that they are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference. This enables the efficient, cost-effective deployment of models with hundreds of billions of parameters.
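To make this concrete, here is a minimal sketch of sharded LLM inference on Inf2 using the open-source transformers-neuronx library; the model checkpoint, tp_degree, and sequence length below are illustrative assumptions, not recommendations:

    # Shard an LLM across NeuronCores via tensor parallelism (tp_degree)
    import torch
    from transformers import AutoTokenizer
    from transformers_neuronx.llama.model import LlamaForSampling

    checkpoint = "openlm-research/open_llama_3b"   # illustrative model choice
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = LlamaForSampling.from_pretrained(
        checkpoint,
        batch_size=1,
        tp_degree=8,    # split the model weights across 8 NeuronCores
        amp="f16",      # FP16 weights and activations
    )
    model.to_neuron()   # compile and load the shards onto the NeuronCores

    input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
    with torch.inference_mode():
        tokens = model.sample(input_ids, sequence_length=128)
    print(tokenizer.decode(tokens[0]))

Larger tp_degree values spread the model across more cores, which is how models far bigger than a single device's memory can be served.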

Seamless integration:

Inferentia2 is supported by AWS Neuron, an SDK that integrates seamlessly with industry-standard ML frameworks like PyTorch and TensorFlow. This integration allows teams to adopt the ML chips quickly, without changes to the application or model.
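In practice, compiling a PyTorch model for Inferentia2 typically comes down to a single trace call. A minimal sketch, assuming torch-neuronx is installed on an Inf2 instance; the ResNet model and input shape are placeholders:

    # Compile a PyTorch model ahead of time for NeuronCores
    import torch
    import torch_neuronx
    from torchvision.models import resnet50

    model = resnet50(weights=None).eval()      # any traceable PyTorch model
    example = torch.rand(1, 3, 224, 224)       # example input for tracing

    # Returns a compiled TorchScript module that runs on Inferentia2
    neuron_model = torch_neuronx.trace(model, example)
    torch.jit.save(neuron_model, "resnet50_neuron.pt")

    # Inference uses the same forward() interface as the original model
    output = neuron_model(example)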

Benchmark results:

Benchmarking Inf2 against Inf1 and other inference-optimized Amazon EC2 instances reveals substantial gains: up to 4x higher throughput and 10x lower latency than EC2 Inf1 instances, along with a 50 percent improvement in performance per watt. These figures place Inferentia2 among the leading AI accelerators.

Source: AWS Blog

Sizing options for tailored workloads:

Inf2 instances are available in four sizes for those seeking to tailor AI workloads. Powered by up to 12 AWS Inferentia2 chips and 192 vCPUs, they offer up to 2.3 petaFLOPS of combined compute at BF16 or FP16 data types. The architecture features an ultra-high-speed NeuronLink interconnect between chips and up to 384 GB of shared accelerator memory.

Inferentia2 architecture:

Each Inferentia2 chip comprises two NeuronCore-v2 cores delivering a total of 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS. NeuronLink, with a speed of 384 GB/sec per device, facilitates sharding models across two or more cores. NeuronCore-v2 introduces a modular design with four independent engines: ScalarEngine, VectorEngine, TensorEngine, and GPSIMD-Engine, the latter being a novel addition.

AWS Inferentia2 also has larger and faster internal memory than AWS Inferentia1.

AWS Trainium: Accelerating deep learning training

Amazon Elastic Compute Cloud (EC2) Trn1 instances, fueled by AWS Trainium accelerators, stand as purpose-built solutions for high-performance deep learning (DL) training, catering to generative AI models, including large language models (LLMs) and latent diffusion models. The prominent features of Trn1 instances include:

  • Native support of a broad spectrum of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8
  • Hardware-accelerated stochastic rounding to ensure high performance and increased accuracy compared to legacy rounding modes
  • Seamless integration of the AWS Neuron SDK, supporting Trainium, with PyTorch and TensorFlow, allowing users to continue with their existing workflows (see the sketch after this list)
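As a minimal sketch of what that integration looks like, the following training step uses PyTorch Neuron (torch-neuronx), which builds on PyTorch/XLA; the toy model and random data are placeholders for a real workload:

    # A basic training step on Trainium via PyTorch/XLA
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                  # resolves to NeuronCores on Trn1
    model = torch.nn.Linear(512, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.rand(64, 512).to(device)    # stand-in for a real data loader
        y = torch.randint(0, 10, (64,)).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        xm.mark_step()                        # execute the pending XLA graph

Apart from acquiring the device through xm.xla_device() and marking the step boundary, this is an ordinary PyTorch training loop.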

Benchmarking against EC2 P4d instances:

  • 1.4x the teraFLOPS for BF16 data types
  • 2.5x the teraFLOPS for TF32 data types
  • 5x the teraFLOPS for FP32 data types
  • 4x the inter-node network bandwidth
  • Up to 50 percent cost-to-train savings
  • 150 percent higher throughput

Tailored sizing for your needs:

Trn1 instances are available in three sizes, powered by up to 16 AWS Trainium chips with 128 vCPUs. These instances provide high-performance networking and storage, supporting efficient data and model parallelism for distributed training.

Trainium architecture:

At the core of the largest Trn1 instance are 16 Trainium devices, each housing two NeuronCore-v2 cores and offering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS. Each Trainium device features:

  • Two NeuronCore-v2 compute cores
  • 32 GiB of device memory for storing model state
  • 1 TB/sec of DMA bandwidth with inline memory compression/decompression
  • NeuronLink-v2 for efficient device-to-device interconnect
  • Programmability supporting dynamic shapes and control flow, user-programmable rounding modes, and custom operators through the GPSIMD engines

Integration with ML frameworks: AWS Neuron SDK

AWS Neuron SDK facilitates high-performance DL acceleration. It supports training on Trainium-based EC2 Trn1 instances and offers low-latency inference on EC2 Inf1 and Inf2 instances powered by Inferentia accelerators. The integration ensures smooth workflow continuation with TensorFlow and PyTorch frameworks, requiring only a few lines of code changes.

Inferentia and Trainium pricing estimates

Currently, Inferentia and Trainium instances are available in a limited set of AWS Regions, such as US East (N. Virginia) and US West (Oregon). These instances come in various pricing options, including On-Demand, Reserved, and Spot Instances, or through a Savings Plan. As always with Amazon EC2, users are charged only for the resources consumed.

Instance type    vCPUs  Memory   Storage               On-Demand hourly cost (USD)
inf1.xlarge      4      8 GiB    EBS only              $0.22
inf1.2xlarge     8      16 GiB   EBS only              $0.36
inf1.6xlarge     24     48 GiB   EBS only              $1.18
inf2.8xlarge     32     128 GiB  EBS only              $1.96
trn1.2xlarge     8      32 GiB   1 x 475 GB NVMe SSD   $1.34
trn1.32xlarge    128    512 GiB  4 x 1900 GB NVMe SSD  $21.50
trn1n.32xlarge   128    512 GiB  4 x 1900 GB NVMe SSD  $24.78

For more details, please refer to the AWS pricing calculator: https://calculator.aws/#/createCalculator/ec2-enhancement
NVMe – Non-Volatile Memory Express; EBS – Amazon Elastic Block Store
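For quick budgeting, the On-Demand rates above can be turned into simple estimates. An illustrative sketch; rates change over time, so always confirm against the pricing calculator:

    # Back-of-envelope On-Demand cost estimates from the table above
    RATES_USD_PER_HOUR = {"inf2.8xlarge": 1.96, "trn1.32xlarge": 21.50}

    def on_demand_cost(instance: str, hours: float, count: int = 1) -> float:
        """Cost in USD for `count` instances running for `hours`."""
        return RATES_USD_PER_HOUR[instance] * hours * count

    # Example: a 3-day (72-hour) training run on one trn1.32xlarge
    print(f"${on_demand_cost('trn1.32xlarge', 72):,.2f}")   # $1,548.00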

Choosing AWS Inferentia and Trainium

Both Trainium and Inferentia accelerators are seamlessly integrated with Ray on Amazon EC2, enhancing scalability and performance for machine learning and generative AI workloads. The choice of AWS Trainium and AWS Inferentia depends on your specific use case and requirements. Each accelerator is designed for different purposes within the AI/ML ecosystem.

AWS Trainium is designed for high-performance deep learning training, making it suitable for tasks like training large language models (LLMs) and generative AI models. It offers native support for various data types, hardware-accelerated stochastic rounding, and seamless integration with popular ML frameworks like PyTorch and TensorFlow. Trainium instances deliver high teraFLOPS across different data types, cost-to-train savings, and increased throughput. They are optimized for distributed training with efficient data and model parallelism.

AWS Inferentia, on the other hand, is tailored for deep learning inference applications, providing high performance at a low cost. It supports deploying complex models, including large language models (LLMs) and vision transformers. AWS Inferentia instances are optimized for scale-out distributed inference, delivering high throughput and lower latency than previous versions. They are suitable for efficiently and cost-effectively deploying models with hundreds of billions of parameters.

As organizations embark on their GenAI journey, harnessing the power of these accelerators becomes instrumental in unlocking the full potential of AI applications. Impetus Technologies, with its commitment to innovation, stands at the forefront, leveraging GenAI services to drive meaningful transformations across industries. In a world increasingly shaped by AI, the GenAI accelerators pave the way for a future where intelligence and innovation converge seamlessly.

Authors:

Kumar Gaurav, Senior Technical Architect, GenAI Cloud & Data Engineering

Arujit Das
