Notes collection on NPUs and Edge devices (for self)

December 6, 2024

inference companies

Reading: https://substack.com/home/post/p-152106163, https://substack.com/@chipstrat

ASIC vs. SoC: which is better suited to my use case?

Edge stuff:

Quick summary from Claude: Let me break down what each of the edge inference companies shown in the image would specialize in for edge AI deployment:

Apple: Focuses on running AI models directly on iPhones, iPads, and other Apple devices, optimizing for their custom silicon (like the Neural Engine in their chips) to run AI workloads efficiently while maintaining privacy.

Google: Specializes in running TensorFlow Lite models on mobile and IoT devices, with specific optimizations for Android devices and their Edge TPU architecture.
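
As a concrete reference, here is a minimal sketch of what running a TensorFlow Lite model from Python looks like; the model path and the dummy input are placeholders, and on an actual device the lighter tflite_runtime package exposes the same Interpreter interface.

```python
import numpy as np
import tensorflow as tf

# Load a converted .tflite model (placeholder path) and allocate buffers.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference on a dummy input with the expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

result = interpreter.get_tensor(output_details[0]["index"])
print(result.shape)
```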

Qualcomm: Leverages their Snapdragon platforms and Neural Processing Units (NPUs) to run AI models on mobile devices, especially Android phones, with their AI Engine SDK.

NVIDIA: Provides Jetson platforms and TensorRT optimization software for running AI models on edge devices, particularly strong in computer vision applications.
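
For reference, a rough sketch of the TensorRT Python flow that turns an ONNX model into a serialized engine for a Jetson-class device. This assumes a TensorRT 8.x-style API; "model.onnx" and the FP16 flag are placeholders, not recommendations.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch network definition, populated by parsing an ONNX file.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder model
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

# Let the builder pick optimized kernels; allow reduced precision.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(engine_bytes)  # this engine is what gets deployed on the device
```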

Intel: Offers OpenVINO toolkit for optimizing and deploying AI models on their hardware, including CPUs and their Movidius vision processing units (VPUs).
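
Similarly, a small OpenVINO sketch, assuming the 2022-era Python runtime API and an already-converted IR model ("model.xml" is a placeholder); the device string would be "CPU", "GPU", or a VPU target depending on the hardware.

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")         # placeholder IR model
compiled = core.compile_model(model, "CPU")  # device string depends on target

# One inference with a dummy input shaped like the model's first input.
dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)
request = compiled.create_infer_request()
request.infer({0: dummy})
output = request.get_output_tensor(0).data
print(output.shape)
```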

AMD: Provides tools and solutions for running AI models on their CPUs and GPUs in edge devices, with specific optimizations for their hardware architecture.

Hailo: Creates specialized AI processors and software tools specifically designed for edge devices, focusing on high performance with low power consumption.

Untether AI: Develops at-memory computation architecture for AI acceleration at the edge, optimizing for both speed and energy efficiency.

DeGirum: Offers solutions for optimizing and deploying AI models on various edge devices with their software stack.

SiMa.ai: Provides purpose-built MLSoC (Machine Learning System-on-Chip) platform for embedded edge applications.

Blaize: Creates Graph Streaming Processor (GSP) architecture specifically for AI workloads at the edge, with their own software stack.

Axelera: Develops AI acceleration solutions focused on power-efficient edge deployment, particularly for industrial and IoT applications.

The key themes across these companies are:

  1. Hardware optimization for AI workloads
  2. Power efficiency for battery-operated devices
  3. Model compression and optimization (see the quantization sketch after this list)
  4. Support for real-time processing
  5. Focus on specific use cases like computer vision, natural language processing, or sensor data analysis
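
To make theme 3 concrete, a minimal sketch of post-training (dynamic-range) quantization with the TensorFlow Lite converter; the SavedModel directory is a placeholder, and a real deployment would usually also calibrate on representative data for full INT8.

```python
import tensorflow as tf

# Post-training quantization: store weights in 8-bit so the model fits
# tighter edge memory and compute budgets.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quantized = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_quantized)
```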

At first glance, Untether AI and Hailo seem to be what I'm looking for, but honestly I don't know yet what they actually do.

Hailo -> specialized AI accelerators and vision processors. What exactly do they do, and what is their advantage over the others?

On a side note: is doing both edge inference and cloud inference possible, with the latter as a user opt-in? (A rough sketch of this local-first pattern is further down, after the ASIC breakdown.)

Claude: Let me break down AI accelerators and vision processors within ASICs systematically:

AI Accelerators:

  • These are specialized circuits designed to speed up AI/ML computations, particularly matrix multiplications and convolutions
  • They typically include:
    1. Multiple Processing Elements (PEs) that perform parallel computations
    2. Local memory/buffers to store weights and activations
    3. Custom datapaths optimized for AI operations
    4. Specialized arithmetic units (often using lower precision like INT8 or FP16)
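
To make item 4 concrete, a toy numpy sketch of the scale-factor arithmetic behind INT8 matrix multiplication. Real accelerators do this in fixed-point hardware, usually with per-channel scales, so this shows only the math, not the implementation.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric linear quantization: float tensor -> int8 values plus a scale."""
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Float "weights" and "activations" standing in for one layer's tensors.
w = np.random.randn(64, 128).astype(np.float32)
a = np.random.randn(128, 32).astype(np.float32)

qw, sw = quantize(w)
qa, sa = quantize(a)

# The PE array multiplies int8 values, accumulates in int32,
# then rescales back to float with the product of the two scales.
acc = qw.astype(np.int32) @ qa.astype(np.int32)
approx = acc.astype(np.float32) * (sw * sa)

print(np.max(np.abs(approx - w @ a)))  # small quantization error
```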

Vision Processors:

  • Specialized circuits focused on image/video processing and computer vision tasks
  • Key components include:
    1. Image Signal Processors (ISP) for raw sensor data processing
    2. Vector processing units for pixel operations
    3. Dedicated hardware for common CV operations (filtering, feature detection); a toy filtering sketch follows this list
    4. Hardware-accelerated video codecs
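
As a software reference for what the fixed-function filtering block in item 3 computes, a tiny numpy 3x3 filter (Sobel horizontal-gradient kernel); the hardware equivalent streams pixels through a dedicated datapath instead of looping in Python.

```python
import numpy as np

def filter_3x3(image, kernel):
    """Naive 3x3 sliding-window filter, the kind of op a CV block hard-wires."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float32)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(image[y:y + 3, x:x + 3] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

frame = np.random.rand(240, 320).astype(np.float32)  # stand-in for ISP output
edges = filter_3x3(frame, sobel_x)
print(edges.shape)  # (238, 318)
```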

Integration within an ASIC:

[rough ASIC block diagram sketch]

Key aspects of this integration:

  1. Data Flow
  • Input data comes in through the external memory interface
  • The DMA controller manages data movement between components
  • Local memory buffers minimize external memory access
  • Results flow back through the system bus to output or storage
  2. Processing Pipeline
  • The vision processor often processes the raw data first
  • Its results feed into the AI accelerator for inference
  • A control processor orchestrates the flow
  • Both blocks can work in parallel for efficiency (see the pipeline sketch after this list)
  3. Memory Hierarchy
  • External DRAM for bulk storage
  • On-chip SRAM for fast access
  • Local buffers within each block
  • Cache hierarchy for frequently accessed data
  4. Optimization Techniques
  • Data reuse to minimize memory access
  • Pipeline parallelism between blocks
  • Custom datapaths for common operations
  • Clock and power gating for unused blocks
  5. Common Challenges
  • Memory bandwidth bottlenecks
  • Power consumption management
  • Thermal considerations
  • Balancing flexibility vs. efficiency
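
A rough sketch of the "both blocks work in parallel" idea from the Processing Pipeline item, modeled with two Python threads and a bounded queue standing in for the on-chip buffer between the vision block and the AI block; both stage bodies are placeholders.

```python
import queue
import threading

buf = queue.Queue(maxsize=4)  # stands in for the on-chip buffer between blocks
SENTINEL = object()

def vision_stage(frames):
    """Stage 1: the 'vision processor' preprocesses raw frames."""
    for f in frames:
        buf.put(("preprocessed", f))  # placeholder for ISP / resize / normalize
    buf.put(SENTINEL)

def ai_stage(results):
    """Stage 2: the 'AI accelerator' runs inference on preprocessed frames."""
    while True:
        item = buf.get()
        if item is SENTINEL:
            break
        results.append(("inferred", item[1]))  # placeholder for model inference

frames = list(range(16))
results = []
t1 = threading.Thread(target=vision_stage, args=(frames,))
t2 = threading.Thread(target=ai_stage, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 16: frame N+1 is preprocessed while frame N is inferred
```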

This architecture allows for efficient processing of AI and vision workloads while maintaining flexibility for different applications. The exact configuration would depend on the specific use case requirements for power, performance, and area constraints. (Claude)
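
Back to the earlier side note about combining edge and cloud inference: a rough sketch of a local-first dispatcher where cloud inference is an opt-in fallback. The endpoint URL, the run_local body, and the confidence threshold are all made-up placeholders.

```python
import json
import urllib.request

CLOUD_ENDPOINT = "https://example.com/infer"  # placeholder endpoint
CONFIDENCE_THRESHOLD = 0.8                    # made-up cutoff

def run_local(frame):
    """Placeholder for on-device inference (e.g. a TFLite interpreter call)."""
    return {"label": "cat", "confidence": 0.65}

def run_cloud(frame):
    """Send the frame to a cloud model; only used if the user opted in."""
    req = urllib.request.Request(
        CLOUD_ENDPOINT,
        data=json.dumps({"frame": frame}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def infer(frame, cloud_opt_in=False):
    """Edge-first: answer locally, escalate to the cloud only when the user
    allows it and the local model is unsure."""
    result = run_local(frame)
    if cloud_opt_in and result["confidence"] < CONFIDENCE_THRESHOLD:
        result = run_cloud(frame)
    return result

print(infer([0.0] * 16))  # stays fully on-device by default
```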

NPU stuff: https://www.chipstrat.com/p/the-circuit-episode-70-npus