
GPGPU

tags: Floating Point, Concurrency, Flynn's Taxonomy, Machine Learning

Learning resources

  • https://modal.com/gpu-glossary/readme 🌟

CUDA

The CUDA stack

| Level | Component Category | Key Components | Installation Status on Modal | Key Characteristics |
|---|---|---|---|---|
| 0 | Kernel-mode driver | NVIDIA Accelerated Graphics Driver (e.g., 570.86.15) | Already installed | Tightly integrated with the host OS; not user-modifiable. Communicates directly with the GPU. |
| 1 | User-mode driver API | CUDA Driver API (libcuda.so), NVIDIA Management Library (libnvidia-ml.so), nvidia-smi CLI | Already installed on all Modal machines with GPU access | Lets user-space programs talk to the kernel driver. nvidia-smi checks GPU status. |
| 2 | CUDA Toolkit | CUDA Runtime API (libcudart.so), compiler (nvcc), nvrtc, cuDNN, profilers, etc. | Not installed by default | Wraps the CUDA Driver API. Tools for developing/debugging CUDA programs. |

Note on CUDA Toolkit Installation

  • Not System-Wide by Default: Components of the CUDA Toolkit (like the CUDA Runtime API) are not pre-installed across the entire system on Modal.
  • Python Package Dependencies: Many Python libraries (e.g., torch) bundle necessary CUDA Toolkit components (like nvidia-cuda-runtime-cu12) as pip-installable dependencies. This makes CUDA available to that specific Python application (see the sketch after this list).
  • System-Level Requirement: Some tools or libraries require the CUDA Toolkit to be installed system-wide, not just within a Python environment. These tools may not find pip-installed CUDA components.
  • Solution for System-Level: For applications needing a system-wide CUDA Toolkit, Modal recommends using official nvidia/cuda Docker images, which come with the full toolkit pre-installed.
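
A quick way to see which CUDA runtime a pip-installed torch brings along, and whether the drivers (levels 0-1 above) are visible, is a minimal sketch like this (assumes a CUDA-enabled torch wheel):

  import torch

  print(torch.version.cuda)         # CUDA version the wheel was built against (bundled runtime)
  print(torch.cuda.is_available())  # True only if the kernel/user-mode drivers are present
  if torch.cuda.is_available():
      print(torch.cuda.get_device_name(0))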

Resources

FAQ

Different kinds of hardware for ML

| Feature | CPU | GPU | APU¹ | TPU | FPGA | ASIC² |
|---|---|---|---|---|---|---|
| Primary Use | General Compute | Graphics, Parallel | Combined CPU+GPU | ML Acceleration (NN/Tensor) | Reconfigurable Logic | Single Task Optimized |
| Architecture | Few Powerful Cores | Many Simple Cores | Mixed CPU/GPU Cores | Matrix/Tensor ASIC | Customizable Logic Grid | Custom Fixed Hardware |
| ML Since | Always | 2000s (GPGPU), 2012 | 2010s (Integrated) | 2015 (Internal), 2018 | Mid-2010s (Accel.) | Mid/Late 2010s (ML) |
| ML Prevalence | System Base, Light ML | Very High (Training) | Moderate (Edge/PC) | Growing (Google Cloud) | Niche (Low Latency) | Growing Fast (Inference) |
| ML Advantages | Flexible, Sequential | Parallelism, Ecosystem | Balanced, Power/Cost Efficient | Perf/Watt (Matrix), Scale | Customizable, Low Latency | Max Perf/Watt (Task) |
| ML Limits | Poor Parallelism | Power, Sparse Data | Shared Resource Limits | Less Flexible, Ecosystem | Complex Dev, HW Skill | Inflexible, High NRE |
| ML Use Cases | Data Prep, Orchestration | DL Training, Inference | Edge AI, Mixed Loads | Large Scale DL (GCP) | Real-time Inference | High-Vol Inference |

¹ APU: Accelerated Processing Unit. ² ASIC: Application-Specific Integrated Circuit.

Nvidia GPUs

CUDA core

  • Each CUDA core can do only one multiply-accumulate (MAC) per clock: multiply two FP32 values and add the result to an accumulator
  • e.g. x += y * z (see the sketch below)
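
In framework terms, an elementwise multiply-accumulate is the kind of work that gets spread across CUDA cores. A minimal PyTorch sketch (torch.addcmul computes input + tensor1 * tensor2 elementwise; whether it lowers to single FMA instructions is up to the backend):

  import torch

  x = torch.rand(1024, device="cuda")
  y = torch.rand(1024, device="cuda")
  z = torch.rand(1024, device="cuda")

  # x + y * z: one multiply-accumulate per element
  out = torch.addcmul(x, y, z)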

Tensor core

  • A tensor core takes a 4x4 FP16 matrix, multiplies it by another 4x4 FP16 matrix, then adds an FP16 or FP32 4x4 matrix to the product and returns the result as a new matrix (see the sketch below).
  • Later tensor core generations added support for INT8 and INT4 precision modes for quantization.
  • NVIDIA revises the design with each architecture, e.g. Turing tensor cores, Ampere tensor cores, etc.
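
That D = A x B + C shape is exactly what torch.addmm computes, so a PyTorch sketch of the tensor core operation looks like this (assumes a CUDA device; whether tensor cores are actually used for a given shape/dtype is decided by the cuBLAS backend):

  import torch

  A = torch.rand(4, 4, dtype=torch.half, device="cuda")
  B = torch.rand(4, 4, dtype=torch.half, device="cuda")
  C = torch.rand(4, 4, dtype=torch.half, device="cuda")

  # D = C + A @ B: multiply two FP16 matrices, accumulate a third
  D = torch.addmm(C, A, B)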

See Category:Nvidia microarchitectures - Wikipedia

RAM

???

VRAM

  • VRAM capacity determines how big a model (weights plus activations) is allowed to be; a rough estimate below
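
Back-of-the-envelope: the weights alone need parameter count x bytes per parameter, before activations, optimizer state, or KV caches. A sketch (the 7B parameter count is just an illustrative size):

  def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
      return n_params * bytes_per_param / 1e9

  print(weight_memory_gb(7e9, 4))  # FP32: ~28 GB
  print(weight_memory_gb(7e9, 2))  # FP16/BF16: ~14 GB
  print(weight_memory_gb(7e9, 1))  # INT8: ~7 GB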

Performance

  • Typically measured in floating point operations per second, i.e. FLOPS / GFLOPS
  • A workload suits the GPU when the number of floating point operations per memory access is high; see the sketch below
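
That ratio is called arithmetic intensity (FLOPs per byte moved). For an MxK by KxN matmul it works out as below (a sketch assuming each matrix is read or written exactly once):

  def matmul_arithmetic_intensity(M: int, N: int, K: int, bytes_per_elem: int = 4) -> float:
      flops = 2 * M * N * K                                    # one multiply + one add per term
      bytes_moved = (M * K + K * N + M * N) * bytes_per_elem   # read A, read B, write C
      return flops / bytes_moved

  print(matmul_arithmetic_intensity(4096, 4096, 4096))  # big square matmul: high intensity, GPU-friendly
  print(matmul_arithmetic_intensity(1, 4096, 4096))     # matrix-vector: low intensity, memory-bound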

Floating Point support

See Floating Point

  • GPUs support half, single, and double precision
  • Double precision support on GPUs is comparatively recent
  • GPU vendors also ship formats of their own (e.g. NVIDIA's TF32), so support varies by vendor and architecture; a dtype sketch follows
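
In PyTorch terms the common GPU dtypes and their sizes look like this (a sketch; assumes a CUDA device, and actual support depends on the architecture, e.g. bfloat16 needs Ampere or newer on NVIDIA):

  import torch

  x = torch.rand(8, device="cuda")
  for dtype in (torch.float16, torch.bfloat16, torch.float32, torch.float64):
      print(dtype, x.to(dtype).element_size(), "bytes/elem")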

F32

float32 is very widely used in gaming.

  • A float32 multiplication is really a 24-bit significand multiplication, about half the cost of a full 32-bit multiplication, so an int32 multiply is roughly 2x as expensive as a float32 multiply.
  • The float32-to-float64 throughput gap varies widely by GPU: some compute-oriented cards are in the 2-4x range, while consumer cards commonly limit FP64 to 1/32 or 1/64 of FP32 throughput; the timing sketch below measures it directly.
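
One way to see the gap on a specific card is to time the same matmul in both precisions. A rough sketch (real benchmarking needs warmup runs; the observed ratio depends heavily on the GPU model):

  import torch

  def time_matmul(dtype, n=4096, iters=10):
      a = torch.rand(n, n, dtype=dtype, device="cuda")
      b = torch.rand(n, n, dtype=dtype, device="cuda")
      start = torch.cuda.Event(enable_timing=True)
      end = torch.cuda.Event(enable_timing=True)
      torch.cuda.synchronize()
      start.record()
      for _ in range(iters):
          a @ b
      end.record()
      torch.cuda.synchronize()
      return start.elapsed_time(end) / iters  # milliseconds per matmul

  print(time_matmul(torch.float32))
  print(time_matmul(torch.float64))  # usually far slower on consumer cards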

Frameworks

  • OpenCL: the dominant open GPGPU computing language
  • OpenAI Triton: a Python-based language and compiler for parallel programming (see the sketch below)
  • CUDA: the dominant proprietary framework
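
For a feel of what Triton looks like, here is a minimal vector-add kernel in the style of Triton's own tutorials (a sketch; assumes the triton package and a CUDA GPU):

  import torch
  import triton
  import triton.language as tl

  @triton.jit
  def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
      pid = tl.program_id(axis=0)                    # which block this program instance handles
      offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements                    # guard the tail of the array
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      tl.store(out_ptr + offsets, x + y, mask=mask)

  x = torch.rand(10_000, device="cuda")
  y = torch.rand(10_000, device="cuda")
  out = torch.empty_like(x)
  grid = (triton.cdiv(x.numel(), 1024),)
  add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)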

More on CUDA

  • A driver supports CUDA only up to a certain version. E.g. when nvidia-smi reports “CUDA Version: 12.1”, that is the newest CUDA my card’s driver supports; it doesn’t mean CUDA is installed.
  • So I can install a cudatoolkit up to around that version.
  • But the cudatoolkit is separate from the NVIDIA driver: you can install the toolkit (and compile with nvcc) without the driver, though nothing will actually run on the GPU until the driver is there.
  • Pytorch

    • E.g. to run PyTorch you don’t need the cudatoolkit, because the wheels ship their own CUDA runtime and math libs.
    • A local CUDA toolkit is only used when building PyTorch from source (or compiling custom CUDA extensions).
    • If pytorch-cuda is built with cuda 11.7, the conda pytorch-cuda metapackage pulls in the matching CUDA libraries itself, so a system-wide CUDA 11.7 install shouldn’t be needed (it does ship the runtime).
    • nvcc is the CUDA compiler
    • torchaudio: https://pytorch.org/audio/main/installation.html
  • Setting up CUDA on NixOS

    • Installing the NVIDIA driver is a separate game that has nothing to do with CUDA. Sort that out first; it belongs in configuration.nix (or whatever configures the system).
    • For the CUDA runtime there are a few knobs, but most importantly LD_LIBRARY_PATH should not be set globally. See this: Problems with rmagik / glibc: `GLIBCXX_3.4.32' not found - #7 by rgoulter - Help - NixOS Discourse
    • So install all the CUDA stuff in a flake, and we should be good.
    • Check versions
      • nvidia-smi will give the CUDA version the driver supports
      • After installing pkgs.cudaPackages.cudatoolkit you’ll have nvcc in your path.
        • Running nvcc --version will give local cuda version
    • For flake
      postShellHook = ''
      #export LD_DEBUG=libs; # debugging
      
      export LD_LIBRARY_PATH="${pkgs.lib.makeLibraryPath [
        pkgs.stdenv.cc.cc
        # pkgs.libGL
        # pkgs.glib
        # pkgs.zlib
      
        # NOTE: for why we need to set it to "/run/opengl-driver", check following:
        # - This is primarily to get libcuda.so which is part of the
        #   nvidia kernel driver installation and not part of
        #   cudatoolkit
        # - https://github.com/NixOS/nixpkgs/issues/272221
        # - https://github.com/NixOS/nixpkgs/issues/217780
        # NOTE: Instead of using /run/opengl-driver we could do
        #       pkgs.linuxPackages.nvidia_x11 but that'd get another
        #       version of libcuda.so which is not compatible with the
        #       original driver, so we need to refer to the stuff
        #       directly installed on the OS
        "/run/opengl-driver"
      
        # "${pkgs.cudaPackages.cudatoolkit}"
        "${pkgs.cudaPackages.cudnn}"
      ]}"
      '';
      
    • Other packages
      • sometimes we need to add these to LD_LIBRARY_PATH directly
      • pkgs.cudaPackages.cudatoolkit
      • pkgs.cudaPackages.cudnn
      • pkgs.cudaPackages.libcublas
      • pkgs.cudaPackages.cuda_cudart
      • pkgs.cudaPackages.cutensor