The Universal Deep Learning Bridge
Shrew is a runtime and DSL that decouples deep learning from any single language or framework. Define your model once in .sw, then train and deploy it from Python, Rust, JavaScript, C++, or any language in your stack.
Why choose Shrew
Built from the ground up in Rust for performance, flexibility, and ease of use.
Modular by Design
Use only what you need. Pick individual crates like shrew-core for tensors, shrew-cuda for GPU, or shrew-nn for layers — each works on its own.
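A project that only needs CPU tensors, for instance, might pull in a single crate. A hypothetical Cargo.toml fragment (crate names from above; version numbers are placeholders, check crates.io for the real ones):

```toml
[dependencies]
shrew-core = "*"    # tensors, shapes, dtypes, autograd
# shrew-nn = "*"    # add layers only when you need them
# shrew-cuda = "*"  # opt in to GPU support
```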
Write Once, Run Anywhere
Define your model in a .sw file — a clean, human-readable format that separates architecture from code. No boilerplate, no transpilation headaches.
Rust Safety, Familiar Feel
Eager-mode autograd with an intuitive API, plus Rust's type safety and fearless concurrency. Catch shape bugs at compile time, not at 3am.
Python Native, Rust Powered
Research in Python, deploy in Rust — seamlessly. Zero-copy NumPy interop and a familiar API make the transition painless.
Deploy Anywhere
Small binaries, minimal dependencies, and native quantization (INT8/INT4). From cloud servers to edge devices, Shrew goes where you need it.
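To make the INT8 part concrete: quantization maps f32 values onto 8-bit integers via a scale factor, trading a bounded rounding error for 4x smaller weights. A minimal, self-contained sketch of the standard symmetric per-tensor scheme (illustrating the idea, not Shrew's actual implementation):

```rust
// Symmetric per-tensor INT8 quantization: x ~= scale * q, with q in [-127, 127].
fn quantize_i8(xs: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = xs.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = xs
        .iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.02f32, -1.5, 0.75, 3.0, -3.0];
    let (q, scale) = quantize_i8(&weights);
    let back = dequantize_i8(&q, scale);
    // Round-trip error is bounded by scale / 2 per element.
    for (w, b) in weights.iter().zip(&back) {
        assert!((w - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("scale = {scale}, q = {q:?}");
}
```

INT4 works the same way with a [-7, 7] range and a correspondingly larger error bound.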
Built-in CLI Tools
Inspect, validate, and benchmark your models from the command line. Run shrew dump, validate, bench, or info — no extra setup required.
See it in action
Real examples showing how Shrew works — from basic tensor math to training a GPT model.
use shrew::prelude::*;
fn main() -> shrew::Result<()> {
let dev = CpuDevice;
// Broadcasting: [3,1] + [1,2] → [3,2]
let a = CpuTensor::from_f64_slice(
&[1.0, 2.0, 3.0], (3, 1), DType::F64, &dev
)?;
let b = CpuTensor::from_f64_slice(
&[10.0, 20.0], (1, 2), DType::F64, &dev
)?;
let c = a.add(&b)?;
// Reverse-mode autograd
let w = CpuTensor::randn((3, 3), DType::F64, &dev)?
.set_variable();
let x = CpuTensor::randn((2, 3), DType::F64, &dev)?;
let loss = x.matmul(&w)?.sum_all()?;
let grads = loss.backward()?;
let dw = grads.get(&w).unwrap(); // ∂loss/∂w
Ok(())
}

Blazing Fast by Default
Shrew's CPU backend uses hardware-level optimizations to squeeze every ounce of performance from your machine. Pair it with the CUDA GPU backend for even more speed.
7 DTypes supported · 60+ tensor ops · 10 crates
CPU Benchmark — release mode (AMD/Intel)
Modular by design
Every component is its own independent package — use what you need, swap what you don't. The same code runs on CPU and GPU without changes.
.sw Model File
Your model definition — architecture, config, and training in one place
shrew-ir
Parse → Optimize → Prepare for execution
JIT Compiler
Compile into fast, ready-to-run instructions
CPU
SIMD + rayon
CUDA
cuBLAS + PTX
Python
PyO3 + NumPy
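The CPU backend's strategy — vectorized inner loops, with rayon fanning work out across cores — can be pictured with a dependency-free sketch that uses std::thread in place of rayon (an illustration of the chunk-per-core pattern, not Shrew's kernels):

```rust
use std::thread;

// Data-parallel reduction: split the slice into chunks and sum each on its
// own thread, then combine the partial sums.
fn parallel_sum(xs: &[f64], n_threads: usize) -> f64 {
    let chunk = ((xs.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let xs: Vec<f64> = (1..=1000).map(|i| i as f64).collect();
    // 1 + 2 + ... + 1000 = 500500, regardless of how the work is split.
    assert_eq!(parallel_sum(&xs, 4), 500500.0);
    println!("sum = {}", parallel_sum(&xs, 4));
}
```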
The foundation: tensors, shapes, data types, and automatic differentiation.
Parses .sw model files into optimized execution graphs.
High-performance CPU computations with SIMD acceleration.
NVIDIA GPU acceleration with optimized memory management.
Ready-to-use neural network layers: Linear, Conv, Transformer, and more.
Optimizers (Adam, SGD, etc.) and learning rate schedulers.
Data loading utilities and built-in datasets like MNIST.
The main package: JIT compiler, trainer, quantization, ONNX export, and more.
Python bindings with seamless NumPy integration.
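As a reference point for what the optimizer crate provides: plain SGD nudges each parameter against its gradient, w ← w − lr·g. A minimal sketch of that update rule (illustrating the algorithm itself, not Shrew's optimizer API):

```rust
// Vanilla SGD step, applied element-wise: w <- w - lr * g.
fn sgd_step(weights: &mut [f64], grads: &[f64], lr: f64) {
    for (w, g) in weights.iter_mut().zip(grads) {
        *w -= lr * g;
    }
}

fn main() {
    // Minimize f(w) = w^2, whose gradient is 2w.
    let mut w = [5.0f64];
    for _ in 0..100 {
        let g = [2.0 * w[0]];
        sgd_step(&mut w, &g, 0.1);
    }
    // Each step scales w by 0.8, so it converges toward 0.
    assert!(w[0].abs() < 1e-6);
    println!("w converged to {}", w[0]);
}
```

Adam and the learning-rate schedulers layer adaptive per-parameter step sizes and decay policies on top of this same basic loop.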
Serialization Formats
.shrew
Native binary checkpoints
Safetensors
HuggingFace compatible
ONNX
Opset 17, zero deps
Documentation
Guides for the .sw DSL syntax, language bindings, API references, and tutorials. Everything you need to get started with Shrew.
Start building with Shrew
Add Shrew to your Rust project, use the Python bindings, or define models in .sw files. One runtime, every language.
MNIST with MLP
cargo run -p mnist-example
MNIST with CNN
cargo run -p mnist-cnn-example
Char-level GPT
cargo run --release -p char-gpt-example
Python bindings
pip install shrew-python