ako4all

TongmingLAIC / AKO4ALL

Drive an agentic loop that iteratively optimizes a GPU kernel for maximum speedup. Use this skill whenever the user wants to optimize / speed up / benchmark a GPU kernel (CUDA, Triton, TileLang, C++, Python), mentions AKO / AKO4ALL / AKO4X / agentic kernel optimization, asks to "make this kernel faster", or has a kernel they want measured against a PyTorch reference. The skill handles setup, profiling (ncu), correctness checking, iteration logging, and git commits. Bootstraps a workspace in any directory the user points at.


Drive a profile → modify → benchmark → log → commit loop on a GPU kernel until it runs faster than the reference. The user provides at minimum a kernel; everything else (reference, inputs, bench script, hints) is optional.
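The loop above can be sketched in a few lines of plain Python. This is only an illustrative skeleton, not the skill's actual implementation: `optimize`, `Iteration`, and the `propose`, `check`, and `bench` callables are hypothetical names introduced here, and the profiling (ncu) and git commit steps are reduced to comments so the sketch runs without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Iteration:
    """One row of the iteration log (hypothetical layout)."""
    index: int
    latency_ms: float
    speedup: float   # reference latency divided by candidate latency
    correct: bool
    kept: bool       # True if this candidate became the new best


def optimize(
    reference_ms: float,
    baseline_ms: float,
    propose: Callable[[int], object],   # profile + modify: returns a candidate kernel
    check: Callable[[object], bool],    # correctness check against the reference
    bench: Callable[[object], float],   # benchmark: returns latency in milliseconds
    max_iters: int = 5,
) -> List[Iteration]:
    """Run a profile -> modify -> benchmark -> log -> commit loop (sketch)."""
    best_ms = baseline_ms
    log: List[Iteration] = []
    for i in range(max_iters):
        candidate = propose(i)
        correct = check(candidate)
        # Incorrect candidates are never benchmarked or kept.
        latency = bench(candidate) if correct else float("inf")
        kept = correct and latency < best_ms
        if kept:
            best_ms = latency
            # A real loop would commit the candidate to git at this point.
        speedup = reference_ms / latency if correct else 0.0
        log.append(Iteration(i, latency, speedup, correct, kept))
    return log
```

One design choice worth noting in this sketch: a candidate is kept only when it is both correct and faster than the best so far, so a fast but wrong kernel never reaches the commit step.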

When this skill applies

"optimize this kernel" / "speed up this CUDA / Triton / TileLang kernel"
"run AKO / AKO4ALL on ..."
"benchmark this kernel against PyTorch"
"iterate on this kernel until it's faster"
mentions of ncu, kernel profiling, or a GPU speedup target

Does NOT apply when:

User wants to write a new kernel from scratch with no optimization target — just write code, no loop.
User wants Codex / GPT to review or implement — use codex:rescue instead.
User wants generic performance advice for code that isn't a GPU kernel.

First action

Before doing anything else, establish the workspace — the directory the loop runs in. It is typically the user's CWD, or a subdirectory / path they name in the prompt.
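As a rough illustration of that rule, the workspace could be resolved as below. The helper name `resolve_workspace` and its parameters are assumptions made for this sketch; the skill itself may decide the workspace differently.

```python
from pathlib import Path
from typing import Optional


def resolve_workspace(named_path: Optional[str] = None,
                      cwd: Optional[Path] = None) -> Path:
    """Pick the directory the loop runs in (hypothetical helper).

    Falls back to the current working directory when the prompt names
    no path; relative paths are taken relative to the CWD.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    if not named_path:
        return base.resolve()
    candidate = Path(named_path).expanduser()
    if not candidate.is_absolute():
        candidate = base / candidate
    return candidate.resolve()
```

For example, `resolve_workspace("kernels/attn")` would return the `kernels/attn` subdirectory of the CWD as an absolute path, while `resolve_workspace()` returns the CWD itself.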

Inventory the workspace + prompt

Browse the workspace (don't run a fixed checklist; look around) and read the user's prompt to identify what the loop needs.
