TongmingLAIC / AKO4ALL
ako4all
Drive an agentic loop that iteratively optimizes a GPU kernel for maximum speedup. Use this skill whenever the user wants to optimize / speed up / benchmark a GPU kernel (CUDA, Triton, TileLang, C++, Python), mentions AKO / AKO4ALL / AKO4X / agentic kernel optimization, asks to "make this kernel faster", or has a kernel they want measured against a PyTorch reference. The skill handles setup, profiling (ncu), correctness checking, iteration logging, and git commits. Bootstraps a workspace in any directory the user points at.
Drive a profile → modify → benchmark → log → commit loop on a GPU kernel until it runs faster than the reference. The user provides at minimum a kernel; everything else (reference, inputs, bench script, hints) is optional.
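The loop above can be sketched as a small driver. This is only an illustration under stated assumptions: `run_loop`, `Iteration`, and every callable passed in are hypothetical names, not part of the skill. Reverting a change that is incorrect or slower, and stopping once the speedup passes `target_speedup`, are assumed policies rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Iteration:
    """One logged pass through the loop (hypothetical record layout)."""
    step: int
    latency_ms: float
    speedup: float  # reference latency divided by kernel latency
    kept: bool      # True if the change was committed


def run_loop(
    reference_ms: float,
    profile: Callable[[], str],
    modify: Callable[[str], None],
    benchmark: Callable[[], float],
    is_correct: Callable[[], bool],
    commit: Callable[[Iteration], None],
    revert: Callable[[], None],
    max_steps: int = 10,
    target_speedup: float = 1.0,
) -> List[Iteration]:
    """Profile, modify, benchmark, log, and commit until the target is beaten."""
    log: List[Iteration] = []
    best_ms = benchmark()  # baseline latency of the unmodified kernel
    for step in range(1, max_steps + 1):
        report = profile()       # e.g. the text of a profiler report
        modify(report)           # apply one candidate change to the kernel
        latency = benchmark()
        # Assumed policy: keep a change only if it is correct and faster.
        kept = is_correct() and latency < best_ms
        entry = Iteration(step, latency, reference_ms / latency, kept)
        log.append(entry)        # every attempt is logged, kept or not
        if kept:
            best_ms = latency
            commit(entry)
        else:
            revert()
        if reference_ms / best_ms > target_speedup:
            break                # the kernel now beats the reference
    return log
```

Passing the steps in as callables keeps the sketch independent of any particular profiler, benchmark script, or version-control command. A real benchmark callable for a GPU kernel would also need to synchronize the device before reading the clock.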
When this skill applies
- "optimize this kernel" / "speed up this CUDA / Triton / TileLang kernel"
- "run AKO / AKO4ALL on ..."
- "benchmark this kernel against PyTorch"
- "iterate on this kernel until it's faster"
- mentions of ncu, kernel profiling, GPU speedup target
Does NOT apply when:
- User wants to write a new kernel from scratch with no optimization target — just write code, no loop.
- User wants Codex / GPT to review or implement: use codex:rescue instead.
- User wants generic performance advice for code that isn't a GPU kernel.
First action
Before doing anything else, establish the workspace — the directory the loop runs in. It is typically the user's CWD, or a subdirectory / path they name in the prompt.
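As a rough sketch of that first step, the helper below picks the workspace from an optional path named in the prompt and falls back to the current directory. `resolve_workspace` and its parameters are hypothetical, and rejecting a path that does not exist is an assumption; the skill may instead create the directory when it bootstraps.

```python
from pathlib import Path
from typing import Optional


def resolve_workspace(named_path: Optional[str] = None,
                      cwd: Optional[Path] = None) -> Path:
    """Return the directory the loop should run in (hypothetical helper)."""
    base = (cwd or Path.cwd()).resolve()
    if not named_path:
        return base  # no path named in the prompt: use the CWD
    candidate = Path(named_path).expanduser()
    if not candidate.is_absolute():
        candidate = base / candidate  # treat it as a subdirectory of the CWD
    candidate = candidate.resolve()
    # Assumption: a missing directory is an error rather than something to create.
    if not candidate.is_dir():
        raise NotADirectoryError(f"workspace does not exist: {candidate}")
    return candidate
```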
Inventory the workspace + prompt
Browse the workspace (don't run a fixed checklist — look around) and read the user's prompt to identify what the loop needs: