HISA Triton Ascend Operator Deliverable

This workspace implements the HISA indexer path from the paper "HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention", targeting Triton workflows on Ascend 910B.

Quick Start

cd /workspace
export PYTHONPATH=/workspace/src:$PYTHONPATH
python -m compileall -q src tests scripts
bash scripts/run_unit_tests.sh
HISA_TEST_DEVICE=npu python tests/test_hisa_correctness.py

Run demos:

python scripts/run_demo.py --device cpu
python scripts/run_demo.py --device npu --L 64 --Q 8 --H 2 --D 16 --block-size 8 --top-m 4 --top-k 16

Run msprof profiling:

bash scripts/profile_msprof.sh function small
KERNEL_NAME=_select_tokens_from_blocks_kernel bash scripts/profile_msprof.sh op small

Main Files

src/hisa_reference.py: PyTorch reference for DSA/HISA semantics.
src/hisa_triton.py: Triton kernels and hisa_select wrapper.
src/hisa_layout.py: packed sequence adapter.
tests/test_hisa_correctness.py: no-pytest test runner.
scripts/profile_hisa_only.py: msprof app that avoids reference validation ops.
scripts/profile_msprof.sh: function-level and op-level profiling wrapper.
reports/performance_report.md: msprof-backed performance report.

Current Behavior

hisa_select keeps the original public semantics. Inputs: queries [Q,H,D], keys [L,D], weights [Q,H], and optional query_positions [Q]. Output: int32 token indices of shape [Q, top_k].

The default path on NPU:

Triton mean-pools token keys to block keys.
Triton selects candidate blocks with top-m and forced first/current-last block logic.
Triton directly selects final token indices for moderate candidate windows.
For large candidate windows, the wrapper falls back to building a candidate-score matrix and calling torch.topk, which avoids UB overflow.
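
The three stages above can be sketched in NumPy as a reading aid. This is not the code in src/hisa_reference.py or src/hisa_triton.py. It rests on several assumptions that are not stated in this README: the score is a weighted sum over heads of ReLU(q · k), "forced first/current-last" means block 0 and the block containing the query position are always kept, tokens after the query position are masked, and unused output slots are padded with -1. The name hisa_select_sketch is hypothetical.

```python
import numpy as np

def hisa_select_sketch(queries, keys, weights, query_positions,
                       block_size, top_m, top_k):
    """Illustrative NumPy version of the pool -> block top-m -> token top-k flow."""
    Q, H, D = queries.shape
    L = keys.shape[0]
    num_blocks = (L + block_size - 1) // block_size

    # Stage 1: mean-pool token keys into one key per block.
    block_keys = np.stack([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(num_blocks)
    ])

    out = np.full((Q, top_k), -1, dtype=np.int32)  # -1 padding is an assumption
    for q in range(Q):
        pos = int(query_positions[q])
        cur_block = pos // block_size

        # Stage 2: score blocks (assumed scoring: sum_h w[h] * relu(q[h] . k)).
        block_scores = (weights[q][:, None]
                        * np.maximum(queries[q] @ block_keys.T, 0.0)).sum(axis=0)
        block_scores[cur_block + 1:] = -np.inf   # blocks in the future are masked
        block_scores[0] = np.inf                 # force the first block
        block_scores[cur_block] = np.inf         # force the current (last) block
        m = min(top_m, cur_block + 1)
        blocks = np.argsort(-block_scores, kind="stable")[:m]

        # Stage 3: score only tokens inside the selected blocks, then top-k.
        cand = np.concatenate([
            np.arange(b * block_size, min((b + 1) * block_size, L))
            for b in sorted(blocks)
        ])
        cand = cand[cand <= pos]                 # causal mask inside current block
        token_scores = (weights[q][:, None]
                        * np.maximum(queries[q] @ keys[cand].T, 0.0)).sum(axis=0)
        chosen = cand[np.argsort(-token_scores, kind="stable")[:top_k]]
        out[q, :len(chosen)] = chosen
    return out
```

The per-query loop materializes a score vector over the candidate window, which corresponds to the fallback style of computation; the fused kernels exist to avoid exactly that intermediate for moderate windows.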

Controls:

HISA_USE_FUSED_BLOCKS=0: disable fused block selection.
HISA_USE_FUSED_TOPK=0: disable fused final token selection.
HISA_FUSED_TOPK_MAX_ELEMS: upper bound on candidate_count * D for the fused top-k path; default 8192.
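
As a rough illustration of how these controls might combine, the sketch below reads the three variables and decides whether the fused top-k path is eligible. The function names read_hisa_controls and fused_topk_allowed are hypothetical, and two details are assumptions rather than facts from the wrapper: that both flags default to enabled, and that the guard is an inclusive comparison (<=).

```python
import os

def read_hisa_controls(env=None):
    """Parse the HISA_* controls; anything other than "0" leaves a flag enabled."""
    env = os.environ if env is None else env
    return {
        "use_fused_blocks": env.get("HISA_USE_FUSED_BLOCKS", "1") != "0",
        "use_fused_topk": env.get("HISA_USE_FUSED_TOPK", "1") != "0",
        "fused_topk_max_elems": int(env.get("HISA_FUSED_TOPK_MAX_ELEMS", "8192")),
    }

def fused_topk_allowed(candidate_count, D, controls):
    """True if the fused token selection may run; otherwise use the torch.topk fallback."""
    # Assumption: the guard compares candidate_count * D against the limit inclusively.
    within_limit = candidate_count * D <= controls["fused_topk_max_elems"]
    return controls["use_fused_topk"] and within_limit
```

With the default limit and D = 16, a window of 512 candidates sits exactly at the bound (512 * 16 = 8192), so larger windows would take the fallback under this reading.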

Current Status

CPU correctness: 10/10 PASS.
NPU correctness: 10/10 PASS.
In small-case msprof runs, the fused path has a shorter device task duration than the MVP bridge.
Large candidate windows still take the fallback; removing it requires tiled top-k work.
