This workspace implements the HISA indexer path from "HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention", targeting Triton workflows on Ascend 910B.
cd /workspace
export PYTHONPATH=/workspace/src:$PYTHONPATH
python -m compileall -q src tests scripts
bash scripts/run_unit_tests.sh
HISA_TEST_DEVICE=npu python tests/test_hisa_correctness.py

Run demos:
python scripts/run_demo.py --device cpu
python scripts/run_demo.py --device npu --L 64 --Q 8 --H 2 --D 16 --block-size 8 --top-m 4 --top-k 16

Run msprof profiling:
bash scripts/profile_msprof.sh function small
KERNEL_NAME=_select_tokens_from_blocks_kernel bash scripts/profile_msprof.sh op small

src/hisa_reference.py: PyTorch reference for DSA/HISA semantics.
src/hisa_triton.py: Triton kernels and hisa_select wrapper.
src/hisa_layout.py: packed sequence adapter.
tests/test_hisa_correctness.py: no-pytest test runner.
scripts/profile_hisa_only.py: msprof app that avoids reference validation ops.
scripts/profile_msprof.sh: function-level and op-level profiling wrapper.
reports/performance_report.md: msprof-backed performance report.

hisa_select keeps the original public semantics: input queries [Q,H,D], keys [L,D], weights [Q,H], optional query_positions [Q], output [Q, top_k] int32 token indices.
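
To make the shape contract above concrete, here is a minimal pure-Python sketch of a two-stage (block, then token) selection. It is not the code in src/hisa_reference.py: the function names, the scoring rule (a weighted sum over heads of ReLU(q·k)), the mean pooling of keys per block, and the tie-breaking are all assumptions chosen for illustration, and query_positions handling is omitted.

```python
# Illustrative sketch only; names and scoring rule are assumptions,
# not the repository's implementation.
import random


def token_score(q_heads, key, w_heads):
    """Weighted sum over heads of ReLU(q_h . key) (assumed scoring rule)."""
    total = 0.0
    for q, w in zip(q_heads, w_heads):
        dot = sum(a * b for a, b in zip(q, key))
        total += w * max(dot, 0.0)
    return total


def top_indices(scores, n):
    """Positions of the n largest scores; ties go to the lower position."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:n]


def hisa_select_sketch(queries, keys, weights, block_size, top_m, top_k):
    """queries [Q][H][D], keys [L][D], weights [Q][H] -> [Q][top_k] indices."""
    L, D = len(keys), len(keys[0])
    n_blocks = (L + block_size - 1) // block_size
    # One pooled key per block (mean pooling is an assumption).
    pooled = []
    for b in range(n_blocks):
        rows = keys[b * block_size:(b + 1) * block_size]
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(D)])
    out = []
    for q_heads, w_heads in zip(queries, weights):
        # Stage 1: keep the top_m highest-scoring blocks.
        block_scores = [token_score(q_heads, pk, w_heads) for pk in pooled]
        blocks = top_indices(block_scores, top_m)
        # Stage 2: score only tokens inside the kept blocks, keep top_k.
        cand = [t for b in sorted(blocks)
                for t in range(b * block_size, min((b + 1) * block_size, L))]
        cand_scores = [token_score(q_heads, keys[t], w_heads) for t in cand]
        out.append([cand[i] for i in top_indices(cand_scores, top_k)])
    return out


# Same sizes as the NPU demo command: L=64, Q=8, H=2, D=16.
random.seed(0)
L, Q, H, D = 64, 8, 2, 16
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L)]
queries = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]
           for _ in range(Q)]
weights = [[random.random() for _ in range(H)] for _ in range(Q)]
idx = hisa_select_sketch(queries, keys, weights,
                         block_size=8, top_m=4, top_k=16)
print(len(idx), len(idx[0]))  # 8 16
```

With block size 8 and top-m 4, each query scores at most 32 candidate tokens instead of all 64, which is the point of the hierarchical path; the real wrapper returns int32 tensors rather than Python lists.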
The default path on NPU:
torch.topk to avoid UB overflow.

Controls:
HISA_USE_FUSED_BLOCKS=0: disable fused block selection.
HISA_USE_FUSED_TOPK=0: disable fused final token selection.
HISA_FUSED_TOPK_MAX_ELEMS: guard for candidate_count * D; default 8192.
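
The following sketch shows one plausible way these controls could be read and combined. The helper name fused_flags is hypothetical, and three details are assumptions rather than facts about src/hisa_triton.py: that both fused paths are enabled unless the variable is set to 0, that the guard is an inclusive comparison (candidate_count * D <= limit), and that exceeding the guard simply turns the fused top-k path off.

```python
import os


def fused_flags(candidate_count, head_dim, env=None):
    """Return (use_fused_blocks, use_fused_topk) for one call.

    Hypothetical helper: defaults and the inclusive comparison are assumptions.
    """
    env = os.environ if env is None else env
    # Assumed default: fused paths are on unless explicitly set to "0".
    use_fused_blocks = env.get("HISA_USE_FUSED_BLOCKS", "1") != "0"
    use_fused_topk = env.get("HISA_USE_FUSED_TOPK", "1") != "0"
    # Documented default for the guard is 8192 elements.
    max_elems = int(env.get("HISA_FUSED_TOPK_MAX_ELEMS", "8192"))
    within_guard = candidate_count * head_dim <= max_elems
    return use_fused_blocks, use_fused_topk and within_guard


# Demo sizes: 4 blocks of 8 tokens -> 32 candidates, D=16 -> 512 elements.
print(fused_flags(32, 16, env={}))                              # (True, True)
# 1024 candidates * 16 = 16384 elements exceeds the default guard.
print(fused_flags(1024, 16, env={}))                            # (True, False)
print(fused_flags(32, 16, env={"HISA_USE_FUSED_BLOCKS": "0"}))  # (False, True)
```

Passing env explicitly keeps the helper testable without mutating the process environment.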