Ascend-SACT/hisa
模型介绍文件和版本Pull Requests讨论分析
下载使用量0

HISA Triton Ascend Operator Deliverable

This workspace implements the HISA indexer path from HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention for Ascend 910B-oriented Triton workflows.

Quick Start

cd /workspace
export PYTHONPATH=/workspace/src:$PYTHONPATH
python -m compileall -q src tests scripts
bash scripts/run_unit_tests.sh
HISA_TEST_DEVICE=npu python tests/test_hisa_correctness.py

Run demos:

python scripts/run_demo.py --device cpu
python scripts/run_demo.py --device npu --L 64 --Q 8 --H 2 --D 16 --block-size 8 --top-m 4 --top-k 16

Run msprof profiling:

bash scripts/profile_msprof.sh function small
KERNEL_NAME=_select_tokens_from_blocks_kernel bash scripts/profile_msprof.sh op small

Main Files

  • src/hisa_reference.py: PyTorch reference for DSA/HISA semantics.
  • src/hisa_triton.py: Triton kernels and hisa_select wrapper.
  • src/hisa_layout.py: packed sequence adapter.
  • tests/test_hisa_correctness.py: no-pytest test runner.
  • scripts/profile_hisa_only.py: msprof app that avoids reference validation ops.
  • scripts/profile_msprof.sh: function-level and op-level profiling wrapper.
  • reports/performance_report.md: msprof-backed performance report.

Current Behavior

hisa_select keeps the original public semantics: input queries [Q,H,D], keys [L,D], weights [Q,H], optional query_positions [Q], output [Q, top_k] int32 token indices.

The default path on NPU:

  1. Triton mean-pools token keys to block keys.
  2. Triton selects candidate blocks with top-m and forced first/current-last block logic.
  3. Triton directly selects final token indices for moderate candidate windows.
  4. For large candidate windows, it falls back to candidate-score matrix plus torch.topk to avoid UB overflow.

Controls:

  • HISA_USE_FUSED_BLOCKS=0: disable fused block selection.
  • HISA_USE_FUSED_TOPK=0: disable fused final token selection.
  • HISA_FUSED_TOPK_MAX_ELEMS: guard for candidate_count * D; default 8192.

Current Status

  • CPU correctness: 10/10 PASS.
  • NPU correctness: 10/10 PASS.
  • Small-case msprof shows fused path improves device task duration versus MVP bridge.
  • Large windows still require tiled top-k work to avoid fallback.