Neuromorphic Benchmarking
Neuromorphic Processing Element Array

The AI accelerator is built around a 2-D mesh of identical processing elements (PEs), each pairing a private L1 scratchpad (Spad) SRAM with a tunable arithmetic pipeline whose multiplier bit-width, adder bit-width, and vector length can be set per workload. PEs exchange data over short, pipelined links across the mesh, while a centrally positioned, multi-ported, L2-class global buffer streams weights and feature maps to the array and gathers partial results. This tiered memory hierarchy (private L1 scratchpads for high-reuse data, a shared L2 for larger tensors) minimizes off-chip traffic and lets the accelerator scale smoothly from energy-efficient INT8 inference to higher-precision operation simply by retargeting the per-PE arithmetic parameters, without altering the mesh fabric or the software stack. We also optimize dataflow and throughput by clustering multiple processing elements at model compile time, as sketched below.
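Concretely, the per-workload retargeting can be viewed as a small compile-time configuration object attached to each PE cluster. The following is a minimal sketch in Python; all names here (PEConfig, ClusterPlan, retarget) are hypothetical illustrations, not the accelerator's actual toolchain API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PEConfig:
    """Per-workload arithmetic parameters for one processing element.
    Field names are illustrative; the real compiler interface may differ."""
    mul_bits: int    # multiplier bit-width (e.g. 8 for INT8 inference)
    add_bits: int    # adder bit-width (wider, to hold accumulations)
    vector_len: int  # vector lanes per PE

@dataclass
class ClusterPlan:
    """A compile-time grouping of PEs that cooperate on one layer or tile."""
    pe_coords: List[Tuple[int, int]]  # (row, col) positions in the 2-D mesh
    config: PEConfig                  # shared arithmetic settings for the cluster

def retarget(mesh_rows: int, mesh_cols: int, cfg: PEConfig) -> List[ClusterPlan]:
    """Apply one arithmetic config across the whole mesh: only per-PE
    parameters change; the mesh fabric itself is untouched."""
    coords = [(r, c) for r in range(mesh_rows) for c in range(mesh_cols)]
    return [ClusterPlan(pe_coords=coords, config=cfg)]

# Same 16x16 fabric retargeted from INT8 inference to a higher-precision pass:
int8_plan = retarget(16, 16, PEConfig(mul_bits=8, add_bits=32, vector_len=16))
wide_plan = retarget(16, 16, PEConfig(mul_bits=16, add_bits=48, vector_len=8))
```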
Neuromorphic workload | Input resolution / sequence length | Latency (ms) | Throughput (FPS) | Area (mm²) | $ per token |
---|---|---|---|---|---|
MobileNet v2 [1] | 224 x 224 x 3 | 10 | 1328 | 907.69 | |
GEMM [1] | 224 x 224 x 3 | 3929 | 3.101 | 907.69 | |
MnasNet [1] | 224 x 224 x 3 | 32.2 | 332.16 | 907.76 | |
ResNeXt-50 [1] | 224 x 224 x 3 | 1480 | 6.811 | 907.76 | |
SqueezeNet [1] | 224 x 224 x 3 | 34.4 | 116.89 | 907.76 | |
Transformer [1] | 224 x 224 x 3 | 9.5 | 6150 | 907.76 | |
U-Net [1] | 224 x 224 x 3 | 3286 | 2.23 | 907.6 | |
VGG-16 [1] | 224 x 224 x 3 | 121 | 63.4 | 907.76 | |
BERT-L [2] | 128/128 | 3700 | 806.89 | | |
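Note that throughput is not the reciprocal of latency: with compile-time PE clustering and pipelining, multiple inputs are in flight at once. The short Python sketch below derives the implied in-flight concurrency from the table values, assuming latency and throughput were measured under the same steady-state conditions.

```python
# Implied in-flight concurrency: frames resident in the pipeline at once.
# concurrency ~= (latency_ms / 1000) * throughput_fps; a value near 1 would
# mean strictly serial execution, while larger values indicate pipelining
# across clustered PEs. Numbers are copied from the table above.
workloads = {
    "MobileNet v2": (10, 1328),
    "Transformer":  (9.5, 6150),
    "ResNeXt-50":   (1480, 6.811),
    "VGG-16":       (121, 63.4),
}

for name, (latency_ms, throughput_fps) in workloads.items():
    concurrency = latency_ms / 1000 * throughput_fps
    print(f"{name}: ~{concurrency:.1f} frames in flight")
# MobileNet v2: ~13.3, Transformer: ~58.4, ResNeXt-50: ~10.1, VGG-16: ~7.7
```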