Neuromorphic Benchmarking

Neuromorphic Processing Element Array

Block Diagram

The AI accelerator is built around a 2-D mesh of identical processing elements (PEs), each pairing an L1 scratchpad (SPad) SRAM with a tunable arithmetic pipeline whose multiplier bit-width, adder bit-width, and vector length can be set per workload. PEs exchange data over short, pipelined links across the mesh, while a centrally positioned, multi-ported L2-class global buffer streams weights and feature maps to the array and gathers partial results. This tiered memory hierarchy, with a private L1 SPad for high-reuse data and a shared L2 for larger tensors, minimizes off-chip traffic and lets the accelerator scale smoothly from energy-efficient INT8 inference to higher-precision operation simply by retargeting the per-PE arithmetic parameters, without altering the mesh fabric or the software stack. We also optimize dataflow and throughput by clustering multiple processing elements at model compilation time, as sketched below.
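As a concrete illustration, the sketch below shows how a compiler pass might represent the per-PE arithmetic parameters and compile-time PE clustering described above. All names (PEConfig, Cluster, compile_workload) and the preset bit-widths are hypothetical placeholders, not the accelerator's actual toolchain API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-PE arithmetic parameters described above.
# Names and preset values are illustrative, not the accelerator's real API.

@dataclass(frozen=True)
class PEConfig:
    """Per-workload arithmetic settings applied to every PE in a cluster."""
    multiplier_bits: int   # multiplier bit-width (e.g., 8 for INT8 inference)
    adder_bits: int        # adder bit-width (wider, to hold accumulations)
    vector_length: int     # vector lanes per PE pipeline

@dataclass(frozen=True)
class Cluster:
    """A group of PEs fused at compile time to improve dataflow/throughput."""
    pe_coords: tuple       # (row, col) positions of member PEs in the 2-D mesh
    config: PEConfig       # one arithmetic configuration shared by the cluster

def compile_workload(precision: str, mesh_rows: int, mesh_cols: int) -> Cluster:
    """Pick an arithmetic configuration for a workload and cluster the mesh.

    A real compiler would partition the mesh per layer or per tensor tile;
    for simplicity this sketch places every PE in a single cluster.
    """
    presets = {
        "int8": PEConfig(multiplier_bits=8,  adder_bits=32, vector_length=16),
        "fp16": PEConfig(multiplier_bits=16, adder_bits=32, vector_length=8),
    }
    coords = tuple((r, c) for r in range(mesh_rows) for c in range(mesh_cols))
    return Cluster(pe_coords=coords, config=presets[precision])

# Example: retarget a 4x4 mesh from INT8 inference to FP16 operation
# without touching the mesh fabric description itself.
print(compile_workload("int8", 4, 4).config)
print(compile_workload("fp16", 4, 4).config)
```

The key design point this mirrors is that only the per-PE arithmetic configuration changes between precision targets; the mesh topology and the rest of the flow stay fixed.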

Neuromorphic workload   Input resolution / sequence length   Latency (ms)   Throughput (FPS)   Area (mm²)   $ per token
MobileNet v2 [1]        224 x 224 x 3                        10             1328               907.69       -
GEMM [1]                224 x 224 x 3                        3929           3.101              907.69       -
MnasNet [1]             224 x 224 x 3                        32.2           332.16             907.76       -
ResNeXt-50 [1]          224 x 224 x 3                        1480           6.811              907.76       -
SqueezeNet [1]          224 x 224 x 3                        34.4           116.89             907.76       -
Transformer [1]         224 x 224 x 3                        9.5            6150               907.76       -
U-Net [1]               224 x 224 x 3                        3286           2.23               907.6        -
VGG-16 [1]              224 x 224 x 3                        121            63.4               907.76       -
BERT-L [2]              128/128                              3700           -                  806.89       -
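To compare workloads on a common footing, the short script below derives area-normalized throughput (FPS/mm²) from the throughput and area columns above. This derived metric is our illustration only; it is not reported in the table itself, and the dictionary holds a subset of the rows.

```python
# Illustrative only: area-normalized throughput (FPS/mm²) derived from the
# benchmark table above. The metric is a derivation, not a reported result.
results = {
    # workload: (throughput_fps, area_mm2), values copied from the table
    "MobileNet v2": (1328.0, 907.69),
    "MnasNet":      (332.16, 907.76),
    "Transformer":  (6150.0, 907.76),
}

for name, (fps, area) in results.items():
    print(f"{name}: {fps / area:.3f} FPS/mm^2")
```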