## **IEEE Custom Integrated Circuits Conference 13-2: KeyRAM: A 0.34 µJ/decision 18 k decisions/s Recurrent Attention In-memory Processor for Keyword Spotting**

Hassan Dbouk, Sujan K. Gonugondla, Charbel Sakr, and Naresh R. Shanbhag University of Illinois at Urbana-Champaign

03/24/2020





## Outline

- Motivation and Background

- Recurrent Attention Model for KWS
- Implementation
  - Chip Architecture
  - Sparsity-aware IMC Block
  - DM<sup>2</sup>VM Digital Block
- Measurement Results
- Summary



## **Motivation**

- Speech is a natural mode for humans to interact with intelligent Edge devices
- Edge devices are often constrained in terms of *storage, power,* and *compute resources*
- Keyword spotting (KWS) systems are used to detect specific wake-up words








Goal: An end-to-end energy efficient and low latency solution for keyword spotting



## **Typical KWS Pipeline**



- Feature extraction: Mel-frequency Cepstral Coefficients (MFCC)
- What is a good classifier?

#### Prior works

Small (S) model class with an 80 KB memory and 6 MOps budget, from Hello Edge [Zhang, arXiv 2018]:

| NN model   | Acc.  | Mem.   | Ops    |
|------------|-------|--------|--------|
| DNN        | 84.6% | 80.0KB | 158.8K |
| CNN        | 91.6% | 79.0KB | 5.0M   |
| Basic LSTM | 92.0% | 63.3KB | 5.9M   |
| LSTM       | 92.9% | 79.5KB | 3.9M   |
| GRU        | 93.5% | 78.8KB | 3.8M   |
| CRNN       | 94.0% | 79.7KB | 3.0M   |
| DS-CNN     | 94.4% | 38.6KB | 5.4M   |

## Vanilla RNN for KWS



subset of features to be processed

• Sequential processing: the RNN must process the entire sequence of MFCC input features



## Vanilla RNN for KWS (t=N)

#### Can we design a better classifier?








## **Recurrent Attention Model (RAM) for KWS**



• Originally proposed for image classification [Mnih, NIPS'14]






• RAM: processes the input via glimpses, learns what glimpses to process






• More glimpses processed  $\rightarrow$  more confident decisions


• Inherent accuracy-complexity (energy & latency) tradeoff
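The tradeoff can be illustrated with a toy glimpse loop. This is only a minimal sketch, not the KeyRAM network: the weights are random, and the layer sizes, the `tanh` core, the sigmoid location head, and the names `run_ram`, `glimpse_size`, and `hidden` are all assumptions made for illustration. It shows that the model reads only a few patches of the MFCC feature map and that the MAC count grows linearly with the number of glimpses.

```python
import numpy as np

def run_ram(features, num_glimpses, glimpse_size=4, hidden=16, classes=3, seed=0):
    """Toy recurrent attention loop (hypothetical sizes, random weights)."""
    rng = np.random.default_rng(seed)
    n_frames, n_coeffs = features.shape
    d_glimpse = glimpse_size * n_coeffs
    W_g = rng.standard_normal((hidden, d_glimpse)) * 0.1  # glimpse encoder
    W_h = rng.standard_normal((hidden, hidden)) * 0.1     # recurrent core
    W_l = rng.standard_normal((1, hidden)) * 0.1          # next-location head
    W_c = rng.standard_normal((classes, hidden)) * 0.1    # classifier head

    h = np.zeros(hidden)
    loc = 0      # first glimpse starts at frame 0 (assumption)
    macs = 0
    for _ in range(num_glimpses):
        # Read only a small patch of frames instead of the whole input.
        patch = features[loc:loc + glimpse_size].ravel()
        h = np.tanh(W_g @ patch + W_h @ h)
        # The state decides where to look next, mapped to a valid frame index.
        frac = 1.0 / (1.0 + np.exp(-(W_l @ h)[0]))
        loc = int(round(frac * (n_frames - glimpse_size)))
        macs += W_g.size + W_h.size + W_l.size
    logits = W_c @ h
    macs += W_c.size
    return int(np.argmax(logits)), macs

# Hypothetical input: 49 frames of 10 MFCC coefficients.
features = np.ones((49, 10))
pred_3, macs_3 = run_ram(features, num_glimpses=3)
pred_6, macs_6 = run_ram(features, num_glimpses=6)
```

Doubling the number of glimpses roughly doubles `macs`, which is the complexity side of the tradeoff; the accuracy side depends on trained weights and is not modeled here.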

## **Efficiency of RAM for KWS**

• KWS for 12 keywords using the Google Speech dataset

As reported by [Zhang, arXiv 2018]





## **Efficiency of RAM for KWS**

- KWS for 12 keywords using the Google Speech dataset
- RAM achieves a 4.6 × reduction in computational complexity at iso-accuracy










## **Architectural Choices**

#### Digital



#### In-Memory Compute (IMC)

- **Pros:** energy efficient, massive parallelism
- **Cons:** non-reconfigurable, low precision


## **Mapping of RAM**

- 6 fully connected layers (fc1 to fc6)
- All weights on-chip

| layer | $d_{in}$ | $d_{out}$                     | $B_x$ | $B_w$ | #MACs | %MACs | Mapped to |
|-------|----------|-------------------------------|-------|-------|-------|-------|-----------|
| fc1   | 2        | 63                            | 8     | 8     | 189   | 0.35  | DIGITAL   |
| fc2   | 64       | 64                            | 8     | 8     | 4160  | 7.63  | DIGITAL   |
| fc3   | 127      | 127                           | 4     | 4     | 16256 | 29.81 | IMC       |
| fc4   | 254      | 127                           | 4     | 4     | 32385 | 59.39 | IMC       |
| fc5   | 127      | 10                            | 8     | 8     | 1280  | 2.35  | DIGITAL   |
| fc6   | 127      | 2                             | 8     | 8     | 256   | 0.47  | DIGITAL   |

fc3 & fc4: 89% of computations
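
Every entry in the #MACs column equals $(d_{in} + 1) \cdot d_{out}$, which suggests that one bias term per output is counted. That reading is inferred from the numbers and is not stated on the slide. Under that assumption, the short script below reproduces the table's counts and the 89% share mapped to IMC.

```python
# Layer dimensions (d_in, d_out) copied from the mapping table.
layers = {
    "fc1": (2, 63),
    "fc2": (64, 64),
    "fc3": (127, 127),
    "fc4": (254, 127),
    "fc5": (127, 10),
    "fc6": (127, 2),
}
IMC_LAYERS = {"fc3", "fc4"}

def mac_count(d_in, d_out):
    # Assumption: one extra multiply-accumulate per output for the bias.
    return (d_in + 1) * d_out

macs = {name: mac_count(*dims) for name, dims in layers.items()}
total_macs = sum(macs.values())
imc_share = sum(macs[name] for name in IMC_LAYERS) / total_macs

for name, count in macs.items():
    print(f"{name}: {count:6d} MACs ({100 * count / total_macs:5.2f}%)")
print(f"IMC share: {100 * imc_share:.1f}%")
```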








## **Chip Architecture**

- Main controller
  - Synchronizes all chip operations
  - 6 main modes of operation
  - Runs on a 1 GHz external clock
- Two IMC blocks
  - $512 \times 256$ standard 6T SRAM banks
  - Execute fc3 and fc4
- Four single-slope ADCs
  - Operate at 10 MSample/s
  - Two 6-b ADCs required per IMC dot product (differential design)
- Digital processor
  - 6 kB of SRAM + 64 8b MAC units
  - Executes fc1, fc2, fc5, & fc6
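
As a rough consistency check, the storage figures above can be added up. The sketch assumes one $512 \times 256$ bank per IMC block, 1 kB = 1024 bytes, one layer per IMC block, and weight counts of $(d_{in} + 1) \cdot d_{out}$ taken from the mapping table; none of these assumptions is stated explicitly on the slides.

```python
# Capacity, assuming one 512 x 256-bit bank per IMC block.
IMC_BANK_BYTES = 512 * 256 // 8        # 16384 bytes
NUM_IMC_BLOCKS = 2
DIGITAL_SRAM_BYTES = 6 * 1024

total_kb = (NUM_IMC_BLOCKS * IMC_BANK_BYTES + DIGITAL_SRAM_BYTES) / 1024

# Weight footprints, assuming weights + biases per the mapping table.
fc3_bytes = 16256 * 4 / 8              # 4-bit weights
fc4_bytes = 32385 * 4 / 8              # 4-bit weights
digital_bytes = (189 + 4160 + 1280 + 256) * 8 / 8   # 8-bit weights

print(f"total on-chip storage: {total_kb:.0f} kB")
print(f"fc3 fits one bank: {fc3_bytes <= IMC_BANK_BYTES}")
print(f"fc4 fits one bank: {fc4_bytes <= IMC_BANK_BYTES}")
print(f"digital weights fit 6 kB: {digital_bytes <= DIGITAL_SRAM_BYTES}")
```

Under these assumptions the total comes to 38 kB, which matches the memory capacity quoted with the chip micrograph.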








## **Sparsity-aware IMC Block**

- Standard 6T SRAM bank
- Multi-bit dot products via four stages (adapted from [Gonugondla, ISSCC'18])
  - Stage 1: Pulse-width modulated word-lines perform D2A conversion of weights on each bit-line (BL)
  - Stage 2: BL discharges are multiplied with the corresponding input data from buffers via charge redistribution
  - Stage 3: Multiplier outputs are summed across the columns via charge sharing across BLs
  - Stage 4: Final dot product voltage is converted to digital via ADCs
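
The four stages can be mimicked with an idealized behavioral model. This is a sketch under strong assumptions: unsigned integer codes, perfectly linear discharge, no noise, and no differential signalling. The function name `imc_dot_product` and its parameters are hypothetical and do not describe the actual circuit.

```python
import numpy as np

def imc_dot_product(weights, inputs, w_bits=4, x_bits=4, adc_bits=6):
    """Idealized four-stage in-memory dot product on unsigned integer codes."""
    w = np.asarray(weights, dtype=float)
    x = np.asarray(inputs, dtype=float)

    # Stage 1: PWM word-lines turn each stored weight into a BL discharge,
    # modeled here as a value normalized to [0, 1].
    bl_discharge = w / (2**w_bits - 1)

    # Stage 2: charge redistribution scales each discharge by its input.
    products = bl_discharge * (x / (2**x_bits - 1))

    # Stage 3: charge sharing across BLs yields the average of the products.
    shared = products.mean()

    # Stage 4: the ADC quantizes the shared voltage to an integer code.
    code = int(round(shared * (2**adc_bits - 1)))
    return code, shared

full_code, full_v = imc_dot_product([15] * 8, [15] * 8)   # all at full scale
zero_code, zero_v = imc_dot_product([15] * 8, [0] * 8)    # all inputs zero
```

Because stage 3 averages rather than sums, the digital result is a scaled version of the true dot product; recovering the exact scale is left out of this sketch.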



## **Input Sparsity Challenge**



- ReLU activation functions cause sparse inputs (~ 50% - 70%)
- Output voltage spread shrinks due to charge sharing
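
A small numerical sketch shows why sparsity shrinks the output spread. It assumes ideal charge sharing that averages over all columns, including those whose input is zero, and a 6-bit ADC whose full scale is fixed; the helper names are hypothetical, and the chip's actual sparsity-aware remedy is not modeled.

```python
import numpy as np

def max_shared_output(n_cols, sparsity):
    """Largest normalized voltage when a fraction `sparsity` of inputs is zero."""
    n_active = round(n_cols * (1 - sparsity))
    products = np.zeros(n_cols)
    products[:n_active] = 1.0          # active columns at full scale
    return products.mean()             # charge sharing averages all columns

def usable_adc_codes(n_cols, sparsity, adc_bits=6):
    """Highest ADC code reachable under the given input sparsity."""
    return int(max_shared_output(n_cols, sparsity) * (2**adc_bits - 1))

dense_max = max_shared_output(128, 0.0)
half_max = max_shared_output(128, 0.5)
half_codes = usable_adc_codes(128, 0.5)
```

With half of the inputs at zero, the reachable voltage range halves, so roughly one bit of ADC resolution goes unused in this idealized model.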










## DM<sup>2</sup>VM: Digital Processor

- Array of 64 8b MAC PEs
- 6kB of SRAM for weight storage
- Flexible support (fc1, fc2, fc5, fc6)
- Designed to minimize idle cycles when inputs/outputs are streamed in/out
- Completes an $N \times M$ MVM in a fixed number, $N + M$, of cycles

Figure: Principle of the diagonal major MVM (DM<sup>2</sup>VM) processor for a $4 \times 4$ FC layer; outputs are streamed out on cycles 4-7.
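
One schedule that is consistent with the $N + M$ cycle count and with outputs leaving on cycles 4-7 in the $4 \times 4$ case is a diagonal wavefront: on cycle $t$, PE $j$ consumes input $t - j$, so the weights used in a given cycle lie on one anti-diagonal of the matrix. The sketch below simulates that schedule. It is an assumed reading of the figure caption, and the real DM<sup>2</sup>VM dataflow may differ.

```python
import numpy as np

def diagonal_schedule_mvm(W, x):
    """Simulate y = W @ x with PE j lagging PE j-1 by one cycle.

    W has shape (M, N): M outputs (one per PE) and N streamed inputs.
    Returns the result and the cycle on which each output streams out.
    """
    M, N = W.shape
    acc = np.zeros(M)
    out_cycle = {}
    for t in range(N + M):              # cycles 0 .. N + M - 1
        for j in range(M):
            i = t - j                   # input index PE j sees on cycle t
            if 0 <= i < N:
                acc[j] += W[j, i] * x[i]
            elif i == N:
                out_cycle[j] = t        # PE j is done; stream its output
    return acc, out_cycle

W = np.arange(16, dtype=float).reshape(4, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
y, cycles = diagonal_schedule_mvm(W, x)
```

In this model every PE is busy for exactly $N$ consecutive cycles, and one output becomes available per cycle from cycle $N$ onward, so input and output streaming never leave the array waiting.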






## **System Performance**

- Energy/throughput tunable by varying V<sub>WL</sub> and number of glimpses
- Measured results per glimpse:

| Energy/glimpse | Latency/glimpse |
|----------------|-----------------|
| 0.11µJ         | 18.2µs          |






$1000\times$ faster than a typical human reaction time
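
The per-glimpse measurements can be scaled to per-decision figures. The sketch assumes energy and latency grow linearly with the number of glimpses at a fixed V<sub>WL</sub>, and it uses three glimpses as an example operating point; both are assumptions rather than statements from the slides.

```python
ENERGY_PER_GLIMPSE_UJ = 0.11    # measured, from the table above
LATENCY_PER_GLIMPSE_US = 18.2   # measured, from the table above

def per_decision(num_glimpses):
    """Linear-scaling estimate of energy, latency and throughput."""
    energy_uj = num_glimpses * ENERGY_PER_GLIMPSE_UJ
    latency_us = num_glimpses * LATENCY_PER_GLIMPSE_US
    decisions_per_s = 1e6 / latency_us
    return energy_uj, latency_us, decisions_per_s

energy_uj, latency_us, rate = per_decision(3)
print(f"{energy_uj:.2f} uJ/decision, {latency_us:.1f} us, {rate / 1e3:.1f} k decisions/s")
```

For three glimpses this gives about 0.33 µJ and 55 µs per decision, close to the 0.34 µJ/decision and 18 k decisions/s quoted in the title.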





## **Energy Measurements**

- IMC consumes 68% of the total energy and implements 89% of computations
- $7.4\times$ better energy/decision compared to a digital RNN implementation

Figure: energy/decision [µJ] for RAM\*, KeyRAM, and a digital RNN\* (\*estimated from DM<sup>2</sup>VM measurements), annotated with $2\times$ and $3.7\times$ reductions labelled RAM and IMC.

## **Measured Classification on Google Speech**







correct classification of one keyword after three glimpses



test set accuracy increases with number of glimpses



## **Chip Micrograph**

| Technology      | 65nm             |
|-----------------|------------------|
| Die Size        | 1.78mm × 2.32mm  |
| Memory Capacity | 38kB             |
| Nominal Supply  | 1.0V             |
| CTRL Frequency  | 1GHz             |
| Latency         | 0.05ms – 0.15ms  |
| Energy/dec      | 0.34µJ – 1.043µJ |
| Algorithm       | RAM              |





## **Comparison with State-of-the-art**

|                             | ISSCC'17          | CICC'18           | ESSCIRC'18                         | VLSI'19       | This Work                                      |
|-----------------------------|-------------------|-------------------|------------------------------------|---------------|------------------------------------------------|
| Technology                  | 65 nm             | 65 nm             | 65 nm                              | 65 nm         | 65 nm                                          |
| Algorithm                   | DNN               | LSTM              | LSTM                               | Binarized-RNN | RAM                                            |
| Dataset                     | TIDIGITS          | TIMIT             | TIMIT                              | Google Speech | Google Speech                                  |
| # of Classes                | 11                | 39                | $4^{\rm a}$                        | 10            | 7                                              |
| Test Accuracy [%]           | 98.35             | 80.4              | —                                  | 90.2          | 90.38                                          |
| On-chip Storage [kB]        | 747.52            | 82                | 32                                 | 18            | 38                                             |
| Area [mm<sup>2</sup>]       | 9.61              | 1.57              | 1.035                              | 6.2           | 4.13                                           |
| Energy/Decision [µJ]        | $6.4^{\rm d}$     | $9.54^{\rm d}$    | 0.06                               | 3.36          | $0.34 - 1.043^{\rm b}$ $(0.57 - 1.62)^{\rm c}$ |
| Decision Latency [ms]       | $37^{\rm d}$      | $0.77^{\rm d}$    | $12^{\rm d}$                       | 0.13          | $0.05 - 0.15^{\rm b}$                          |
| # of MACs/Decision          | —                 | —                 | $5.8\mathrm{k} - 27.2\mathrm{k}$   | —             | $273\mathrm{k} - 730\mathrm{k}^{\rm b}$        |
| Energy-Delay Product [pJ·s] | $239\mathrm{k}^{\rm d}$ | $7.3\mathrm{k}^{\rm d}$ | 720                    | 430           | $18 - 152^{\rm b}$ $(31 - 236)^{\rm c}$        |
| Supply Voltage [V]          | 0.6 - 1.2         | 0.75 - 1.24       | 0.575                              | 0.9 - 1.1     | 1                                              |
| Energy Efficiency [TOPS/W]  | —                 | 3.08              | —                                  | 11.7          | $1.6$ $(0.91)^{\rm c}$                         |

<sup>a</sup> 4 binary classifiers

<sup>b</sup> with changing $V_{WL}$ and # of glimpses

<sup>c</sup> with CTRL energy included

<sup>d</sup> estimated from reported data

- Lowest reported decision latency
- More than $23\times$ reduction in EDP
- $3\times$ to $10\times$ reduction in energy/decision
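
The energy-delay product (EDP) row follows from the energy and latency rows. The sketch below recomputes a few entries from the rounded table values, so small differences from the tabulated EDP (for example about 17 versus 18 pJ·s for this work) are expected; the function name `edp_pj_s` is a hypothetical helper.

```python
def edp_pj_s(energy_uj, latency_ms):
    """Energy-delay product in pJ*s from energy in uJ and latency in ms."""
    # uJ * ms = 1e-9 J*s = 1000 pJ*s
    return energy_uj * latency_ms * 1e3

esscirc18 = edp_pj_s(0.06, 12)        # table: 720
vlsi19 = edp_pj_s(3.36, 0.13)         # table: 430
this_work_best = edp_pj_s(0.34, 0.05)
this_work_worst = edp_pj_s(1.043, 0.15)

print(f"ESSCIRC'18: {esscirc18:.0f} pJ*s, VLSI'19: {vlsi19:.0f} pJ*s")
print(f"This work: {this_work_best:.0f} to {this_work_worst:.0f} pJ*s")
print(f"EDP gain over VLSI'19 at best point: {vlsi19 / this_work_best:.1f}x")
print(f"Energy gain over VLSI'19: {3.36 / 1.043:.1f}x to {3.36 / 0.34:.1f}x")
```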




## **Summary**

- Energy efficient and low latency KWS systems are of utmost importance
- We adopt an algorithm-hardware co-design approach by proposing:
  - Novel classification algorithm for KWS using RAM
  - Sparsity-aware IMC-based computations for energy efficient dot product operations
- KeyRAM: a classifier IC in 65nm for KWS achieving state-of-the-art decision latency of $50\,\mu$s with $< 0.5\,\mu$J/decision



## Acknowledgements

This work was sponsored by the AFRL and DARPA under agreement FA8650-18-2-7866 as part of the FRANC program.



# Thank you



## **Backup Slides**



## **Mel-frequency Cepstral Coefficients (MFCC)**



