Memory Ordering Unit
Description
The hs_npu_memory_ordering module is a control unit responsible for orchestrating data flow between memory and the inference unit in the NPU. It uses a state machine to manage loading of input matrices, weights, biases, and sums, as well as saving output data back to memory. Key functionalities include handling memory requests, controlling data read and write operations, and setting up FIFOs for input, weight, and output buffers. This module also supports reusing weights and inputs, applying biases, and shifting values for activation functions. Its execution can be configured through control signals, with outputs aligned for the systolic array dimensions, ensuring data preparation for matrix multiplication and inference tasks.
Overview and Responsibilities
- State Machine:
  - The module transitions between states like IDLE, LOADING_WEIGHTS, LOADING_INPUTS, LOADING_BIAS, LOADING_SUMS, READY_TO_COMPUTE, and SAVING. Each state has specific responsibilities, and transitions occur based on conditions such as memory validity signals and reuse flags.
- Data Loading and Memory Requests:
  - During the loading phases, this module issues memory requests and loads data (weights, inputs, bias, sums) from memory or previous inference results based on the configuration. It uses BURST_SIZE to control the amount of data per transfer.
  Danger
  As discussed earlier, changing the default value of BURST_SIZE might cause errors.
- Computation Control:
  - In the READY_TO_COMPUTE state, the module controls gates (start_input_gatekeeper and start_output_gatekeeper) and enables the computation within the inference unit, coordinating the input and output flow through the gatekeepers.
  Info
  This is absolutely central to the NPU because the timing is very sensitive: starting one cycle too early or too late produces incomplete or incorrect results.
- Saving Results:
  - After computation, in the SAVING state, it can write results back to memory if save_outputs_in is asserted. It uses memory write signals (mem_write_valid_o and memory_data_out) to save computed outputs sequentially.
- FIFO Control:
  - The module manages FIFO flushing and readiness signals to synchronize data flow within the NPU, particularly for input, weight, and output FIFOs.
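To make the role of BURST_SIZE in the loading phases more concrete, the following Python sketch splits a transfer into bursts. It is only an illustration: the helper names `bursts_needed` and `burst_plan` are hypothetical, the sketch assumes that one burst delivers up to `burst_size` words, and the value used in the example is not the module's default.

```python
def bursts_needed(total_words: int, burst_size: int) -> int:
    """Number of bursts required to move total_words words (ceiling division)."""
    if total_words < 0 or burst_size <= 0:
        raise ValueError("total_words must be >= 0 and burst_size must be > 0")
    return -(-total_words // burst_size)


def burst_plan(total_words: int, burst_size: int) -> list[tuple[int, int]]:
    """Return (word_offset, valid_words) for each burst of a transfer.

    Assumption: the last burst may be only partially filled.
    """
    plan = []
    for i in range(bursts_needed(total_words, burst_size)):
        offset = i * burst_size
        plan.append((offset, min(burst_size, total_words - offset)))
    return plan


if __name__ == "__main__":
    # Example values only; they do not reflect the hardware configuration.
    print(burst_plan(10, 4))
```

A partially filled final burst is one place where a non-default BURST_SIZE could interact badly with hardcoded sizes, which may be related to the warning above, although the source does not state the exact cause.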
State Descriptions
- IDLE: Initializes parameters and waits for a valid execution signal (exec_valid_i). When received, captures layer parameters, resets relevant counters, and transitions to LOADING_WEIGHTS.
- LOADING_WEIGHTS: Loads weights from memory into output_weights until the required rows are loaded or reuse_weights is set. When done, it transitions to LOADING_INPUTS.
- LOADING_INPUTS: Loads input data from memory (or reuses prior inputs if reuse_inputs is set). After loading, it transitions to LOADING_BIAS.
- LOADING_BIAS: If use_bias is asserted, loads bias values into bias. When completed, moves to LOADING_SUMS.
- LOADING_SUMS: Similar to the bias load, but for sums if use_sum is asserted. After loading, transitions to READY_TO_COMPUTE.
- READY_TO_COMPUTE: Activates input and output gatekeepers for controlled data flow into the inference unit. Sets up the module for computation based on computation_cycles and then transitions to SAVING to store results.
- SAVING: Saves the output data to memory if save_outputs_in is asserted. Once all data is saved, resets for the next cycle and goes back to IDLE.
Info
Note that the unit assumes that all the data is concurrently stored in memory, in a specific order!
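The state sequence described above can be summarized as a small behavioural model. The Python sketch below is not the RTL: the `Flags` record and its `phase_done` field are hypothetical stand-ins for the internal counters and memory handshakes, and the sketch assumes that a skipped phase (reuse flag set, or bias/sum disabled) advances immediately.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOADING_WEIGHTS = auto()
    LOADING_INPUTS = auto()
    LOADING_BIAS = auto()
    LOADING_SUMS = auto()
    READY_TO_COMPUTE = auto()
    SAVING = auto()


@dataclass
class Flags:
    """Hypothetical per-cycle view of the conditions named in the text."""
    exec_valid: bool = False     # exec_valid_i
    reuse_weights: bool = False
    reuse_inputs: bool = False
    use_bias: bool = False
    use_sum: bool = False
    save_outputs: bool = False
    phase_done: bool = False     # assumed: current load/compute/save phase finished


def next_state(state: State, f: Flags) -> State:
    """Return the next state; stay in place until the phase is done or skipped."""
    if state is State.IDLE:
        return State.LOADING_WEIGHTS if f.exec_valid else state
    if state is State.LOADING_WEIGHTS:
        return State.LOADING_INPUTS if (f.reuse_weights or f.phase_done) else state
    if state is State.LOADING_INPUTS:
        return State.LOADING_BIAS if (f.reuse_inputs or f.phase_done) else state
    if state is State.LOADING_BIAS:
        return State.LOADING_SUMS if (not f.use_bias or f.phase_done) else state
    if state is State.LOADING_SUMS:
        return State.READY_TO_COMPUTE if (not f.use_sum or f.phase_done) else state
    if state is State.READY_TO_COMPUTE:
        return State.SAVING if f.phase_done else state
    # SAVING: assumed to return to IDLE at once when there is nothing to write.
    return State.IDLE if (not f.save_outputs or f.phase_done) else state
```

For example, with reuse_weights set, `next_state(State.LOADING_WEIGHTS, Flags(reuse_weights=True))` moves straight to LOADING_INPUTS without waiting for a load to finish.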
Control Signal Assignments
- exec_ready_o: Indicates readiness for a new operation when in IDLE.
- mem_read_ready_o and mem_write_valid_o: Control memory read/write based on the state and internal flags.
- flush_input_fifos, flush_weight_fifos, and flush_output_fifos: Manage FIFO flushing in relevant states.
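As a rough illustration of how such outputs can be derived from the current state, the Python sketch below models two of them. It is a simplification under stated assumptions: `control_outputs` is a hypothetical helper, the internal flags and memory handshake are ignored, and mem_write_valid_o is assumed to be asserted only in SAVING when outputs are to be saved. The flush signals are omitted because the text does not say in which states they are active.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOADING_WEIGHTS = auto()
    LOADING_INPUTS = auto()
    LOADING_BIAS = auto()
    LOADING_SUMS = auto()
    READY_TO_COMPUTE = auto()
    SAVING = auto()


def control_outputs(state: State, save_outputs: bool) -> dict[str, bool]:
    """Combinational view of two control outputs (simplified, assumed logic)."""
    return {
        # Ready for a new operation only while idle, as stated in the text.
        "exec_ready_o": state is State.IDLE,
        # Assumption: writes are only issued while saving requested outputs.
        "mem_write_valid_o": state is State.SAVING and save_outputs,
    }
```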
Output Data and Gatekeeper Configuration
- output_weights, output_inputs, output_bias, and output_sums hold data that will be transferred to inference and accumulation units.
- start_input_gatekeeper, start_output_gatekeeper, and enable_cycles_gatekeeper configure gatekeepers, allowing smooth data flow within the processing unit.
I/O Table
Input Signals
Input Name | Direction | Type | Description |
---|---|---|---|
clk | Input | logic | Clock signal for synchronization. |
rst_n | Input | logic | Active-low reset signal. |
exec_valid_i | Input | logic | Indicates that an execution request is valid. |
mem_valid_i | Input | logic | Indicates that a memory read or write request is valid. |
mem_ready_i | Input | logic | Memory interface ready signal for data transfers. |
memory_data_in | Input | uword [BURST_SIZE] | Data matrix values read from memory. |
num_input_rows_in | Input | uword | Number of rows in the input matrix. |
num_input_columns_in | Input | uword | Number of columns in the input matrix. |
num_weight_rows_in | Input | uword | Number of rows in the weight matrix. |
num_weight_columns_in | Input | uword | Number of columns in the weight matrix. |
reuse_inputs_in | Input | logic | Control signal to reuse inputs across computations. |
reuse_weights_in | Input | logic | Control signal to reuse weights across computations. |
save_outputs_in | Input | logic | Control signal to save outputs after computation. |
use_bias_in | Input | logic | Enables bias addition in the computation. |
use_sum_in | Input | logic | Enables sum accumulation in the computation. |
shift_amount_in | Input | uword | Specifies the amount to shift results after computation. |
activation_select_in | Input | logic | Selects the activation function to apply to results. |
base_address_in | Input | uword | Base address for memory accesses. |
result_address_in | Input | uword | Address to store the computation results. |
inference_result | Input | logic [INPUT_DATA_WIDTH-1:0] [SIZE] | Final output from inference. |
Output Signals
Output Name | Direction | Type | Description |
---|---|---|---|
exec_ready_o | Output | logic | Indicates readiness to accept a new execution command. |
finished | Output | logic | Indicates that the current operation has completed. |
mem_read_ready_o | Output | logic | Indicates readiness for memory read operations. |
mem_write_valid_o | Output | logic | Indicates that a memory write operation is valid. |
mem_invalidate | Output | logic | Signals to invalidate read data. |
memory_data_out | Output | uword [BURST_SIZE] | Data matrix values written to memory. |
request_address | Output | uword | Address for requesting data from memory. |
flush_input_fifos | Output | logic | Signal to flush the input FIFOs. |
input_fifo_valid_o | Output | logic | Indicates that data in input FIFO is valid. |
flush_weight_fifos | Output | logic | Signal to flush the weight FIFOs. |
weight_fifo_valid_o | Output | logic | Indicates that data in weight FIFO is valid. |
flush_output_fifos | Output | logic | Signal to flush the output FIFOs. |
output_fifo_ready_o | Output | logic | Indicates that the output FIFO is ready. |
output_fifo_reread | Output | logic | Signal to reread data from the output FIFO. |
bias_enable | Output | logic | Enable signal for bias in computation. |
weight_enable | Output | logic | Enable signal for weight data in computation. |
start_input_gatekeeper | Output | logic | Signal to start the input gatekeeper. |
start_output_gatekeeper | Output | logic | Signal to start the output gatekeeper. |
enable_cycles_gatekeeper | Output | uword | Number of cycles for enabling the gatekeeper. |
activation_select_out | Output | logic | Output activation function selection. |
shift_amount_out | Output | uword | Output shift amount for the computation result. |
output_weights | Output | logic [INPUT_DATA_WIDTH-1:0] [SIZE] | Output weight matrix values. |
output_inputs | Output | logic [INPUT_DATA_WIDTH-1:0] [SIZE] | Output input matrix values. |
output_bias | Output | logic [OUTPUT_DATA_WIDTH-1:0] [SIZE] | Output bias values. |
output_sums | Output | logic [OUTPUT_DATA_WIDTH-1:0] [SIZE] | Output sums values. |
State Machine Diagram
This diagram presents a basic overview of the state machine and its transitions.
Related Files
File Name | Type |
---|---|
hs_npu_memory_ordering | Top |
Additional Comments
As discussed earlier, refactoring this module would resolve most of the current inflexibilities of the ScaleNPU in terms of sizes and parameters. Up until now, following the uarch section order, this is the only module that hardcodes specific sizes and parameters into the logic.
It should also be noted that this module is quite complex: it has to manage the AXI sizes, bursts, and synchronization, as well as the MM unit requirements and special timing. While this flexibility would provide a nice boost in performance, using the default values (whose correct functionality has been validated) is still significantly faster than using the ScaleCore-V for inference. The current pain point is the software interface, as the programmer must interact with the CSRs directly. A software driver would be the most beneficial addition to the ScaleNPU at this time.