SimpleRISC ALU — High-Performance 32-bit ALU Design

A fully synthesizable, high-performance Arithmetic Logic Unit (ALU) designed for the SimpleRISC single-cycle processor core, implemented in Verilog RTL. The design targets a 250 MHz clock frequency and supports all arithmetic, logical, shift, and comparison operations defined by the SimpleRISC ISA.

Overview

This project replaces the placeholder ALU in the given SimpleRISC RTL core with a performance-optimized implementation. All functional units are purely combinational and evaluate in parallel, and a final MUX selects the result based on the 4-bit op control signal. No unit waits on another, which keeps the critical path short and is what makes the 250 MHz timing target realistic for the synthesizer.

Inputs:

a[31:0] — Operand A
b[31:0] — Operand B
op[3:0] — Operation select

Outputs:

y[31:0] — Result
zero — Flag: asserted when y == 0

ALU Operation Set

Opcode	Mnemonic	Description
`0000`	ADD	Signed/unsigned addition
`0001`	SUB	Signed/unsigned subtraction
`0010`	AND	Bitwise AND
`0011`	OR	Bitwise OR
`0100`	XOR	Bitwise XOR
`0101`	SLT	Set to 1 if A < B (signed), else 0
`0110`	SLL	Logical shift left
`0111`	SRL	Logical shift right
`1000`	SRA	Arithmetic shift right
`1001`	PASS	Pass operand B through
`1010`	NOT	Bitwise NOT of operand B
`1011`	MUL	Signed 32×32 multiplication
`1100`	DIV	Signed division (quotient)
`1101`	MOD	Signed modulus (remainder)
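
As a cross-check for the table above, a behavioural reference model can be handy when writing testbench expectations. The following is a minimal Python sketch, not the RTL itself. The names alu_ref, to_signed, and trunc_divmod are hypothetical, and two points are assumptions not stated in the table: shifts move operand A by B[4:0], and the two unused opcodes (1110, 1111) return 0. Truncating division and the divide-by-zero result follow the divider description and the test cases in this README.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def trunc_divmod(n, d):
    """Signed divide truncating toward zero; returns (0, 0) when d == 0."""
    if d == 0:
        return 0, 0
    q = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        q = -q
    return q, n - q * d

def alu_ref(op, a, b):
    """Return (y, zero) for a 4-bit op and 32-bit operand patterns a, b."""
    a &= MASK32
    b &= MASK32
    sa, sb = to_signed(a), to_signed(b)
    sh = b & 0x1F  # ASSUMPTION: operand A is shifted by B[4:0]
    quot, rem = trunc_divmod(sa, sb)
    # Every result is computed, then one is selected, mirroring the
    # parallel-units-plus-mux structure of the hardware.
    results = {
        0b0000: a + b,            # ADD
        0b0001: a - b,            # SUB
        0b0010: a & b,            # AND
        0b0011: a | b,            # OR
        0b0100: a ^ b,            # XOR
        0b0101: int(sa < sb),     # SLT (signed)
        0b0110: a << sh,          # SLL
        0b0111: a >> sh,          # SRL
        0b1000: sa >> sh,         # SRA (Python >> on negatives is arithmetic)
        0b1001: b,                # PASS
        0b1010: ~b,               # NOT
        0b1011: sa * sb,          # MUL (lower 32 bits kept below)
        0b1100: quot,             # DIV
        0b1101: rem,              # MOD
    }
    y = results.get(op, 0) & MASK32  # ASSUMPTION: undefined opcodes yield 0
    return y, y == 0
```

For example, alu_ref(0b1100, -103, 10) yields the pattern 0xFFFFFFF6 (that is, -10) with zero deasserted, matching the DIV row of the test table.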

Architecture

All functional units are instantiated simultaneously and compute results in parallel. The top-level ALU mux selects among them based on op.

Kogge-Stone Parallel Prefix Adder

File: rtl/fast_adders.v

The add/subtract unit uses a 32-bit Kogge-Stone adder (fast_ksa32). This is a parallel prefix adder that computes all carry signals in O(log₂N) logic stages (five prefix levels for N = 32), giving a shorter critical path than a ripple-carry adder, whose carry chain grows as O(N), or a block carry-lookahead design.

Subtraction is performed via 2's complement: b is inverted and cin=1 is asserted when op == SUB.
A 64-bit variant (fast_ksa64) is also included for use inside the multiplier's final accumulation stage.
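
The prefix computation described above can be sketched bit-parallel in Python, using integers as bit vectors. This is an illustrative model under assumed names (kogge_stone_add and kogge_stone_sub are hypothetical), not a transcription of fast_ksa32. The carry-in is folded into the bit-0 generate term, which is one common way to handle it.

```python
def kogge_stone_add(a, b, cin=0, width=32):
    """Add two width-bit patterns with a Kogge-Stone prefix network.

    Returns (sum, carry_out). Each loop iteration is one prefix level,
    so width=32 takes 5 levels (spans 1, 2, 4, 8, 16).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    p = a ^ b                 # per-bit propagate (also the half-sum)
    g = (a & b) | (p & cin)   # per-bit generate; cin only affects bit 0
    G, P = g, p
    d = 1
    while d < width:
        # Combine each position with the group ending d bits below it.
        G = G | (P & (G << d))
        P = P & (P << d)
        d <<= 1
    carries = ((G << 1) | cin) & mask   # carry INTO each bit position
    total = p ^ carries
    cout = (G >> (width - 1)) & 1
    return total, cout

def kogge_stone_sub(a, b, width=32):
    """a - b via two's complement: invert b and assert cin = 1."""
    mask = (1 << width) - 1
    return kogge_stone_add(a, ~b & mask, 1, width)
```

Calling kogge_stone_sub(5, 7) returns the difference pattern 0xFFFFFFFE (-2) with carry-out 0, and the same function with width=64 models the wider variant.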

Radix-4 Booth Multiplier with Dadda Tree

Files: rtl/multiplier.v, rtl/encoder.v, rtl/tree_reducer.v

The multiplier (comb_mult32x32) is built from three cascaded combinational stages (no pipeline registers):

Radix-4 Booth Encoding (radix4_booth_encoder): Encodes the 32-bit multiplier into 16 partial products (each 64 bits wide). This halves the partial-product count relative to the 32 rows of a bit-by-bit array multiplier, reducing tree height.
Dadda Tree Reduction (dadda_tree_reducer): Compresses the 16 partial products into two 64-bit operands (sum + carry) using cascaded 3:2 carry-save adders (CSAs). CSAs eliminate carry propagation during compression.
Final Kogge-Stone Addition (kogge_stone_adder_64bit): Adds the two 64-bit outputs from the Dadda tree using the fast KS adder, producing the 64-bit signed product. The lower 32 bits are returned as the ALU result.
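
The three stages above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, operand b plays the multiplier role, and the reduction loop simply applies 3:2 carry-save adders in list order. It does not reproduce the Dadda column schedule used by dadda_tree_reducer, and the final addition uses Python's + in place of the Kogge-Stone adder.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def booth_radix4_partial_products(a, b):
    """Return 16 signed partial products whose sum is signed(a) * signed(b)."""
    sa = to_signed(a)
    ub = b & MASK32
    pps = []
    prev = 0  # implicit bit b[-1] = 0
    for i in range(16):
        b0 = (ub >> (2 * i)) & 1
        b1 = (ub >> (2 * i + 1)) & 1
        digit = b0 + prev - 2 * b1        # Booth digit in {-2, -1, 0, 1, 2}
        pps.append((digit * sa) << (2 * i))
        prev = b1
    return pps

def csa(x, y, z):
    """3:2 carry-save adder on 64-bit rows: x + y + z == s + c (mod 2**64)."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & MASK64, c & MASK64

def mul32(a, b):
    """Lower 32 bits of the signed 32x32 product."""
    rows = [pp & MASK64 for pp in booth_radix4_partial_products(a, b)]
    while len(rows) > 2:                  # each CSA turns 3 rows into 2
        s, c = csa(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    product64 = (rows[0] + rows[1]) & MASK64   # final carry-propagate add
    return product64 & MASK32
```

No carry ripples during the CSA loop; carry propagation happens only in the single final addition, which is why that adder is the place where a fast prefix structure pays off.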

Non-Restoring Signed Divider

File: rtl/divider.v

The divider (signed_divider32) uses a non-restoring division algorithm operating on absolute values:

Sign bits are extracted and the result sign is computed (num_sign XOR den_sign).
The non-restoring loop iterates 32 times, conditionally adding or subtracting the divisor based on the sign of the running remainder.
A correction step restores the remainder if it ends negative.
Final sign correction is applied to both quotient and remainder.
Division by zero returns 0 for both outputs.

Note: The iterative loop unrolls fully in combinational synthesis, making this a single-cycle operation at the cost of area.
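
The steps listed above map onto a short behavioural model. This Python sketch uses a hypothetical function name and plain integers for the running remainder, so it illustrates the algorithm and not the exact register widths in rtl/divider.v. The remainder takes the sign of the dividend, consistent with the -103 mod 10 = -3 test case in this README.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def nonrestoring_divmod(num, den):
    """Return (quotient, remainder) as 32-bit patterns; (0, 0) if den == 0."""
    if den & MASK32 == 0:
        return 0, 0
    n, d = to_signed(num), to_signed(den)
    neg_q = (n < 0) != (d < 0)      # quotient sign = num_sign XOR den_sign
    neg_r = n < 0                   # remainder follows the dividend's sign
    un, ud = abs(n), abs(d)
    r, q = 0, 0
    for i in range(31, -1, -1):     # 32 iterations, MSB first
        bit = (un >> i) & 1
        if r >= 0:
            r = 2 * r + bit - ud    # remainder non-negative: subtract divisor
        else:
            r = 2 * r + bit + ud    # remainder negative: add divisor back
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:                       # correction step
        r += ud
    if neg_q:
        q = -q
    if neg_r:
        r = -r
    return q & MASK32, r & MASK32
```

Unlike restoring division, a negative remainder is not repaired inside the loop; the next iteration compensates by adding the divisor, and only the last remainder needs the single correction step.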

Barrel Shifter

File: rtl/shifter.v

The barrel shifter (shifter32_opt) supports all three shift modes via a single unified 5-stage mux chain:

SLL (Logical Left Shift): Input bits are reversed before the shift stages, then reversed again at output — converting a right-shift network into a left-shift with zero hardware duplication.
SRL (Logical Right Shift): Standard right shift, filling with 0.
SRA (Arithmetic Right Shift): Right shift filling with the sign bit (input[31]).

Stage k (k = 0 to 4) shifts by 2^k positions (1, 2, 4, 8, or 16) when bit shift_amt[k] is set and passes its input through unchanged otherwise.
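
The bit-reversal trick and the staged shift can be checked with a small Python model. The function names and the string mode argument are hypothetical conveniences for this sketch and do not reflect the actual port list of shifter32_opt.

```python
MASK32 = 0xFFFFFFFF

def rev32(x):
    """Reverse the bit order of a 32-bit pattern."""
    return int(f"{x & MASK32:032b}"[::-1], 2)

def barrel_shift(x, amt, mode):
    """Shift x by amt[4:0]; mode is 'sll', 'srl', or 'sra'."""
    x &= MASK32
    if mode == "sll":
        x = rev32(x)                          # left shift = reversed right shift
    fill = (x >> 31) & 1 if mode == "sra" else 0
    for stage in range(5):                    # stages shift by 1, 2, 4, 8, 16
        if (amt >> stage) & 1:
            k = 1 << stage
            # Top k bits are filled with the sign bit (SRA) or zero.
            fill_bits = (MASK32 << (32 - k)) & MASK32 if fill else 0
            x = (x >> k) | fill_bits
    if mode == "sll":
        x = rev32(x)
    return x
```

Only one right-shift network exists in this model; SLL reuses it by reversing the word on the way in and on the way out, which is the sharing the SLL description refers to.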

SLT Unit

File: rtl/slt.v

The Set-Less-Than unit (slt32_opt) computes signed comparison by subtracting B from A using a Kogge-Stone adder and examining the result sign bit, with overflow detection to handle mixed-sign edge cases correctly:

overflow = (sign_a != sign_b) AND (sign_diff == sign_b)
result   = sign_diff XOR overflow
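
The two equations above can be exercised directly. In this Python sketch (slt32 is a hypothetical name), the subtraction uses ordinary integer arithmetic in place of the Kogge-Stone adder; only the flag logic is modelled.

```python
MASK32 = 0xFFFFFFFF

def slt32(a, b):
    """Return 1 if a < b as signed 32-bit values, else 0."""
    a &= MASK32
    b &= MASK32
    diff = (a - b) & MASK32
    sign_a, sign_b, sign_diff = a >> 31, b >> 31, diff >> 31
    # Overflow is only possible when the operand signs differ; it shows up
    # as a difference whose sign matches B instead of A.
    overflow = (sign_a != sign_b) and (sign_diff == sign_b)
    return sign_diff ^ int(overflow)
```

The overflow term matters for cases such as a = 0x80000000 (most negative) and b = 0x7FFFFFFF (most positive): the raw difference is 1 with a clear sign bit, and the XOR with overflow flips the result to the correct value of 1.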

Project Structure

SimpleRISC-ALU/
├── rtl/                        # Synthesizable RTL source files
│   ├── alu.v                   # Top-level ALU (integrates all units)
│   ├── fast_adders.v           # 32-bit and 64-bit Kogge-Stone adders + CSA
│   ├── multiplier.v            # 32×32 Booth multiplier top-level
│   ├── encoder.v               # Radix-4 Booth encoder (16 partial products)
│   ├── tree_reducer.v          # Dadda tree partial product reducer
│   ├── divider.v               # 32-bit signed non-restoring divider
│   ├── shifter.v               # 32-bit barrel shifter (SLL/SRL/SRA)
│   ├── slt.v                   # Signed set-less-than comparator
│   ├── simplerisc_top.v        # SimpleRISC processor top-level
│   ├── control_unit.v          # Instruction decoder / control logic
│   ├── regfile.v               # 32×32 register file
│   ├── imem.v                  # Instruction memory
│   ├── immu.v                  # Instruction memory management unit
│   └── decode.vh               # Decode macros / opcode definitions
├── tb/                         # Testbenches
│   ├── tb_alu.v                # Standalone ALU testbench
│   └── tb_simplerisc.v         # Full SimpleRISC core testbench
├── tools/
│   └── asm.py                  # SimpleRISC assembler (Python)
├── docs/
│   └── COA2_Design_Report.docx # Full design report
├── program.asm                 # Assembly test program
├── program.hex                 # Assembled hex for instruction memory
├── Makefile                    # Build and simulation targets
└── .gitignore

Getting Started

Prerequisites

Icarus Verilog (iverilog, vvp) for simulation
GTKWave for waveform viewing (optional)
Python 3.x for the assembler

Install on Ubuntu/Debian:

sudo apt install iverilog gtkwave python3

Running Simulation

# Compile and run full SimpleRISC simulation
make run

# Build only (no run)
make build

# View waveform (requires GTKWave)
make wave

# Clean build artifacts
make clean

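To run the standalone ALU testbench, compile it together with alu.v and the sub-module files it instantiates (this assumes alu.v does not `include them itself):

iverilog -g2012 -o alu_tb.vvp tb/tb_alu.v rtl/alu.v rtl/fast_adders.v rtl/multiplier.v rtl/encoder.v rtl/tree_reducer.v rtl/divider.v rtl/shifter.v rtl/slt.v && vvp alu_tb.vvp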

Assembling a Program

Use the provided Python assembler to convert .asm → .hex:

python3 tools/asm.py program.asm program.hex

The resulting program.hex is loaded into instruction memory by the testbench.

Test Cases

The program.asm file includes the following test operations, covering all major ALU paths:

Instruction	Operation	Expected Behaviour
`div r4, r1, r2`	DIV	`-103 / 10 = -10` (signed quotient)
`mod r5, r1, r2`	MOD	`-103 mod 10 = -3` (signed remainder)
`mul r6, r1, r2`	MUL	`-103 × 10 = -1030`
`add r7, r1, r2`	ADD	`-103 + 10 = -93`
`sub r8, r1, r2`	SUB	`-103 - 10 = -113`
`asr r9, r3, #4`	SRA	`-1 >> 4 = -1` (arithmetic, sign-extended)
`lsr r10, r3, #4`	SRL	`-1 (0xFFFFFFFF) >> 4 = 0x0FFFFFFF` (logical, zero-filled)
`and r11, r1, r2`	AND	Bitwise AND of -103 and 10
`or r12, r1, r2`	OR	Bitwise OR of -103 and 10
`not r13, r1`	NOT	Bitwise NOT of -103

Additional custom test cases to add (minimum 5 required per assignment spec):

SLT with equal operands: slt r0, r5, r5 → expect r0 = 0
SLT with negative < positive: slt r0, r1, r2 (−103 < 10) → expect r0 = 1
XOR self-clear: xor r0, r3, r3 → expect r0 = 0, zero = 1
MUL overflow check: multiply two large values, verify lower 32 bits only
DIV by zero: divisor = 0, expect quotient = 0 (safe default behaviour)
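
For the MUL overflow case, expected values can be computed ahead of time. The sketch below is a hypothetical helper, and the operand pairs in the usage note are illustrative choices, not values taken from program.asm.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def expected_mul(a, b):
    """Return (low32, zero) that the ALU should report for MUL a, b."""
    full = to_signed(a) * to_signed(b)   # exact signed product
    low32 = full & MASK32                # ALU keeps only the lower 32 bits
    return low32, low32 == 0
```

With a = 0x7FFFFFFF and b = 2 the helper returns 0xFFFFFFFE, which reads as -2 when interpreted as signed, so the truncation is visible in the register. With a = b = 0x00010000 the true product is 2^32, the lower 32 bits are 0, and the zero flag should be asserted.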

Design Trade-offs

Design Choice	Benefit	Cost
Kogge-Stone adder (O(log N) depth)	Short carry path helps meet 250 MHz	Higher fan-out, more wiring
Radix-4 Booth encoding	Halves partial products (32→16)	Encoder logic overhead
Dadda tree vs Wallace tree	Slightly lower gate count	Marginally more complex routing
Fully unrolled combinational divider	Single-cycle latency	Large area; long critical path for DIV/MOD
Parallel instantiation of all units	Clean mux-select structure; synthesizer can trim logic whose outputs are unused	All units toggle on every operation (higher dynamic power)

The divider has the longest critical path of any functional unit. If 250 MHz is not met post-synthesis, consider pipelining the divider (2–4 stages), which makes DIV and MOD multi-cycle operations, while keeping all other units single-cycle.

Computer Organization and Architecture — Assignment 2
