A fully synthesizable, high-performance Arithmetic Logic Unit (ALU) designed for the SimpleRISC single-cycle processor core, implemented in Verilog RTL. The design targets a 250 MHz clock frequency and supports all arithmetic, logical, shift, and comparison operations defined by the SimpleRISC ISA.
- Overview
- ALU Operation Set
- Architecture
- Project Structure
- Getting Started
- Test Cases
- Design Trade-offs
This project replaces the placeholder ALU in the provided SimpleRISC RTL core with a performance-optimized implementation. All functional units are combinational and run in parallel, with a final MUX selecting the result based on the 4-bit `op` control signal. Because no unit waits on another, the critical path is the slowest single unit plus the output MUX, and that path is what has to fit within the 250 MHz timing target.
Inputs:
- `a[31:0]`: Operand A
- `b[31:0]`: Operand B
- `op[3:0]`: Operation select

Outputs:
- `y[31:0]`: Result
- `zero`: Flag, asserted when `y == 0`
| Opcode | Mnemonic | Description |
|---|---|---|
| `0000` | ADD | Signed/unsigned addition |
| `0001` | SUB | Signed/unsigned subtraction |
| `0010` | AND | Bitwise AND |
| `0011` | OR | Bitwise OR |
| `0100` | XOR | Bitwise XOR |
| `0101` | SLT | Set to 1 if A < B (signed), else 0 |
| `0110` | SLL | Logical shift left |
| `0111` | SRL | Logical shift right |
| `1000` | SRA | Arithmetic shift right |
| `1001` | PASS | Pass operand B through |
| `1010` | NOT | Bitwise NOT of operand B |
| `1011` | MUL | Signed 32×32 multiplication |
| `1100` | DIV | Signed division (quotient) |
| `1101` | MOD | Signed modulus (remainder) |
All functional units are instantiated simultaneously and compute results in parallel. The top-level ALU mux selects among them based on op.
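A minimal sketch of this compute-everything-then-select structure follows. It is not the project's `alu.v`: the module name `alu_sketch` is hypothetical, behavioural operators stand in for the dedicated units, the shift amount is assumed to come from `b[4:0]`, and MUL/DIV/MOD are left out for brevity. Port names and opcodes follow the tables above.

```verilog
// Hypothetical sketch, not the project's alu.v.
module alu_sketch (
    input  wire [31:0] a,     // Operand A
    input  wire [31:0] b,     // Operand B
    input  wire [3:0]  op,    // Operation select
    output reg  [31:0] y,     // Result
    output wire        zero   // Asserted when y == 0
);
    // Every candidate result is computed unconditionally, in parallel.
    wire [31:0] add_y = a + b;
    wire [31:0] sub_y = a - b;
    wire [31:0] slt_y = {31'd0, $signed(a) < $signed(b)};
    wire [31:0] sll_y = a << b[4:0];            // assumption: shift amount is b[4:0]
    wire [31:0] srl_y = a >> b[4:0];
    wire [31:0] sra_y = $signed(a) >>> b[4:0];  // signed left operand gives sign fill

    // The output mux only selects; it does not gate any computation.
    always @(*) begin
        case (op)
            4'b0000: y = add_y;
            4'b0001: y = sub_y;
            4'b0010: y = a & b;
            4'b0011: y = a | b;
            4'b0100: y = a ^ b;
            4'b0101: y = slt_y;
            4'b0110: y = sll_y;
            4'b0111: y = srl_y;
            4'b1000: y = sra_y;
            4'b1001: y = b;       // PASS
            4'b1010: y = ~b;      // NOT
            default: y = 32'd0;   // MUL/DIV/MOD omitted in this sketch
        endcase
    end

    assign zero = (y == 32'd0);
endmodule
```

Keeping each result on its own named wire mirrors the one-unit-per-operation layout described here, so a behavioural stand-in could later be swapped for a dedicated module without touching the mux.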
File: rtl/fast_adders.v
The add/subtract unit uses a 32-bit Kogge-Stone adder (fast_ksa32). This is a parallel prefix adder that computes all carry signals in O(log₂N) logic stages, yielding minimal critical path delay compared to a ripple-carry or carry-lookahead design.
- Subtraction is performed via 2's complement: `b` is inverted and `cin=1` is asserted when `op == SUB`.
- A 64-bit variant (`fast_ksa64`) is also included for use inside the multiplier's final accumulation stage.
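The prefix structure and the subtract trick can be sketched as below. This is an illustrative Kogge-Stone adder under assumed names (`ksa32_sketch`, `sub`), not the contents of `rtl/fast_adders.v`; the carry-in is folded in after the prefix tree, which is one of several possible ways to handle it.

```verilog
// Hypothetical sketch of a 32-bit Kogge-Stone adder/subtractor.
module ksa32_sketch (
    input  wire [31:0] a,
    input  wire [31:0] b,
    input  wire        sub,   // 1 = compute a - b, 0 = compute a + b
    output wire [31:0] sum,
    output wire        cout
);
    // Two's-complement subtraction: invert b and inject a carry-in of 1.
    wire [31:0] bx  = b ^ {32{sub}};
    wire        cin = sub;

    // g[k][i] / p[k][i]: group generate / propagate after prefix level k.
    wire [31:0] g [0:5];
    wire [31:0] p [0:5];
    assign g[0] = a & bx;
    assign p[0] = a ^ bx;

    // Levels 1..5: each bit combines with the bit 2^(k-1) positions below it.
    genvar k, i;
    generate
        for (k = 1; k <= 5; k = k + 1) begin : level
            for (i = 0; i < 32; i = i + 1) begin : bit_pos
                if (i >= (1 << (k-1))) begin : combine
                    assign g[k][i] = g[k-1][i] | (p[k-1][i] & g[k-1][i - (1 << (k-1))]);
                    assign p[k][i] = p[k-1][i] & p[k-1][i - (1 << (k-1))];
                end else begin : pass
                    assign g[k][i] = g[k-1][i];
                    assign p[k][i] = p[k-1][i];
                end
            end
        end
    endgenerate

    // After level 5, g[5][i] and p[5][i] span bits i..0, so the carry into
    // bit i+1 is g[5][i] | (p[5][i] & cin).
    wire [32:0] c;
    assign c[0]    = cin;
    assign c[32:1] = g[5] | (p[5] & {32{cin}});

    assign sum  = p[0] ^ c[31:0];
    assign cout = c[32];
endmodule
```

Five prefix levels cover all 32 bit positions because each level doubles the span of every group (1, 2, 4, 8, 16, 32).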
Files: rtl/multiplier.v, rtl/encoder.v, rtl/tree_reducer.v
The multiplier (`comb_mult32x32`) is built from three cascaded stages of combinational logic (no registers between them):
- Radix-4 Booth Encoding (`radix4_booth_encoder`): Encodes the 32-bit multiplier into 16 partial products (each 64 bits wide). This halves the number of partial products compared to generating one per multiplier bit, reducing tree height.
- Dadda Tree Reduction (`dadda_tree_reducer`): Compresses the 16 partial products into two 64-bit operands (sum + carry) using cascaded 3:2 carry-save adders (CSAs). CSAs eliminate carry propagation during compression.
- Final Kogge-Stone Addition (`kogge_stone_adder_64bit`): Adds the two 64-bit outputs from the Dadda tree using the fast KS adder, producing the 64-bit signed product. The lower 32 bits are returned as the ALU result.
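The Booth recoding step can be sketched as follows. The module name `booth_mult_sketch` and its internal signal names are hypothetical, and a plain chain of `+` operators stands in for the Dadda tree and final Kogge-Stone adder, so only the partial-product generation reflects the technique described above.

```verilog
// Hypothetical sketch: radix-4 Booth partial products, summed behaviourally.
module booth_mult_sketch (
    input  wire [31:0] a,        // multiplicand (signed)
    input  wire [31:0] b,        // multiplier (signed)
    output wire [63:0] product   // signed 64-bit product
);
    wire [32:0] b_ext = {b, 1'b0};           // append the implicit b[-1] = 0
    wire [63:0] a_se  = {{32{a[31]}}, a};    // multiplicand sign-extended to 64 bits

    // acc[i] holds the sum of partial products 0..i-1 (stand-in for the
    // Dadda tree plus final adder).
    wire [63:0] acc [0:16];
    assign acc[0] = 64'd0;

    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : booth
            // Overlapping 3-bit window {b[2i+1], b[2i], b[2i-1]}.
            wire [2:0] win = b_ext[2*i+2 : 2*i];
            reg  [63:0] sel;
            always @(*) begin
                case (win)
                    3'b001, 3'b010: sel = a_se;          // +1 * a
                    3'b011:         sel = a_se << 1;     // +2 * a
                    3'b100:         sel = -(a_se << 1);  // -2 * a
                    3'b101, 3'b110: sel = -a_se;         // -1 * a
                    default:        sel = 64'd0;         // 000 and 111 select 0
                endcase
            end
            // Window i carries weight 4^i, hence the shift by 2*i.
            assign acc[i+1] = acc[i] + (sel << (2*i));
        end
    endgenerate

    assign product = acc[16];
endmodule
```

Each window selects one of five multiples of the multiplicand (0, ±1, ±2), which is why 16 windows are enough to cover a 32-bit signed multiplier.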
File: rtl/divider.v
The divider (signed_divider32) uses a non-restoring division algorithm operating on absolute values:
- Sign bits are extracted and the result sign is computed (`num_sign XOR den_sign`).
- The non-restoring loop iterates 32 times, conditionally adding or subtracting the divisor based on the sign of the running remainder.
- A correction step restores the remainder if it ends negative.
- Final sign correction is applied to both quotient and remainder.
- Division by zero returns `0` for both outputs.
Note: The iterative loop unrolls fully in combinational synthesis, making this a single-cycle operation at the cost of area.
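The steps above can be sketched as a single combinational block. This is an illustration rather than the project's `signed_divider32`: the module name `nr_div_sketch`, the port names, and the 34-bit partial-remainder width are assumptions made for the example.

```verilog
// Hypothetical sketch of signed non-restoring division on magnitudes.
module nr_div_sketch (
    input  wire [31:0] num,   // signed dividend
    input  wire [31:0] den,   // signed divisor
    output reg  [31:0] quo,   // signed quotient, truncated toward zero
    output reg  [31:0] rem    // signed remainder, takes the dividend's sign
);
    reg [31:0] n_mag, d_mag, q;
    reg [33:0] r;             // two's-complement partial remainder; r[33] is its sign
    integer i;

    always @(*) begin
        // Work on absolute values; signs are re-applied at the end.
        n_mag = num[31] ? -num : num;
        d_mag = den[31] ? -den : den;
        q = n_mag;
        r = 34'd0;

        // The loop has a constant bound, so synthesis unrolls it into 32 stages.
        for (i = 0; i < 32; i = i + 1) begin
            // Shift in the next dividend bit, then subtract the divisor if the
            // remainder was non-negative, or add it if the remainder was negative.
            if (r[33]) r = {r[32:0], q[31]} + {2'b00, d_mag};
            else       r = {r[32:0], q[31]} - {2'b00, d_mag};
            // The new quotient bit is 1 when the updated remainder is non-negative.
            q = {q[30:0], ~r[33]};
        end

        // Correction step: restore a negative final remainder.
        if (r[33]) r = r + {2'b00, d_mag};

        if (den == 32'd0) begin
            // Division by zero returns 0 for both outputs.
            quo = 32'd0;
            rem = 32'd0;
        end else begin
            quo = (num[31] ^ den[31]) ? -q : q;
            rem = num[31] ? -r[31:0] : r[31:0];
        end
    end
endmodule
```

With `num = -103` and `den = 10`, the magnitudes divide as 103 / 10 = 10 remainder 3, and the sign correction yields a quotient of -10 and a remainder of -3.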
File: rtl/shifter.v
The barrel shifter (shifter32_opt) supports all three shift modes via a single unified 5-stage mux chain:
- SLL (Logical Left Shift): Input bits are reversed before the shift stages, then reversed again at the output. This turns the right-shift network into a left shifter without duplicating hardware.
- SRL (Logical Right Shift): Standard right shift, filling with `0`.
- SRA (Arithmetic Right Shift): Right shift filling with the sign bit (`input[31]`).
Each of the 5 stages conditionally shifts by 1, 2, 4, 8, or 16 positions based on each bit of shift_amt[4:0].
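A sketch of the reverse-shift-reverse arrangement is given below. The module name `shifter_sketch`, the port names, and the 2-bit `mode` encoding are assumptions for illustration; the actual interface of `shifter32_opt` may differ.

```verilog
// Hypothetical sketch of a unified 5-stage barrel shifter.
module shifter_sketch (
    input  wire [31:0] data_in,
    input  wire [4:0]  shift_amt,
    input  wire [1:0]  mode,       // assumed encoding: 00 = SLL, 01 = SRL, 10 = SRA
    output wire [31:0] data_out
);
    // Bit reversal is pure wiring: no logic gates are needed.
    function [31:0] bit_reverse(input [31:0] v);
        integer i;
        begin
            for (i = 0; i < 32; i = i + 1)
                bit_reverse[i] = v[31 - i];
        end
    endfunction

    wire is_left = (mode == 2'b00);
    wire fill    = (mode == 2'b10) & data_in[31];  // sign fill for SRA, else 0

    // A left shift is performed as a right shift on the reversed word.
    wire [31:0] s0 = is_left ? bit_reverse(data_in) : data_in;

    // One mux stage per bit of shift_amt: shift right by 1, 2, 4, 8, 16.
    wire [31:0] s1 = shift_amt[0] ? {{1{fill}},  s0[31:1]}  : s0;
    wire [31:0] s2 = shift_amt[1] ? {{2{fill}},  s1[31:2]}  : s1;
    wire [31:0] s3 = shift_amt[2] ? {{4{fill}},  s2[31:4]}  : s2;
    wire [31:0] s4 = shift_amt[3] ? {{8{fill}},  s3[31:8]}  : s3;
    wire [31:0] s5 = shift_amt[4] ? {{16{fill}}, s4[31:16]} : s4;

    assign data_out = is_left ? bit_reverse(s5) : s5;
endmodule
```

Because `fill` is forced to 0 for SLL, the zeros shifted in at the top of the reversed word land at the bottom of the result after the second reversal.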
File: rtl/slt.v
The Set-Less-Than unit (slt32_opt) computes signed comparison by subtracting B from A using a Kogge-Stone adder and examining the result sign bit, with overflow detection to handle mixed-sign edge cases correctly:
    overflow = (sign_a != sign_b) AND (sign_diff == sign_b)
    result   = sign_diff XOR overflow
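The two equations map directly onto a few gates. In the sketch below, the module name `slt_sketch` is hypothetical and a behavioural `a - b` stands in for the Kogge-Stone subtractor.

```verilog
// Hypothetical sketch of the signed set-less-than logic.
module slt_sketch (
    input  wire [31:0] a,
    input  wire [31:0] b,
    output wire [31:0] y      // 1 if a < b (signed), else 0
);
    wire [31:0] diff = a - b;   // stand-in for the Kogge-Stone subtraction

    wire sign_a    = a[31];
    wire sign_b    = b[31];
    wire sign_diff = diff[31];

    // Overflow can only occur when the operand signs differ, and shows up as
    // the difference taking the sign of b instead of the sign of a.
    wire overflow = (sign_a != sign_b) & (sign_diff == sign_b);

    // On overflow the sign of the difference is wrong, so flip it.
    wire lt = sign_diff ^ overflow;

    assign y = {31'd0, lt};
endmodule
```

For example, with `a = 0x80000000` (the most negative value) and `b = 1`, the difference is `0x7FFFFFFF`, whose sign bit is 0. The operand signs differ and the difference matches the sign of `b`, so `overflow = 1` and the result is correctly reported as 1.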
    SimpleRISC-ALU/
    ├── rtl/                          # Synthesizable RTL source files
    │   ├── alu.v                     # Top-level ALU (integrates all units)
    │   ├── fast_adders.v             # 32-bit and 64-bit Kogge-Stone adders + CSA
    │   ├── multiplier.v              # 32×32 Booth multiplier top-level
    │   ├── encoder.v                 # Radix-4 Booth encoder (16 partial products)
    │   ├── tree_reducer.v            # Dadda tree partial product reducer
    │   ├── divider.v                 # 32-bit signed non-restoring divider
    │   ├── shifter.v                 # 32-bit barrel shifter (SLL/SRL/SRA)
    │   ├── slt.v                     # Signed set-less-than comparator
    │   ├── simplerisc_top.v          # SimpleRISC processor top-level
    │   ├── control_unit.v            # Instruction decoder / control logic
    │   ├── regfile.v                 # 32×32 register file
    │   ├── imem.v                    # Instruction memory
    │   ├── immu.v                    # Instruction memory management unit
    │   └── decode.vh                 # Decode macros / opcode definitions
    ├── tb/                           # Testbenches
    │   ├── tb_alu.v                  # Standalone ALU testbench
    │   └── tb_simplerisc.v           # Full SimpleRISC core testbench
    ├── tools/
    │   └── asm.py                    # SimpleRISC assembler (Python)
    ├── docs/
    │   └── COA2_Design_Report.docx   # Full design report
    ├── program.asm                   # Assembly test program
    ├── program.hex                   # Assembled hex for instruction memory
    ├── Makefile                      # Build and simulation targets
    └── .gitignore
- Icarus Verilog (`iverilog`, `vvp`) for simulation
- GTKWave for waveform viewing (optional)
- Python 3.x for the assembler

Install on Ubuntu/Debian:

    sudo apt install iverilog gtkwave python3

Build and simulation targets:

    # Compile and run full SimpleRISC simulation
    make run
    # Build only (no run)
    make build
    # View waveform (requires GTKWave)
    make wave
    # Clean build artifacts
    make clean

To run the standalone ALU testbench:

    iverilog -g2012 -o alu_tb.vvp tb/tb_alu.v rtl/alu.v && vvp alu_tb.vvp

Use the provided Python assembler to convert `.asm` → `.hex`:

    python3 tools/asm.py program.asm program.hex

The resulting `program.hex` is loaded into instruction memory by the testbench.
The program.asm file includes the following test operations, covering all major ALU paths:
| Instruction | Operation | Expected Behaviour |
|---|---|---|
| `div r4, r1, r2` | DIV | -103 / 10 = -10 (signed quotient) |
| `mod r5, r1, r2` | MOD | -103 mod 10 = -3 (signed remainder) |
| `mul r6, r1, r2` | MUL | -103 × 10 = -1030 |
| `add r7, r1, r2` | ADD | -103 + 10 = -93 |
| `sub r8, r1, r2` | SUB | -103 - 10 = -113 |
| `asr r9, r3, #4` | SRA | -1 >> 4 = -1 (arithmetic, sign-extended) |
| `lsr r10, r3, #4` | SRL | -1 (0xFFFFFFFF) >> 4 = 0x0FFFFFFF |
| `and r11, r1, r2` | AND | Bitwise AND of -103 and 10 |
| `or r12, r1, r2` | OR | Bitwise OR of -103 and 10 |
| `not r13, r1` | NOT | Bitwise NOT of -103 |
Additional custom test cases to add (minimum 5 required per assignment spec):
- SLT with equal operands: `slt r0, r5, r5` → expect `r0 = 0`
- SLT with negative < positive: `slt r0, r1, r2` (−103 < 10) → expect `r0 = 1`
- XOR self-clear: `xor r0, r3, r3` → expect `r0 = 0`, `zero = 1`
- MUL overflow check: multiply two large values, verify lower 32 bits only
- DIV by zero: divisor = 0, expect quotient = 0 (safe default behaviour)
| Design Choice | Benefit | Cost |
|---|---|---|
| Kogge-Stone adder (O(log N) depth) | Meets 250 MHz; minimal carry chain delay | Higher fan-out, more wiring |
| Radix-4 Booth encoding | Halves partial products (32→16) | Encoder logic overhead |
| Dadda tree vs Wallace tree | Slightly lower gate count | Marginally more complex routing |
| Fully unrolled combinational divider | Single-cycle latency | Large area; long critical path for DIV/MOD |
| Parallel instantiation of all units | Clean mux-select structure; synthesizer optimizes unused paths | All units switch on every operation, so dynamic power is higher |
The divider has the longest critical path of any functional unit. If 250 MHz is not met post-synthesis, consider pipelining the divider (2–4 stages) while keeping all other units single-cycle.
Computer Organization and Architecture — Assignment 2