devanshjoshi08/RISCV-RV32IM-Processor

RISC-V RV32IM Pipelined Processor

A 6-stage pipelined RISC-V processor implementing the RV32IM instruction set in SystemVerilog, deployed on a Digilent Basys 3 FPGA (Xilinx Artix-7 XC7A35T) with clean timing closure at 100 MHz.

The processor supports all 48 RV32IM instructions with a pipelined hardware multiplier and iterative divider, M-mode privileged architecture (CSR access, trap handling, MRET), a gshare branch predictor with branch target buffer and return address stack, a direct-mapped instruction cache, 3-source data forwarding, and 64-bit hardware performance counters for cycle-accurate IPC measurement. It runs bare-metal C programs compiled with a standard RISC-V GCC toolchain, communicating over UART at 115200 baud with LED output on the FPGA. The design is validated through a 24-point comprehensive test suite, the 37-test riscv-tests ISA compliance suite, and hardware deployment.

This project was built independently, outside of any course requirement.

Author: Devansh Joshi

Synthesis Results (Artix-7 XC7A35T, Vivado 2025.2)

| Metric | Value |
|---|---|
| Clock | 100 MHz constraint, timing met with WNS = +0.135 ns (critical path = 9.87 ns → f_max ≈ 101.4 MHz) |
| Slice LUTs | 6,805 / 20,800 (33%) |
| Slice Registers | 8,412 / 41,600 (20%) |
| DSP48E1 | 12 (pipelined 32x32 → 64-bit multiplier) |
| LUTRAMs | 512 (4 KB data memory) |
| BRAM | 0 |
| Critical path | Gshare PHT read → PC next mux (7 logic levels, 9.87 ns) |
| Verification | 24/24 comprehensive test, 37/37 riscv-tests compliance |
| FPGA validation | Fibonacci demo running on hardware, UART + LEDs |

Design Evolution: 5-Stage → 6-Stage Pipeline

I started with the standard Patterson & Hennessy 5-stage pipeline (IF/ID/EX/MEM/WB). With just the base RV32I instruction set, the design ran at approximately 91 MHz on Artix-7, with a WNS of -0.938 ns against the 100 MHz target.

That changed when I extended the processor with the M extension, M-mode privileged CSRs, trap handling, and a gshare branch predictor. All of these features converge in the execute stage: the forwarding unit compares source addresses against three pipeline stages and selects through a multi-level mux, which then feeds into the ALU carry chain, branch comparator, CSR read path, and result selection logic, all within a single cycle. After place-and-route, this path measured 15.5 ns across 19 logic levels, failing timing with a WNS of -5.539 ns and capping the design at roughly 64 MHz.

A design that tops out at 64 MHz against a 100 MHz target runs 36% below its intended clock rate. Worse, shipping a design that doesn't meet its own timing constraint risks functional failures under PVT variation on real hardware.

To fix this, I split the execute stage into two stages:

  • EX1 (Forwarding + Operand Select): 3-source forwarding comparison and mux, ALU operand selection (rs1/PC, rs2/immediate), CSR write-data preparation. Output registered into the EX1/EX2 pipeline register.
  • EX2 (ALU + Branch + CSR + MDU): registered operands feed directly into the ALU, branch unit, MDU, and CSR unit with no preceding combinational logic.

This reduced the critical path from 19 logic levels to 7, closing timing at 100 MHz with +0.135 ns of slack. The tradeoff is that mispredictions now flush 3 stages instead of 2, adding one cycle of branch latency. But a pipelined processor's performance is determined by throughput, not single-instruction latency: throughput = IPC × frequency, and the 58% frequency gain (64.3 → 101.4 MHz) far exceeds the minor IPC reduction from the occasional extra flush cycle. The gshare predictor with BTB and RAS further limits the IPC impact by predicting both direction and target in the fetch stage, so the deeper pipeline rarely pays the full 3-cycle penalty in practice.

The simulation numbers from the comprehensive test suite (which includes 8 multi-cycle divides at 33 cycles each, making it a worst-case workload for IPC) confirm this:

| Metric | 5-stage @ 64 MHz | 6-stage @ 101 MHz |
|---|---|---|
| Frequency | 64.3 MHz | 101.4 MHz |
| Cycles (comprehensive test) | 329 | 342 |
| Instructions retired | 99 | 87 |
| IPC (measured, divide-heavy) | 0.30 | 0.25 |
| Throughput (MIPS) | 19.3 | 25.4 |
| Throughput improvement | | +31% |

The IPC dropped from 0.30 to 0.25 (17%) due to the deeper pipeline and additional stall cycles around multi-cycle operations, but the frequency increased by 58%, yielding a net throughput gain of 31% even on a divide-heavy workload. On integer-only code without divides, the IPC difference between the two designs is much smaller (the extra pipeline stage only costs cycles on mispredictions and load-use hazards), and the throughput advantage of the 6-stage design would be closer to the full 58%.
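The throughput row follows from throughput = IPC × frequency. A minimal sketch of that arithmetic in C is shown here; the function names are illustrative and not part of the repository. The table's MIPS values appear to use the IPC rounded to two decimals, so the sketch takes IPC as an input rather than recomputing it from the raw counts.

```c
#include <stdint.h>

/* IPC = instructions retired / cycles elapsed. */
static double ipc(uint32_t instructions, uint32_t cycles) {
    return (double)instructions / (double)cycles;
}

/* Throughput in MIPS = IPC x clock frequency in MHz. */
static double mips_from_ipc(double ipc_value, double freq_mhz) {
    return ipc_value * freq_mhz;
}

/* Relative gain of a new throughput over a baseline, in percent. */
static double gain_percent(double baseline, double improved) {
    return (improved / baseline - 1.0) * 100.0;
}
```

With the rounded table values, mips_from_ipc(0.30, 64.3) is about 19.3 and mips_from_ipc(0.25, 101.4) is about 25.4, a gain of roughly 31%. Using the unrounded counts (87 / 342 ≈ 0.254) gives a slightly higher figure for the 6-stage design.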

Pipeline Architecture

[Figure: 6-stage pipeline diagram]

IF (Instruction Fetch)

The PC feeds a 64-line direct-mapped instruction cache. Hits return combinationally; misses pull from the backing ROM and fill the line in one cycle. A three-component branch predictor runs in parallel:

  • Gshare PHT (64 entries, 2-bit saturating counters) predicts direction by indexing with PC[7:2] XOR a 6-bit global history register
  • BTB (32 entries, direct-mapped) supplies the predicted target address, storing tag + target + entry type (branch/JAL/call/return)
  • RAS (4-entry circular stack) predicts return targets for JALR instructions

If the BTB hits and the PHT says taken (or the entry is a JAL/call/return), the PC redirects speculatively to the predicted target. Mispredictions are detected in EX2 and cause a 3-cycle flush.
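The gshare direction predictor can be modeled in a few lines of C. This is a behavioral sketch, not the RTL in branch_predictor.sv: the counter encoding (2 and 3 mean taken), the history shift direction, and training at resolve time are assumptions, and all names are hypothetical.

```c
#include <stdint.h>

#define PHT_ENTRIES 64u   /* 64 two-bit counters */
#define GHR_MASK    0x3Fu /* 6-bit global history */

typedef struct {
    uint8_t pht[PHT_ENTRIES]; /* each entry holds 0..3 */
    uint8_t ghr;              /* global history register, low 6 bits used */
} gshare_t;

/* Index = PC[7:2] XOR global history. */
static uint32_t gshare_index(const gshare_t *g, uint32_t pc) {
    return ((pc >> 2) & GHR_MASK) ^ (g->ghr & GHR_MASK);
}

/* Assumed encoding: counter values 2 and 3 predict taken. */
static int gshare_predict(const gshare_t *g, uint32_t pc) {
    return g->pht[gshare_index(g, pc)] >= 2;
}

/* Train on the resolved outcome: saturating update, then shift the history. */
static void gshare_update(gshare_t *g, uint32_t pc, int taken) {
    uint8_t *ctr = &g->pht[gshare_index(g, pc)];
    if (taken) {
        if (*ctr < 3) (*ctr)++;
    } else {
        if (*ctr > 0) (*ctr)--;
    }
    g->ghr = (uint8_t)(((g->ghr << 1) | (taken ? 1u : 0u)) & GHR_MASK);
}
```

Because the index mixes in the history, the same branch PC can map to different counters depending on the recent outcome pattern, which is what lets gshare learn correlated branches.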

ID (Instruction Decode)

The instruction is decoded into control signals for all RV32IM + SYSTEM instructions. The immediate generator reassembles sign-extended immediates from all 5 RISC-V formats (I/S/B/U/J). The register file (32x32 with write-through bypass from WB) reads both source operands.
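As a reference for what the immediate generator computes, here is a C sketch of the five immediate formats using the bit layouts from the RISC-V base ISA. It is a software model for checking encodings, not a translation of imm_gen.sv, and the function names are made up for this example.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` to 32 bits. */
static int32_t sext(uint32_t value, unsigned bits) {
    uint32_t sign = 1u << (bits - 1);
    value &= (sign << 1) - 1u;
    return (int32_t)((value ^ sign) - sign);
}

/* I-type: imm[11:0] = insn[31:20] */
static int32_t imm_i(uint32_t insn) { return sext(insn >> 20, 12); }

/* S-type: imm[11:5] = insn[31:25], imm[4:0] = insn[11:7] */
static int32_t imm_s(uint32_t insn) {
    return sext(((insn >> 25) << 5) | ((insn >> 7) & 0x1Fu), 12);
}

/* B-type: imm[12|10:5] = insn[31|30:25], imm[4:1|11] = insn[11:8|7] */
static int32_t imm_b(uint32_t insn) {
    uint32_t v = ((insn >> 31) << 12) | (((insn >> 7) & 1u) << 11) |
                 (((insn >> 25) & 0x3Fu) << 5) | (((insn >> 8) & 0xFu) << 1);
    return sext(v, 13);
}

/* U-type: imm[31:12] = insn[31:12], low 12 bits are zero */
static int32_t imm_u(uint32_t insn) { return (int32_t)(insn & 0xFFFFF000u); }

/* J-type: imm[20|10:1|11|19:12] = insn[31|30:21|20|19:12] */
static int32_t imm_j(uint32_t insn) {
    uint32_t v = ((insn >> 31) << 20) | (((insn >> 12) & 0xFFu) << 12) |
                 (((insn >> 20) & 1u) << 11) | (((insn >> 21) & 0x3FFu) << 1);
    return sext(v, 21);
}
```

For example, 0xFE000EE3 (beq x0, x0, -4) decodes to -4 with imm_b, and 0x008000EF (jal x1, 8) decodes to 8 with imm_j.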

Call/return heuristics detect JAL/JALR instructions targeting link registers (x1 or x5) and push the return address onto the RAS.

EX1 (Forwarding + Operand Select)

Three forwarding sources resolve RAW data hazards without stalling:

| Source | Distance | Data |
|---|---|---|
| EX2 | 1 instruction ahead | ALU result, LUI immediate, or CSR read (combinational, gated by MDU valid) |
| MEM | 2 instructions ahead | ALU result, CSR data, PC+4, or load data from dmem |
| WB | 3 instructions ahead | Final write-back value |

Priority: EX2 > MEM > WB (most recent result wins). Loads and in-progress MDU operations in EX2 are excluded from forwarding; the hazard unit stalls instead.
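The priority rule can be written as a small selection function. The C sketch below is a behavioral model only: the struct layout, the `ready` flag standing in for "not a load and not an in-progress MDU operation", and the x0 check are assumptions made for illustration, not signals taken from forwarding_unit.sv.

```c
#include <stdint.h>

typedef struct {
    int      writes_rd; /* stage holds an instruction that writes a register */
    int      ready;     /* result can be forwarded this cycle (only used for EX2) */
    uint8_t  rd;        /* destination register index */
    uint32_t value;     /* result value held in that stage */
} stage_result_t;

typedef enum { FWD_NONE, FWD_EX2, FWD_MEM, FWD_WB } fwd_src_t;

/* A stage is a candidate if it writes the register being read (never x0). */
static int stage_matches(const stage_result_t *s, uint8_t rs) {
    return s->writes_rd && s->rd != 0 && s->rd == rs;
}

/* Pick the most recent producer: EX2 first, then MEM, then WB.
 * If EX2 matches but is not ready, the real pipeline would stall instead. */
static uint32_t forward_select(uint8_t rs, uint32_t regfile_value,
                               const stage_result_t *ex2,
                               const stage_result_t *mem,
                               const stage_result_t *wb,
                               fwd_src_t *src) {
    if (stage_matches(ex2, rs) && ex2->ready) { *src = FWD_EX2; return ex2->value; }
    if (stage_matches(mem, rs))               { *src = FWD_MEM; return mem->value; }
    if (stage_matches(wb, rs))                { *src = FWD_WB;  return wb->value; }
    *src = FWD_NONE;
    return regfile_value;
}
```

When two stages hold a result for the same register, the younger one (closer to EX1) wins, which matches program order.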

The forwarding mux also has a fresh register file read path with WB bypass. This handles the edge case where an instruction is stalled in EX1 for many cycles (during a multi-cycle divide) and the pipelined register file value goes stale because the source register was written by an instruction that has since left WB. Without this path, the processor would silently read an outdated value after the stall releases. This took a while to find.

After forwarding, the ALU input muxes select between the forwarded rs1/PC (for AUIPC) and forwarded rs2/immediate (for I-type). The CSR write data is also prepared here (either the forwarded rs1 value or the zero-extended zimm field for immediate CSR variants).

EX2 (Execute)

The registered operands from EX1 feed directly into the ALU and branch unit with no preceding combinational logic, keeping this stage fast.

ALU: 10 operations (ADD, SUB, AND, OR, XOR, SLT, SLTU, SLL, SRL, SRA).

MDU (Multiply/Divide Unit): 2-cycle pipelined multiplier for MUL/MULH/MULHSU/MULHU (operands registered on cycle 1, multiplication on cycle 2; infers 12 DSP48E1 blocks on Artix-7). Iterative restoring divider for DIV/DIVU/REM/REMU that produces one quotient bit per cycle over 32 iterations and stalls the pipeline, for a total latency of 33 cycles. Handles all edge cases per the RISC-V spec: division by zero returns quotient = -1 and remainder = dividend; signed overflow (MIN_INT / -1) returns MIN_INT with remainder 0.

Branch unit: evaluates all 6 branch conditions (BEQ, BNE, BLT, BGE, BLTU, BGEU). Branch target = PC + immediate. JALR target = (rs1 + immediate) & ~1. Misprediction detection compares actual outcome against the prediction carried through the pipeline.
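The branch conditions and target calculations are simple enough to state directly in C. The sketch below uses the funct3 encodings from the RISC-V base ISA; the function names and the shape of the misprediction check are illustrative assumptions rather than the interface of branch_unit.sv.

```c
#include <stdint.h>

/* funct3 encodings of the six conditional branches. */
enum { F3_BEQ = 0, F3_BNE = 1, F3_BLT = 4, F3_BGE = 5, F3_BLTU = 6, F3_BGEU = 7 };

/* Returns 1 if the branch condition holds for the given operands. */
static int branch_taken(uint32_t funct3, uint32_t rs1, uint32_t rs2) {
    switch (funct3) {
    case F3_BEQ:  return rs1 == rs2;
    case F3_BNE:  return rs1 != rs2;
    case F3_BLT:  return (int32_t)rs1 <  (int32_t)rs2; /* signed compare */
    case F3_BGE:  return (int32_t)rs1 >= (int32_t)rs2;
    case F3_BLTU: return rs1 <  rs2;                   /* unsigned compare */
    case F3_BGEU: return rs1 >= rs2;
    default:      return 0;                            /* reserved encodings */
    }
}

/* Branch and JAL targets are PC-relative. */
static uint32_t branch_target(uint32_t pc, int32_t imm) {
    return pc + (uint32_t)imm;
}

/* JALR target clears bit 0 of rs1 + imm. */
static uint32_t jalr_target(uint32_t rs1, int32_t imm) {
    return (rs1 + (uint32_t)imm) & ~1u;
}

/* Mispredicted if the direction is wrong, or taken to the wrong target. */
static int mispredicted(int pred_taken, uint32_t pred_target,
                        int taken, uint32_t target) {
    if (pred_taken != taken) return 1;
    return taken && pred_target != target;
}
```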

CSR unit: atomic read-modify-write for all 6 CSR instructions. Holds the M-mode register file (mstatus, mie, mtvec, mscratch, mepc, mcause, mtval, mip) plus 64-bit performance counters (mcycle, minstret, and two custom HPM counters for branch statistics).

Trap detection: catches illegal instructions, ECALL, and EBREAK. On a trap: flush the pipeline (IF, ID, EX1), save PC to mepc, write cause to mcause, disable interrupts (copy MIE into MPIE, then clear MIE), and redirect PC to mtvec. MRET reverses the process: restore MIE from MPIE, redirect PC to mepc.
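The trap entry and MRET sequence can be sketched as state updates on a small CSR struct. This C model follows the RISC-V Privileged Specification for the mstatus bit positions (MIE is bit 3, MPIE is bit 7) and for setting MPIE on MRET; it assumes direct-mode mtvec, and it is not derived from csr_unit.sv, so the RTL may differ in details.

```c
#include <stdint.h>

#define MSTATUS_MIE  (1u << 3) /* machine interrupt enable */
#define MSTATUS_MPIE (1u << 7) /* previous interrupt enable */

typedef struct {
    uint32_t mstatus;
    uint32_t mtvec;
    uint32_t mepc;
    uint32_t mcause;
} csr_state_t;

/* Trap entry: save PC and cause, stash MIE in MPIE, clear MIE.
 * Returns the next PC (mtvec base, direct mode assumed). */
static uint32_t trap_enter(csr_state_t *c, uint32_t pc, uint32_t cause) {
    uint32_t mpie = (c->mstatus & MSTATUS_MIE) ? MSTATUS_MPIE : 0u;
    c->mepc = pc;
    c->mcause = cause;
    c->mstatus = (c->mstatus & ~(MSTATUS_MIE | MSTATUS_MPIE)) | mpie;
    return c->mtvec & ~3u;
}

/* MRET: restore MIE from MPIE, set MPIE (per the spec), resume at mepc. */
static uint32_t trap_return(csr_state_t *c) {
    uint32_t mie = (c->mstatus & MSTATUS_MPIE) ? MSTATUS_MIE : 0u;
    c->mstatus = (c->mstatus & ~MSTATUS_MIE) | mie | MSTATUS_MPIE;
    return c->mepc;
}
```

Note that mepc holds the address of the trapping instruction, so an ECALL handler has to advance mepc by 4 before MRET to avoid re-executing the ECALL.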

MEM (Memory Access)

The MMIO controller routes accesses by address:

| Address Range | Peripheral |
|---|---|
| 0x00000000 - 0x00000FFF | 4 KB data RAM (byte/half/word addressable with sign extension) |
| 0x10000000 | 16-bit LED output |
| 0x10000004 | 16-bit switch input |
| 0x10000008 | UART TX data (write a byte to transmit) |
| 0x1000000C | UART TX busy flag |

The UART runs at 115200 baud, 8N1, implemented as a shift-register transmitter with a busy-wait interface.
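From software, this map can be driven with a register struct and a busy-wait loop. The sketch below is a hypothetical driver, not the contents of the repository's mmio.h: the struct and function names are invented, and it assumes the busy flag reads as nonzero while a byte is being transmitted. The functions take a pointer so they can be exercised on a host as well as on the target.

```c
#include <stdint.h>

/* Register layout matching the offsets in the address map above. */
typedef struct {
    volatile uint32_t led;       /* 0x10000000: 16-bit LED output */
    volatile uint32_t sw;        /* 0x10000004: 16-bit switch input */
    volatile uint32_t uart_tx;   /* 0x10000008: write a byte to transmit */
    volatile uint32_t uart_busy; /* 0x1000000C: nonzero while busy (assumed) */
} mmio_regs_t;

#define MMIO_BASE 0x10000000u
/* On the target: mmio_regs_t *io = (mmio_regs_t *)MMIO_BASE; */

/* Wait for the transmitter to go idle, then send one byte. */
static void uart_putc(mmio_regs_t *io, char c) {
    while (io->uart_busy) {
        /* spin until the shift register is free */
    }
    io->uart_tx = (uint8_t)c;
}

/* Send a NUL-terminated string one byte at a time. */
static void uart_puts(mmio_regs_t *io, const char *s) {
    while (*s) {
        uart_putc(io, *s++);
    }
}

/* Drive the 16 LEDs with the low 16 bits of a value. */
static void led_write(mmio_regs_t *io, uint32_t value) {
    io->led = value & 0xFFFFu;
}
```

At 115200 baud with 8N1 framing, each byte occupies 10 bit times, so uart_putc spends most of its time in the busy-wait loop when printing back-to-back characters.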

WB (Write Back)

The write-back mux selects the final result for the register file:

| Condition | Source |
|---|---|
| JAL / JALR | PC + 4 (return address) |
| Load | Data memory read |
| CSR instruction | CSR read value |
| Everything else | ALU result (includes LUI) |

The register file write port includes a write-through bypass: if WB writes a register in the same cycle that ID reads it, the new value is forwarded directly, avoiding a stale read.

Hazard Handling

The hazard unit manages three types of pipeline hazards:

Load-use: if the instruction in EX2 is a load and the instruction in EX1 reads the same register, the data won't be available until after the memory read. The hazard unit stalls PC, IF/ID, and ID/EX1 for one cycle and flushes EX1/EX2, inserting a bubble. On the next cycle, the load data is available from MEM for forwarding.
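The load-use check reduces to a register comparison between EX2 and EX1. The following C sketch models the condition and the resulting stall and flush controls; the struct fields and names are illustrative assumptions, not the ports of hazard_unit.sv.

```c
#include <stdint.h>

typedef struct {
    int     is_load; /* instruction in EX2 is a load */
    uint8_t rd;      /* its destination register */
} ex2_info_t;

typedef struct {
    int     uses_rs1, uses_rs2; /* whether the EX1 instruction reads each source */
    uint8_t rs1, rs2;           /* source register indices */
} ex1_info_t;

typedef struct {
    int stall_pc, stall_if_id, stall_id_ex1; /* hold the front of the pipeline */
    int flush_ex1_ex2;                       /* insert a bubble into EX2 */
} hazard_ctrl_t;

/* Stall for one cycle when EX1 needs the result of a load still in EX2. */
static hazard_ctrl_t load_use_check(const ex2_info_t *ex2, const ex1_info_t *ex1) {
    int hit = ex2->is_load && ex2->rd != 0 &&
              ((ex1->uses_rs1 && ex1->rs1 == ex2->rd) ||
               (ex1->uses_rs2 && ex1->rs2 == ex2->rd));
    hazard_ctrl_t c = { hit, hit, hit, hit };
    return c;
}
```

The `uses_rs1` and `uses_rs2` qualifiers avoid false stalls on instructions such as LUI or JAL, whose encoded source fields are not real register reads.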

MDU stall: multiply takes 2 cycles, divide takes up to 33 cycles. While the MDU is computing, the entire pipeline from EX2 back is stalled (PC, IF/ID, ID/EX1, EX1/EX2 all hold). EX2/MEM receives bubbles (suppressed control signals) to prevent intermediate results from propagating. When the MDU signals valid, the stall releases and the result is forwarded.

Control hazards: branch mispredictions, JAL, JALR, traps, and MRET all flush the three stages behind EX2 (IF, ID, EX1). The gshare predictor reduces misprediction frequency; the BTB eliminates the target computation penalty for correctly-predicted branches.

Bugs Found and Fixed

Three non-obvious bugs came up during development that are worth documenting because they're the kind of thing that passes basic tests and only shows up under specific pipeline conditions.

Register file write-through bypass. Early in the 5-stage design, the third instruction in a dependent sequence would silently read a stale register value. Instructions A and B had forwarding coverage, but instruction C read from the register file in the same cycle that A's result was being written back. The register file wasn't bypassing the write data to the read port, so C got the old value. The fix was a combinational write-through: if WB is writing the same register that ID is reading on the same cycle, forward the write data directly. This bug never showed up in isolated instruction tests because it required three back-to-back dependencies with specific pipeline timing.
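The write-through fix can be modeled as a read function that checks the WB write port before falling back to the stored value. This C sketch is a behavioral model with hypothetical names, not the code in regfile.sv.

```c
#include <stdint.h>

typedef struct {
    uint32_t x[32]; /* architectural registers; x[0] is never written */
} regfile_t;

/* Combinational read in a cycle where WB may be writing the same register.
 * Without the middle check, the read returns the old (stale) value. */
static uint32_t regfile_read(const regfile_t *rf, uint8_t rs,
                             int wb_we, uint8_t wb_rd, uint32_t wb_data) {
    if (rs == 0) return 0;                       /* x0 is hardwired to zero */
    if (wb_we && wb_rd == rs) return wb_data;    /* write-through bypass */
    return rf->x[rs];
}

/* Clocked write at the end of the cycle. */
static void regfile_write(regfile_t *rf, int we, uint8_t rd, uint32_t data) {
    if (we && rd != 0) rf->x[rd] = data;
}
```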

Stale forwarding after multi-cycle stalls. After adding the 6-stage pipeline, the divider stall (up to 33 cycles) would cause an instruction waiting in EX1 to lose its forwarded operand. The value was correct on the first cycle of the stall (forwarded from WB), but by the time the stall released, the source instruction had long since left WB and the pipelined register file value in ID/EX1 was captured before the write happened. The fix was to add a fresh register file read path in EX1 with WB bypass, so the default (no-forward) case always gets the current value rather than the stale pipelined one.

MDU result forwarding timing. The pipelined multiplier takes 2 cycles, but the EX2→EX1 forwarding path was providing the MDU result combinationally on the first cycle, before the multiplication had completed. This meant an instruction immediately following a MUL would forward garbage. The fix was to gate the EX2 forwarding on mdu_valid: only forward from EX2 for M-extension instructions when the MDU has actually produced a result. When the result isn't ready, the hazard unit stalls instead.

M Extension

All 8 RV32M instructions are implemented in hardware:

| Instruction | Latency | Implementation |
|---|---|---|
| MUL | 2 cycles | Pipelined: register operands on cycle 1, DSP48 multiply on cycle 2 |
| MULH | 2 cycles | Signed x signed, upper 32 bits of 64-bit product |
| MULHSU | 2 cycles | Signed x unsigned, upper 32 bits |
| MULHU | 2 cycles | Unsigned x unsigned, upper 32 bits |
| DIV | 33 cycles | Iterative restoring divider, signed, pipeline stalls |
| DIVU | 33 cycles | Iterative restoring divider, unsigned |
| REM | 33 cycles | Remainder from signed division |
| REMU | 33 cycles | Remainder from unsigned division |

The multiplier is pipelined to enable clean DSP48 inference on Artix-7. Each DSP48E1 block handles a 25x18 signed multiply natively; a full 32x32 → 64-bit multiply is decomposed across multiple blocks. The divider uses a standard restoring algorithm: shift the dividend left one bit per cycle, trial-subtract the divisor, and build up the quotient bit by bit. Signed division takes absolute values of both operands, divides unsigned, then negates the result based on the original signs.

Division by zero is handled per the RISC-V spec without trapping: quotient = all ones (-1 signed), remainder = the dividend. Signed overflow (0x80000000 / -1) returns 0x80000000 with remainder 0.
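The restoring algorithm and the two edge cases can be checked against a short C model. This is a software sketch of the technique, with one loop iteration standing in for one divider cycle; it is not a translation of mdu.sv, and the names are invented for the example.

```c
#include <stdint.h>

typedef struct {
    uint32_t quot; /* quotient bit pattern */
    uint32_t rem;  /* remainder bit pattern */
} div_result_t;

/* Unsigned restoring division: shift in one dividend bit per iteration,
 * trial-subtract the divisor, and record a quotient bit if it fits.
 * A zero divisor naturally yields quot = all ones and rem = dividend. */
static div_result_t divu_restoring(uint32_t dividend, uint32_t divisor) {
    uint64_t rem = 0; /* wide enough to hold the shifted partial remainder */
    uint32_t quot = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);
        quot <<= 1;
        if (rem >= divisor) {
            rem -= divisor;
            quot |= 1u;
        }
    }
    div_result_t r = { quot, (uint32_t)rem };
    return r;
}

/* Signed division: special cases first, then divide magnitudes and fix signs. */
static div_result_t div_signed(int32_t a, int32_t b) {
    div_result_t r;
    if (b == 0) { r.quot = 0xFFFFFFFFu; r.rem = (uint32_t)a; return r; }
    if (a == INT32_MIN && b == -1) { r.quot = 0x80000000u; r.rem = 0u; return r; }
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;
    r = divu_restoring(ua, ub);
    if ((a < 0) != (b < 0)) r.quot = 0u - r.quot; /* quotient sign */
    if (a < 0) r.rem = 0u - r.rem;                /* remainder follows dividend */
    return r;
}
```

The signed path needs the explicit divide-by-zero case: dividing magnitudes and then negating would turn the all-ones quotient into 1 for a negative dividend, which is not what the spec requires.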

Privileged Architecture

The processor implements M-mode (machine mode) from the RISC-V Privileged Specification:

| CSR | Address | Function |
|---|---|---|
| mstatus | 0x300 | Global interrupt enable (MIE), previous interrupt enable (MPIE) |
| mie | 0x304 | Per-source interrupt enable mask |
| mtvec | 0x305 | Trap vector base address |
| mscratch | 0x340 | Scratch register for trap handler use |
| mepc | 0x341 | PC of the instruction that caused the trap |
| mcause | 0x342 | Trap cause code |
| mtval | 0x343 | Additional trap information (faulting instruction/address) |
| mip | 0x344 | Pending interrupts |

All 6 CSR instructions are supported: CSRRW, CSRRS, CSRRC (register operand) and CSRRWI, CSRRSI, CSRRCI (5-bit zero-extended immediate). ECALL, EBREAK, and MRET are fully implemented with correct pipeline flushing and state save/restore.
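The read-modify-write semantics of the six CSR instructions collapse into three operations plus a choice of operand. A minimal C model is sketched below; the enum and function names are hypothetical, and the model ignores the spec's rule that CSRRS/CSRRC with a zero operand perform no write, since the stored value is the same either way in this simplified form.

```c
#include <stdint.h>

typedef enum { CSR_RW, CSR_RS, CSR_RC } csr_op_t;

/* Apply one CSR instruction to *csr and return the old value,
 * which is what the instruction writes to rd. */
static uint32_t csr_access(uint32_t *csr, csr_op_t op, uint32_t src) {
    uint32_t old = *csr;
    switch (op) {
    case CSR_RW: *csr = src;        break; /* CSRRW / CSRRWI: overwrite */
    case CSR_RS: *csr = old | src;  break; /* CSRRS / CSRRSI: set bits */
    case CSR_RC: *csr = old & ~src; break; /* CSRRC / CSRRCI: clear bits */
    }
    return old;
}

/* Immediate variants use the zero-extended 5-bit field in insn[19:15]. */
static uint32_t csr_zimm(uint32_t insn) {
    return (insn >> 15) & 0x1Fu;
}
```

For the register forms, `src` is the forwarded rs1 value; for the immediate forms, it is csr_zimm(insn).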

Performance Counters

Four 64-bit hardware counters tick automatically and are readable via CSR instructions:

| Counter | CSR Address (low / high word) | What it counts |
|---|---|---|
| mcycle | 0xB00 / 0xB80 | Clock cycles since reset |
| minstret | 0xB02 / 0xB82 | Instructions retired (committed at MEM stage) |
| mhpmcounter3 | 0xB03 / 0xB83 | Branch mispredictions |
| mhpmcounter4 | 0xB04 / 0xB84 | Total branches executed |

These enable IPC measurement and predictor tuning from software:

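#include <stdint.h>

uint32_t c0, c1, i0, i1;
// Read the low 32 bits of each counter before the workload.
asm volatile("csrr %0, mcycle" : "=r"(c0));
asm volatile("csrr %0, minstret" : "=r"(i0));
// ... workload ...
asm volatile("csrr %0, mcycle" : "=r"(c1));
asm volatile("csrr %0, minstret" : "=r"(i1));
// Unsigned subtraction stays correct across a single 32-bit wraparound.
// Scale before dividing: integer (i1 - i0) / (c1 - c0) would truncate to 0.
uint32_t ipc_x100 = (100u * (i1 - i0)) / (c1 - c0);  // IPC in hundredths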

Verification

The processor is tested at four levels, from unit to system:

| Level | Testbench | What it covers | Result |
|---|---|---|---|
| 1 | rv32i_tb.sv | Single-cycle reference: every instruction in isolation | PASS |
| 2 | rv32i_pipeline_tb.sv | Pipeline correctness: sum 1-to-10, exercises forwarding, load-use stalls, branch misprediction. Expected: x1 = x5 = 55, mem[0] = 55 | PASS |
| 3 | rv32i_comprehensive_tb.sv | 24-point test covering M-ext (all 8 ops + edge cases), all 6 CSR instructions, trap handling (ecall → handler → mret → resume), performance counters, and pipeline hazard forwarding from mul to dependent add. Automated PASS/FAIL with summary | 24/24 PASS |
| 4 | rv32i_riscv_tests_tb.sv | Official riscv-tests compliance suite: 37 rv32ui-p-* tests covering every RV32I instruction with corner cases | PASS |

The comprehensive test (level 3) is designed to catch the subtle bugs that simple tests miss:

  • Divide-by-zero and signed overflow: verifies the MDU returns spec-compliant results for 7/0, MIN_INT/-1
  • CSR read-modify-write atomicity: writes 0xDEADBEEF to mscratch, reads it back, then chains CSRRS → CSRRC → CSRRWI → CSRRSI → CSRRCI, verifying each intermediate value
  • Trap round-trip: sets mtvec, triggers ecall, handler reads mcause (expects 11), advances mepc past the ecall, executes mret, verifies execution resumes at the correct PC
  • Forwarding under stall: multiply followed by an immediately dependent add, verifying EX2 → EX1 forwarding produces the correct result even with MDU pipeline latency

A GitHub Actions CI workflow runs the pipeline and comprehensive testbenches on every push using Icarus Verilog.

FPGA Demo

The instruction ROM ships with a precompiled Fibonacci program. When deployed on the Basys 3:

  • The serial terminal (115200 baud) prints F(0) through F(19) as each value is computed
  • The LEDs show the lower 16 bits of the current Fibonacci number
  • After completion, LEDs hold F(19) = 4181 = 0x1055 (LEDs 0, 2, 4, 6, 12 lit)
  • Pressing the center button resets the processor and reruns the program

A separate perf_report.c program (compilable with the RISC-V toolchain) runs the same Fibonacci workload and then prints cycle count, instruction count, IPC, total branches, mispredictions, and mispredict rate over UART.

Building

Vivado Simulation

cd <project-dir>
source create_project.tcl
set_property top rv32i_comprehensive_tb [get_filesets sim_1]
launch_simulation
run 2ms

Compiling C Programs

Requires the xPack RISC-V GCC toolchain.

cd programs/c
make fibonacci
make perf_report

Programs compile with -march=rv32im_zicsr, generating hardware multiply/divide and CSR instructions.

FPGA Deployment

  1. Open the project in Vivado (source create_project.tcl)
  2. Ensure fpga_top is the synthesis top
  3. Run synthesis, implementation, and generate bitstream
  4. Program the Basys 3 over JTAG
  5. Open a serial terminal at 115200 baud on the FPGA's COM port
  6. Press the center button to reset and run

Running riscv-tests

git clone https://github.com/riscv-software-src/riscv-tests
cd riscv-tests && git submodule update --init --recursive
autoconf && ./configure --prefix=$PWD/install && make && make install
cd <project-dir>
bash tools/run_riscv_tests.sh ./riscv-tests

File Structure

rtl/
  pkg_riscv.sv                type definitions, opcodes, CSR addresses, exception codes
  pc.sv                       program counter with write enable
  imem.sv                     instruction ROM (1024 x 32, preloaded with fibonacci)
  icache.sv                   64-line direct-mapped instruction cache
  regfile.sv                  32x32 register file with write-through bypass
  control.sv                  main decoder for RV32IM + SYSTEM instructions
  imm_gen.sv                  immediate extraction for all 5 RISC-V formats
  alu.sv                      10-operation arithmetic/logic unit
  mdu.sv                      pipelined multiplier + iterative divider (RV32M)
  csr_unit.sv                 M-mode CSR register file with performance counters
  branch_unit.sv              6-condition branch evaluator
  branch_predictor.sv         gshare PHT + direct-mapped BTB + return address stack
  forwarding_unit.sv          3-source RAW hazard forwarding (EX2, MEM, WB)
  hazard_unit.sv              stall/flush control for 6-stage pipeline
  pipe_if_id.sv               IF/ID pipeline register (with stall + flush)
  pipe_id_ex.sv               ID/EX1 pipeline register (with stall + flush)
  pipe_ex1_ex2.sv             EX1/EX2 pipeline register (with stall + flush)
  pipe_ex_mem.sv              EX2/MEM pipeline register
  pipe_mem_wb.sv              MEM/WB pipeline register
  mmio.sv                     memory-mapped I/O controller (RAM + LEDs + switches + UART)
  dmem.sv                     4 KB data memory (byte/half/word with sign extension)
  uart_tx.sv                  115200 baud 8N1 UART transmitter
  rv32i_top.sv                single-cycle reference implementation
  rv32i_pipeline_top.sv       6-stage pipelined processor (simulation)
  rv32i_pipeline_mmio_top.sv  6-stage pipelined processor with MMIO (FPGA)
  fpga_top.sv                 FPGA wrapper with reset synchronizer

tb/
  rv32i_tb.sv                 single-cycle testbench
  rv32i_pipeline_tb.sv        pipeline testbench (sum 1-to-10)
  rv32i_comprehensive_tb.sv   24-point comprehensive test (M-ext, CSR, traps, hazards)
  rv32i_mext_csr_tb.sv        targeted M-extension + CSR test
  rv32i_riscv_tests_tb.sv     riscv-tests compliance harness

programs/asm/
  sum_1_to_10.s               forwarding + branch validation program
  test_mext_csr.s             M-extension + CSR test program
  test_comprehensive.s        full-coverage test (hand-encoded with verified hex)

programs/c/
  Makefile                    cross-compilation for rv32im_zicsr
  link.ld                     linker script (ROM 0x0000-0x0FFF, stack at top)
  start.s                     bare-metal startup (set SP, call main, halt)
  mmio.h                      hardware register definitions and UART drivers
  fibonacci.c                 Fibonacci sequence with UART + LED output
  bubble_sort.c               array sort with UART output
  perf_report.c               runs fibonacci then prints IPC and branch stats

constraints/
  basys3.xdc                  Basys 3 pin assignments (100 MHz clock, LEDs, switches, UART TX)

tools/
  hex_disasm.py               hex-to-assembly disassembler
  run_riscv_tests.sh          automated riscv-tests runner

.github/workflows/
  sim.yml                     CI: runs pipeline + comprehensive testbenches on every push
