Black Mesa Labs — GeistHaus

BML Designing RISC-V SoCs with FPGAs : Part-Femto-CPU C Blinky

kevinhub88 Sep 14, 2025 Updated Sep 14, 2025

2025.09.13 : I’m Kevin Hubbard, Electrical Engineer. I’ve spent my 30+ year career designing embedded systems using ASICs, FPGAs, and embedded CPUs. It’s been an amazing journey that I hope others will pursue. I’m giving back now in writing this “BML Designing RISC-V SoCs with FPGAs” series which starts here.

The previous chapter enhanced the simple three-instruction Femto CPU by adding two more RISC-V instructions: “lw” (Load Word) for reading from memory and “sw” (Store Word) for writing to it. A short four-DWORD assembly language program then looped, reading the value of the switches mapped to RISC-V high memory space ( 0x10000004 ) and writing that value to the LEDs, also mapped to high memory space ( 0x10000008 ).

This chapter will move past assembly language and demonstrate using the C programming language to compile a simple “Blinky” program to flash the LEDs. The Verilog files from the previous chapter will be re-used and enhanced for this.

Step one is to install a cross-compiler. What’s a cross-compiler? It allows an engineer to write software on one computer and target a completely different CPU architecture. Oftentimes this means running GCC on an 80x86 Linux workstation and targeting an embedded CPU like the RISC-V. Someday we may all have RISC-V Linux workstations on our desks, but not today.

Installing GCC for RISC-V on my Ubuntu 22.04 LTS Linux workstation was super simple.

%sudo apt install gcc-riscv64-unknown-elf binutils-riscv64-unknown-elf

Compiling C code to a binary file takes a couple of steps:

Step-1 : Build a linker load file. The RISC-V CPU has a 32bit memory space ( ignoring the 64bit RISC-V for the moment ). The linker needs to know which portions of that memory space are populated with actual memory. An embedded system, as an example, might have Read-Only Flash memory for machine code instructions and also SRAM for storing variables and large memory allocations ( mallocs() in C parlance ).

[ link.ld ]
MEMORY {
  FLASH (rx) : ORIGIN = 0x10000000, LENGTH = 256K
  RAM   (rwx): ORIGIN = 0x20000000, LENGTH = 64K
}

SECTIONS {
  .text : {
    *(.text)
  } > FLASH

  .data : {
    *(.data)
  } > RAM

  .bss : {
    *(.bss)
  } > RAM
}

The “.text” section refers to executable code. Confusing, right? The section gets its name from historical Unix and compiler conventions, where “text” referred to executable code – not readable ASCII characters as we might assume today.

The “.data” section refers to initialized data – meaning variables.

The “.bss” section refers to uninitialized data – meaning variables with no explicit startup value ( a standard C startup routine zeroes them ).
The “.heap” section is for memory allocations – aka mallocs().
The “.stack” section is, well, the Stack. The Stack is a linear data structure that follows the Last In, First Out (LIFO) principle, meaning the last item added is the first one removed. It’s used for local variables and return addresses for function calls.
A final section ( not shown ) is “.rodata” – or Read-Only data, meaning constants: values which can not be changed.

For this Femto-CPU design, the linker load file is fairly simple: everything lives in one very small RAM.

[ link.ld ]
ENTRY(_start)

MEMORY {
RAM (rwx) : ORIGIN = 0x00000000, LENGTH = 64 - 0
}

SECTIONS {
.stack (NOLOAD) :
{
_stack_end = ORIGIN(RAM) + LENGTH(RAM);
_stack_start = _stack_end - 0x0010; /* 16 Bytes */
} > RAM
. = ORIGIN(RAM); /* Reset counter to start of RAM */
.text : { *(.text.init) *(.text) } > RAM
.rodata : { *(.rodata) } > RAM
.data : { *(.data) } > RAM
.bss (NOLOAD) : { _bss_start = .; *(.bss) _bss_end = .; } > RAM
.heap (NOLOAD) : { _heap_start = .; } > RAM
PROVIDE(_sp = _stack_end);
}

The C code is quite simple. Count an unsigned integer named cnt in a forever loop and write the value to the LED peripheral at 0x10000008 using the C pointer led_ptr.

[ main.c ]
typedef unsigned int uint32_t;// stdint.h not available
volatile uint32_t* led_ptr = (uint32_t*)0x10000008;
int main() 
{
  uint32_t cnt = 0;
  while (1) 
  {
    cnt++;
    *led_ptr = cnt;
  }
  return 0;
}

Compiling is just a single command line. To check things, we will compile to assembly first. Using the -O2 optimization level keeps the generated code simple enough that it does not need the stack pointer.

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -T link.ld -S main.c -o main.s

Which reveals a definite problem. The compiled program uses RISC-V registers “a4” and “a5”, not the temporary registers “t0” and “t1” that the Femto-CPU implements.

	.file	"main.c"
	.option nopic
	.attribute arch, "rv32i2p0"
	.attribute unaligned_access, 0
	.attribute stack_align, 16
	.text
	.section	.text.startup,"ax",@progbits
	.align	2
	.globl	main
	.type	main, @function
main:
	lui	a5,%hi(led_ptr)
	lw	a4,%lo(led_ptr)(a5)
	li	a5,0
.L2:
	addi	a5,a5,1
	sw	a5,0(a4)
	j	.L2
	.size	main, .-main
	.globl	led_ptr
	.section	.sdata,"aw"
	.align	2
	.type	led_ptr, @object
	.size	led_ptr, 4
led_ptr:
	.word	0x10000008
	.ident	"GCC: () 10.2.0"

No need to panic though. This just leads to the next lab assignment. The assignment is to enhance the femto_cpu.v Verilog file to add registers “a4” (x14) and “a5” (x15) to the existing instructions. A warning: x14 and x15 do not mean 0x14 and 0x15 hexadecimal, but rather 14 and 15 decimal.

Compiling the C code to a ROM *.bin file takes a couple of steps.

[ go.sh ]
# Compile *.c to *.elf
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -nostartfiles -nostdlib -ffreestanding -T link.ld -o main.elf main.c
# Convert ELF to Raw Binary (Optional)
riscv64-unknown-elf-objcopy -O binary main.elf main.bin

Once that first task is completed, update femto_wrom.v to contain the compiled C program. There are two ways to get the six DWORDs ( five instructions plus the led_ptr data word ). Option-1 is to copy and paste the assembly code from main.s into the online assembler.

After compiling the assembly code, click on the “Disassembly” tab to see the six DWORDs of machine code in RAM starting at address 0x00000000.

Option-2 is to hexdump the main.bin file that GCC generated. Unfortunately, by default the hexdump utility dumps 16-bit WORDs instead of 32-bit DWORDs.

0000000 2703 0140 0793 0000 8793 0017 2023 00f7
0000010 f06f ff9f 0008 1000

By passing a custom format string, it is possible to get just a dump of DWORDs in proper order.

%hexdump -e '8/4 "%08x "' -e '"\n"' main.bin
01402703 00000793 00178793 00f72023 ff9ff06f 10000008

The final part of the lab assignment is to simulate femto_core.v with the modified femto_cpu.v and femto_wrom.v files included.

%vsim femto_core

The force file doesn’t do much other than provide a clock and toggle reset.

[ force_femto_core.do ]
force clk 0 5 ns, 1 10 ns -repeat 10 ns
force reset 1; run 20 ns;
force reset 0; run 300 ns;

The wave file lists the memory interface and the internal CPU registers.

[ wave_femto_core.do ]
add wave /femto_core/clk
add wave /femto_core/reset
add wave -radix hex /femto_core/led
add wave -divider {Memory Bus}
add wave -radix unsigned /femto_core/u_femto_cpu/bus_addr
add wave -radix hex /femto_core/u_femto_cpu/bus_wr_en
add wave -radix hex /femto_core/u_femto_cpu/bus_wr_d
add wave -radix hex /femto_core/u_femto_cpu/bus_rd_d
add wave -divider {Memory Cells}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[0]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[1]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[2]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[3]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[4]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[5]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[6]}
add wave -radix hex {/femto_core/u_femto_wrom/wrom_array[7]}
add wave -divider {CPU Registers}
add wave -radix unsigned /femto_core/u_femto_cpu/pc
add wave -radix hex /femto_core/u_femto_cpu/t0
add wave -radix hex /femto_core/u_femto_cpu/t1
add wave -radix hex /femto_core/u_femto_cpu/a4
add wave -radix hex /femto_core/u_femto_cpu/a5
add wave -radix binary /femto_core/u_femto_cpu/opcode
add wave -radix binary /femto_core/u_femto_cpu/funct3
add wave -radix hex /femto_core/u_femto_cpu/rd
add wave -radix hex /femto_core/u_femto_cpu/rs1
add wave -radix hex /femto_core/u_femto_cpu/rs2

If your modifications to femto_cpu.v and femto_wrom.v were correct, the simulation should show the LED array mapped to high address 0x10000008 incrementing by +1 every six clock cycles.

If your simulation looks good, go ahead and build an FPGA bitfile and watch the LEDs flash in real hardware. The following files will build the design targeting the Digilent Artix-7 BASYS3 board.

[ go.sh ]
vivado -mode batch -source go.tcl

[ go.tcl ]
set design_name top
set device      xc7a35tcpg236-1
set_part $device
set rep_dir ./reports ; file mkdir $rep_dir
set tmp_dir ./temp    ; file mkdir $tmp_dir
source top_rtl_list.tcl
read_xdc ./${design_name}_timing.xdc
synth_design -top $design_name -part $device -fsm_extraction off
report_timing_summary -file post_synth_timing_summary.rpt
read_xdc ./${design_name}_physical.xdc
opt_design
place_design
route_design
# report_timing is assumed here; the start of this command was missing
report_timing -file $rep_dir/post_route_timing_worst.rpt
report_timing_summary -file $rep_dir/post_route_timing_summary.rpt
report_utilization -file $rep_dir/post_route_util.rpt
report_power -file $rep_dir/post_route_pwr.rpt
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
write_bitstream -force ${design_name}.bit
exit

[ top_rtl_list.tcl ]
read_verilog ../src/top.v
read_verilog ../src/femto_core.v
read_verilog ../src/femto_cpu.v
read_verilog ../src/femto_wrom.v

[ top_timing.xdc ]
create_clock -period 10.000 -name clk_100m -waveform {0.000 5.000} [get_ports clk_100m_pin]

[ top_physical.xdc ]
set_property -dict { PACKAGE_PIN W5   IOSTANDARD LVCMOS33 } [get_ports clk]
create_clock -add -name sys_clk_pin -period 10.00 -waveform {0 5} [get_ports clk]

## Switches
set_property -dict { PACKAGE_PIN V17   IOSTANDARD LVCMOS33 } [get_ports {sw[0]}]
set_property -dict { PACKAGE_PIN V16   IOSTANDARD LVCMOS33 } [get_ports {sw[1]}]
set_property -dict { PACKAGE_PIN W16   IOSTANDARD LVCMOS33 } [get_ports {sw[2]}]
set_property -dict { PACKAGE_PIN W17   IOSTANDARD LVCMOS33 } [get_ports {sw[3]}]
set_property -dict { PACKAGE_PIN W15   IOSTANDARD LVCMOS33 } [get_ports {sw[4]}]
set_property -dict { PACKAGE_PIN V15   IOSTANDARD LVCMOS33 } [get_ports {sw[5]}]
set_property -dict { PACKAGE_PIN W14   IOSTANDARD LVCMOS33 } [get_ports {sw[6]}]
set_property -dict { PACKAGE_PIN W13   IOSTANDARD LVCMOS33 } [get_ports {sw[7]}]
set_property -dict { PACKAGE_PIN V2    IOSTANDARD LVCMOS33 } [get_ports {sw[8]}]
set_property -dict { PACKAGE_PIN T3    IOSTANDARD LVCMOS33 } [get_ports {sw[9]}]
set_property -dict { PACKAGE_PIN T2    IOSTANDARD LVCMOS33 } [get_ports {sw[10]}]
set_property -dict { PACKAGE_PIN R3    IOSTANDARD LVCMOS33 } [get_ports {sw[11]}]
set_property -dict { PACKAGE_PIN W2    IOSTANDARD LVCMOS33 } [get_ports {sw[12]}]
set_property -dict { PACKAGE_PIN U1    IOSTANDARD LVCMOS33 } [get_ports {sw[13]}]
set_property -dict { PACKAGE_PIN T1    IOSTANDARD LVCMOS33 } [get_ports {sw[14]}]
set_property -dict { PACKAGE_PIN R2    IOSTANDARD LVCMOS33 } [get_ports {sw[15]}]


## LEDs
set_property -dict { PACKAGE_PIN U16   IOSTANDARD LVCMOS33 } [get_ports {led[0]}]
set_property -dict { PACKAGE_PIN E19   IOSTANDARD LVCMOS33 } [get_ports {led[1]}]
set_property -dict { PACKAGE_PIN U19   IOSTANDARD LVCMOS33 } [get_ports {led[2]}]
set_property -dict { PACKAGE_PIN V19   IOSTANDARD LVCMOS33 } [get_ports {led[3]}]
set_property -dict { PACKAGE_PIN W18   IOSTANDARD LVCMOS33 } [get_ports {led[4]}]
set_property -dict { PACKAGE_PIN U15   IOSTANDARD LVCMOS33 } [get_ports {led[5]}]
set_property -dict { PACKAGE_PIN U14   IOSTANDARD LVCMOS33 } [get_ports {led[6]}]
set_property -dict { PACKAGE_PIN V14   IOSTANDARD LVCMOS33 } [get_ports {led[7]}]
set_property -dict { PACKAGE_PIN V13   IOSTANDARD LVCMOS33 } [get_ports {led[8]}]
set_property -dict { PACKAGE_PIN V3    IOSTANDARD LVCMOS33 } [get_ports {led[9]}]
set_property -dict { PACKAGE_PIN W3    IOSTANDARD LVCMOS33 } [get_ports {led[10]}]
set_property -dict { PACKAGE_PIN U3    IOSTANDARD LVCMOS33 } [get_ports {led[11]}]
set_property -dict { PACKAGE_PIN P3    IOSTANDARD LVCMOS33 } [get_ports {led[12]}]
set_property -dict { PACKAGE_PIN N3    IOSTANDARD LVCMOS33 } [get_ports {led[13]}]
set_property -dict { PACKAGE_PIN P1    IOSTANDARD LVCMOS33 } [get_ports {led[14]}]
set_property -dict { PACKAGE_PIN L1    IOSTANDARD LVCMOS33 } [get_ports {led[15]}]

## Configuration options, can be used for all designs
set_property CONFIG_VOLTAGE 3.3 [current_design]
set_property CFGBVS VCCO [current_design]

## SPI configuration mode options for QSPI boot, can be used for all designs
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
set_property BITSTREAM.CONFIG.CONFIGRATE 33 [current_design]
set_property CONFIG_MODE SPIx4 [current_design]

This ends the chapter on using the C programming language to program the Femto CPU core. The minimal Femto CPU core has been a valuable educational tool in learning RISC-V assembly and C programming. With only five RISC-V instructions implemented, Femto CPU is of little practical use beyond these educational tutorials. The next chapter in this series will introduce the open-source Hazard3 RISC-V core which supports all 47 RISC-V instructions.

http://blackmesalabs.wordpress.com/?p=3086


BML Designing RISC-V SoCs with FPGAs : Part-Femto-CPU Memory Access

kevinhub88 Sep 7, 2025 Updated Sep 8, 2025

2025.09.07 : I’m Kevin Hubbard, Electrical Engineer. I’ve spent my 30+ year career designing embedded systems using ASICs, FPGAs, and embedded CPUs. It’s been an amazing journey that I hope others will pursue. I’m giving back now in writing this “BML Designing RISC-V SoCs with FPGAs” series which starts here.

The previous chapter introduced a very simple CPU core capable of executing only 3 of 47 RISC-V machine code instructions. Those three are just enough to increment a 32 bit register in a loop. This chapter enhances femto_cpu.v to include bus read and write access and provides an example of mapping hardware peripherals ( Switches and LEDs ) to RISC-V memory space.

The new component femto_core.v will map 16 switches to address 0x10000004 and 16 LEDs to address 0x10000008. It will also map femto_wrom.v to address 0x00000000. What’s a WROM? Well, it’s a writable ROM, sometimes known as a pre-initialized RAM. The femto_wrom.v memory will initialize with the machine code, but can also be overwritten by the CPU after boot.

The assembly program to run is quite simple: read the switches and write their values out to the LEDs. A CPU definitely isn’t required for this, but it does exercise both bus reads and writes to external hardware peripherals. This program requires both the t0 and t1 RISC-V temporary registers and also two new instructions, lw (Load Word) and sw (Store Word), to access the switches and LEDs.

      lui t0, 0x10000   # Load base address 0x10000000 into t0
loop: lw  t1, 4(t0)     # Load word from 0x10000004 into t1
      sw  t1, 8(t0)     # Store word t1 to 0x10000008
      j   loop          # Jump back two DWORDs

For now, rather than using GCC cross-compiler tools, I’m using this excellent online assembler and simulator for compiling my assembly code into machine code. Single stepping the software in the online simulator and verifying it matches my Verilog simulations in ModelSim is extremely beneficial.

Clicking “Compile and Load” will display 4 DWORDs of the machine opcode instructions.

With this information, I then type in my ROM/RAM component. Note the 4 DWORDs match the assembler’s output. Keeping the memory to just 16 DWORDs is helpful as the online assembler/simulator tool can only display so much information on a single web page.

[ femto_wrom.v ]
module femto_wrom
(
  input wire        clk,
  input wire        wen,
  input wire        ren,
  input wire [31:0] addr,
  input wire [31:0] wdata,
  output reg [31:0] rdata
);

  reg  [31:0]   wrom_array[16-1:0];

//---------------------------------
// Initialize ROM at configuration
//---------------------------------
initial
begin
  wrom_array[8'h00 / 4 ] = 32'h100002b7;
  wrom_array[8'h04 / 4 ] = 32'h0042a303;
  wrom_array[8'h08 / 4 ] = 32'h0062a423;
  wrom_array[8'h0C / 4 ] = 32'hff9ff06f;
end

//---------------------------------
// RAM
//---------------------------------
always @( posedge clk )
begin
  if ( wen == 1 ) begin
    wrom_array[ addr[5:2] ] <= wdata[31:0];
  end
  if ( ren == 1 ) begin
    rdata <= wrom_array[ addr[5:2] ];
  end
end // always

endmodule // femto_wrom.v

The top level file femto_core.v stitches the memory to the CPU and also hooks up the 16 switches and 16 LEDs. I broke it into two parts; part 1 is the module stitching.

[ femto_core.v 1of2 ]
module femto_core
(
  input  wire        clk,
  input  wire        reset,
  input  wire [15:0] sw,
  output wire [15:0] led
);

  wire        bus_wr_en;
  wire [31:0] bus_addr;
  reg  [31:0] bus_addr_p1;
  wire [31:0] bus_wr_d;
  reg  [31:0] bus_rd_d;
  reg  [15:0] led_loc;
  wire [31:0] wrom_rd_d;
  reg         wrom_rd_en;
  reg         wrom_wr_en;
  reg  [15:0] sw_p1;

femto_cpu u_femto_cpu
(
  .clk       ( clk            ),
  .reset     ( reset          ),
  .bus_wr_en ( bus_wr_en      ),
  .bus_addr  ( bus_addr[31:0] ),
  .bus_wr_d  ( bus_wr_d[31:0] ),
  .bus_rd_d  ( bus_rd_d[31:0] )
);

femto_wrom u_femto_wrom
(
  .clk       ( clk             ),
  .ren       ( wrom_rd_en      ),
  .wen       ( wrom_wr_en      ),
  .addr      ( bus_addr[31:0]  ),
  .wdata     ( bus_wr_d[31:0]  ),
  .rdata     ( wrom_rd_d[31:0] )
);

The second part contains the memory mapping to the switches and LEDs. It also includes a readback mux so that the CPU can read from either the memory or the switches.

[ femto_core.v 2of2 ]
//----------------------------------
// Read mux
//----------------------------------
always @ (posedge clk) begin
  bus_addr_p1 <= bus_addr[31:0];
  sw_p1       <= sw[15:0];
end

always @ ( * ) begin
  if ( bus_addr_p1[31:28] == 4'h0 ) begin
    bus_rd_d   <= wrom_rd_d[31:0];
  end else if ( bus_addr_p1[31:28] == 4'h1 ) begin
    if ( bus_addr_p1[31:0] == 32'h10000004 ) begin
      bus_rd_d   <= { 16'd0, sw_p1[15:0] };
    end else begin
      bus_rd_d   <= 32'hXXXXXXXX;// Assign on every path to avoid a latch
    end
  end else begin
    bus_rd_d   <= 32'hXXXXXXXX;
  end
end

//----------------------------------------------
// Write-Only Hardware Peripheral mapped up high
//----------------------------------------------
always @( posedge clk )
begin
  if ( bus_wr_en == 1 && bus_addr == 32'h10000008 ) begin
    led_loc <= bus_wr_d[15:0];  // Sims
//  led_loc <= bus_wr_d[31:16]; // Hardware
  end
end // always
  assign led = led_loc[15:0];

//---------------------------------------------
// Mux the strobes based on MSB address nibble
//---------------------------------------------
always @( * )
begin
  if ( bus_addr[31:28] == 4'h0 ) begin
    wrom_rd_en <= 1;
    wrom_wr_en <= bus_wr_en;
  end else begin
    wrom_rd_en <= 0;
    wrom_wr_en <= 0;
  end
end // always

endmodule

Changes to the CPU itself are mostly adding the two new instructions lw ( Load Word ) and sw ( Store Word ). The pipeline got a bit more complicated, so extra logic was added for that as well.

[ femto_cpu.v Two new instructions ]
      // S-Type
      end else if ( opcode == 7'b0100011 ) begin
        // sw ( Store Word )
        if ( funct3 == 3'b010 ) begin
          pc          <= pc[31:0] - 4;// Dont inc PC
          pipe_stall  <= 1;
          pipe_stall2 <= 1;
          sw_addr <= rs1_muxd[31:0]+{{20{s_imm[11]}},s_imm[11:0]};
          wdata   <= rs2_muxd;
          bus_wr  <= 1;
        end

      // I-Type
      end else if ( opcode == 7'b0000011 ) begin
        // lw ( Load Word )
        if ( funct3 == 3'b010 ) begin
          pc          <= pc[31:0] - 4;// Dont inc PC
          pipe_stall  <= 1;
          pipe_stall2 <= 1;
          rd_lw   <= rd[4:0];// Remember who the read goes to
          sw_addr <= rs1_muxd[31:0] + {{20{i_imm[11]}},i_imm[11:0]};
          bus_rd  <= 1;
        end

Here is the CPU in its entirety, well – broken up into 3 parts anyways.

[ femto_cpu.v 1of3 ]
module femto_cpu
(
  input  wire        clk,
  input  wire        reset,
  output wire [31:0] bus_addr,
  output wire        bus_wr_en,
  output wire [31:0] bus_wr_d,
  input  wire [31:0] bus_rd_d
);

  reg         pipe_stall;
  reg         pipe_stall2;
  reg  [31:0] pc;
  reg  [31:0] t0;
  reg  [31:0] t1;
  wire [6:0]  opcode;
  wire [4:0]  rd;
  wire [4:0]  rs1;
  wire [4:0]  rs2;
  wire [2:0]  funct3;
  wire [11:0] s_imm;
  wire [11:0] i_imm;
  wire [20:0] j_imm;
  wire [19:0] u_imm;
  wire [31:0] rdata;
  reg  [31:0] wdata;
  reg  [31:0] sw_addr;
  reg         bus_wr;
  reg         bus_rd;
  reg         bus_rd_p1;
  reg  [31:0] rs1_muxd;
  reg  [31:0] rs2_muxd;
  reg  [4:0]  rd_lw;
  reg  [4:0]  rd_lw_p1;
  reg         reset_p1;


  assign bus_addr  = (bus_wr==1 || bus_rd==1 ) ? sw_addr[31:0]:pc[31:0];
  assign bus_wr_en = bus_wr;
  assign bus_wr_d  = wdata[31:0];
  assign rdata     = bus_rd_d[31:0];

Part 2 of 3 is the actual decoder logic with the new instructions added.

[ femto_cpu.v 2of3 ]
always @ (posedge clk) begin
  if ( reset == 1 ) begin
    pc          <= 32'd0;// 0x00000040 for real RISC-V
    t0          <= 32'd0;
    t1          <= 32'd0;
    pipe_stall  <= 1;
    pipe_stall2 <= 0;
    bus_wr      <= 0;
    bus_rd      <= 0;
    bus_rd_p1   <= 0;
    rd_lw       <= 5'd0;
    rd_lw_p1    <= 5'd0;
    wdata       <= 32'd0;
    reset_p1    <= 1;
  end else begin
    reset_p1    <= 0;
    pipe_stall2 <= 0;
    pipe_stall  <= pipe_stall2;
    bus_wr      <= 0;
    bus_rd      <= 0;
    bus_rd_p1   <= bus_rd;
    rd_lw       <= 5'd0;
    rd_lw_p1    <= rd_lw;
    wdata       <= 32'd0;
    pc          <= pc[31:0] + 4;
    if ( pipe_stall == 0 ) begin
      // I-Type
      if ( opcode == 7'b0010011 ) begin
        // addi ( Add Immediate )
        if ( funct3 == 3'b000 ) begin
          case ( rd[4:0] )
            5'h05 : t0 <= rs1_muxd + {{20{i_imm[11]}},i_imm[11:0]};
            5'h06 : t1 <= rs1_muxd + {{20{i_imm[11]}},i_imm[11:0]};
          endcase
        end

      // U-Type : lui ( Load Upper Immediate )
      end else if ( opcode == 7'b0110111 ) begin
        case ( rd[4:0] )
          5'h05   : t0 <= { u_imm[19:0], 12'd0 };
          5'h06   : t1 <= { u_imm[19:0], 12'd0 };
        endcase

      // S-Type
      end else if ( opcode == 7'b0100011 ) begin
        // sw ( Store Word )
        if ( funct3 == 3'b010 ) begin
          pc          <= pc[31:0] - 4;// Dont inc PC
          pipe_stall  <= 1;
          pipe_stall2 <= 1;
          sw_addr <= rs1_muxd[31:0] + {{20{s_imm[11]}},s_imm[11:0]};
          wdata   <= rs2_muxd;
          bus_wr  <= 1;
        end
      // I-Type
      end else if ( opcode == 7'b0000011 ) begin
        // lw ( Load Word )
        if ( funct3 == 3'b010 ) begin
          pc          <= pc[31:0] - 4;// Dont inc PC
          pipe_stall  <= 1;
          pipe_stall2 <= 1;
          rd_lw   <= rd[4:0];// Remember who the read goes to
          sw_addr <= rs1_muxd[31:0] + {{20{i_imm[11]}},i_imm[11:0]};
          bus_rd  <= 1;
        end

      // J-Type Jump
      end else if ( opcode == 7'b1101111 ) begin
        if ( rd == 5'h00 ) begin
          pc <= pc -4 + {{11{j_imm[20]}}, j_imm[20:0] };//s20->s31
          pipe_stall <= 1;
        end
      end
    end else begin
      if ( bus_rd_p1 == 1 ) begin
        case ( rd_lw_p1[4:0] )
          5'h05   : t0 <= rdata[31:0];
          5'h06   : t1 <= rdata[31:0];
        endcase
      end
    end
  end
end

And finally Part 3of3 :

[ femto_cpu.v 3of3 ]
  assign opcode = rdata[6:0];  // Base Opcode
  assign rd     = rdata[11:7]; // Destination Register
  assign funct3 = rdata[14:12];// Function Code
  assign rs2    = rdata[24:20];// Source Register
  assign rs1    = rdata[19:15];// Source Register
  assign i_imm  = rdata[31:20];
  assign j_imm  = { rdata[31], rdata[19:12], rdata[20], rdata[30:21], 1'b0 };
  assign u_imm  = rdata[31:12];
  assign s_imm  = { rdata[31:25], rdata[11:7]};

// rs1 Source mux
always @ ( * ) begin
  case ( rs1 )
    5'h00   : rs1_muxd <= 32'd0;
    5'h05   : rs1_muxd <= t0[31:0];
    5'h06   : rs1_muxd <= t1[31:0];
    default : rs1_muxd <= 32'd0;
  endcase
end

// rs2 Source mux
always @ ( * ) begin
  case ( rs2 )
    5'h00   : rs2_muxd <= 32'd0;
    5'h05   : rs2_muxd <= t0[31:0];
    5'h06   : rs2_muxd <= t1[31:0];
    default : rs2_muxd <= 32'd0;
  endcase
end

endmodule
`default_nettype wire // enable Verilog default

Simulating is quite simple, requiring only this very short ModelSim do file.

[ force.do ]
force clk 0 5 ns, 1 10 ns -repeat 10 ns
force sw 16#aaaa
force reset 1; run 20 ns;
force reset 0; run 300 ns;

Viewing the simulation, it is quite clear that the memory accesses to the Switches and LEDs result in many pipeline stalls.

Lab Assignment : Enhance the assembly counter program from the previous chapter to write 16 bits of the count value to the 16 LEDs.
Question : Why should the 16 LEDs be mapped to the CPU’s D[15:0] bits for simulation and the D[31:16] bits for actual FPGA operation?

This ends the Femto CPU core module on memory access. For the next chapter in the series, see “BML Designing RISC-V SoCs with FPGAs” introduction here.

http://blackmesalabs.wordpress.com/?p=3056


BML Designing RISC-V SoCs with FPGAs : Part-Femto-CPU Counter

kevinhub88 Sep 6, 2025 Updated Sep 7, 2025

2025.09.06 : I’m Kevin Hubbard, Electrical Engineer. I’ve spent my 30+ year career designing embedded systems using ASICs, FPGAs, and embedded CPUs. It’s been an amazing journey that I hope others will pursue. I’m giving back now in writing this “BML Designing RISC-V SoCs with FPGAs” series which starts here.

The previous chapter explained RISC-V assembly language and machine code. This chapter introduces a very simple CPU core which is capable of executing only 3 of 47 RISC-V machine code instructions. Those three are just enough to increment a 32 bit register in a loop. Eventually this “Femto-CPU” will flash a blinky LED.

The following assembly language code requires just three DWORDs of memory. Three DWORDs of memory for executing three instructions. That’s RISC in a nutshell. The ability to execute a new instruction every clock cycle is a speed advantage RISC has over CISC. Instruction opcodes and data are often packed together into a single 32bit DWORD.

      mv   t0,zero   # Move zero into t0 register
loop: addi t0,t0,1   # Add 1 to t0 register
      j    loop      # Jump back 1 DWORD to address of label "loop"

Using an assembler such as this on-line RISC-V assembler, the assembly compiles into three machine language instructions.

0x00000000 : 0x00000293 # Move zero into t0 register
0x00000004 : 0x00128293 # Add 1 to t0 register
0x00000008 : 0xffdff06f # Jump back 1 DWORD to address 0x00000004

From here, it’s easy enough to build an inferrable ROM in Verilog just by hand. This ROM is 4x32 even though it decodes as 1Gx32. It has single-clock latency, meaning that the data is available one clock after the address has changed. This is very important knowledge for the instruction decoder. Addresses are Byte oriented, but DWORD aligned, so each instruction is +4 bytes from the previous.

[ femto_rom.v ]
module femto_rom
(
  input wire        clk,
  input wire [31:0] addr,
  output reg [31:0] rdata
);

always @ (posedge clk) begin
  case ( addr )
    32'h00  : rdata <= 32'h00000293;// mv t0, zero   
    32'h04  : rdata <= 32'h00128293;// addi t0, t0, 1
    32'h08  : rdata <= 32'hffdff06f;// j -4          
    default : rdata <= 32'd0;
  endcase
end

endmodule

A top level Verilog structural file then hooks up the CPU to the ROM. There is no RAM in this design (yet), so the write bus is left dangling.

[ femto_core.v ]
module femto_core
(
  input  wire        clk,
  input  wire        reset
);

  wire        bus_wr_en;
  wire [31:0] bus_addr;
  wire [31:0] bus_wr_d;
  wire [31:0] bus_rd_d;
  
femto_cpu u_femto_cpu
(
  .clk       ( clk            ),
  .reset     ( reset          ),
  .bus_wr_en ( bus_wr_en      ),
  .bus_addr  ( bus_addr[31:0] ),
  .bus_wr_d  ( bus_wr_d[31:0] ),
  .bus_rd_d  ( bus_rd_d[31:0] )
);

femto_rom u_femto_rom
(
  .clk       ( clk            ),
  .addr      ( bus_addr[31:0] ),
  .rdata     ( bus_rd_d[31:0] )
);

endmodule // femto_core.v

The actual CPU core itself is where things get interesting. Part-1 is just the regular header stuff and wire and register definitions. The CPU takes in a clock and reset and has a memory bus interface out.

[ femto_cpu.v 1of3 ]
module femto_cpu
(
  input  wire        clk,
  input  wire        reset,
  output wire [31:0] bus_addr,
  output wire        bus_wr_en,
  output wire [31:0] bus_wr_d,
  input  wire [31:0] bus_rd_d
);

  reg         pipe_stall;
  reg  [31:0] pc;
  reg  [31:0] t0;
  wire [6:0]  opcode;
  wire [2:0]  funct3;
  wire [4:0]  rd;
  wire [4:0]  rs1;
  wire [4:0]  rs2;
  wire [11:0] i_imm;
  wire [20:0] j_imm;
  wire [19:0] u_imm;
  wire [31:0] rdata;
  reg  [31:0] rs1_muxd;

  assign bus_addr  = pc[31:0];
  assign bus_wr_en = 0;
  assign bus_wr_d  = 32'd0;
  assign rdata     = bus_rd_d[31:0];

Part-2 is the actual instruction decoder. RISC-V wants to execute a new instruction every clock cycle. The assumption is that the PC (Program Counter) will increment +4 bytes ( 1 DWORD ) every clock and one clock later that next instruction will be available for decoding. When this doesn’t happen, like in a Jump instruction, the decoder pipeline must be stalled until the next valid instruction is available from the ROM. When the decoder decodes an assembly Jump instruction, it immediately asserts pipe_stall, forcing the decoder to wait until the pipeline has caught up.

[ femto_cpu.v 2of3 ]
always @ (posedge clk) begin
  if ( reset == 1 ) begin
    pipe_stall  <= 1;
    pc          <= 32'd0;// 0x00000040 for real RISC-V
    t0          <= 32'd0;
  end else begin
    pipe_stall  <= 0;
    pc          <= pc + 32'd4;// Default Behavior
    if ( pipe_stall == 0 ) begin
      // I-Type
      if ( opcode == 7'b0010011 ) begin
        // addi ( Add Immediate )
        if ( funct3 == 3'b000 ) begin
          case ( rd[4:0] )
            5'h05 : t0 <= rs1_muxd + {{20{i_imm[11]}},i_imm[11:0]};
          endcase
        end
      // U-Type : lui ( Load Upper Immediate )
      end else if ( opcode == 7'b0110111 ) begin
        case ( rd[4:0] )
          5'h05 : t0 <= { u_imm[19:0], 12'd0 };
        endcase
      // J-Type Jump
      end else if ( opcode == 7'b1101111 ) begin
        if ( rd == 5'h00 ) begin
          pc <= pc -4 + {{11{j_imm[20]}}, j_imm[20:0]};//s20->s31
          pipe_stall <= 1;
        end
      end
    end
  end
end

Part-3 contains a mux for rs1 source selection and also does the instruction bit-unpacking of a 32bit DWORD from ROM into the various instruction fields. Note that the immediate bits are packed differently depending on the instruction type ( I, J, and U ). The rs1 mux selects the source operand for the add: either t0 ( as in addi t0,t0,1 ) or zero ( as in mv t0,zero, which adds an immediate of 0 to zero and so effectively zeroes out the t0 register ).

[ femto_cpu.v 3of3 ]
  assign opcode = rdata[6:0];  // Base Opcode
  assign funct3 = rdata[14:12];// Function Code
  assign rs1    = rdata[19:15];// Source Register
  assign rs2    = rdata[24:20];// Source Register
  assign rd     = rdata[11:7]; // Destination Register
  assign i_imm  = rdata[31:20];
  assign j_imm  = {rdata[31],rdata[19:12],rdata[20],rdata[30:21],1'b0};
  assign u_imm  = rdata[31:12];

// rs1 Source mux
always @ ( * ) begin
  case ( rs1 )
    5'h00   : rs1_muxd <= 32'd0;
    5'h05   : rs1_muxd <= t0[31:0];
    default : rs1_muxd <= 32'd0;
  endcase
end

endmodule // femto_cpu.v

That’s it. That’s the Femto-CPU in its entirety. Simulation only requires a very simple ModelSim do file.

force clk 0 5 ns, 1 10 ns -repeat 10 ns
force reset 1; run 10 ns;
force reset 0; run 100 ns;

This simulation runs for 100 ns and shows the t0 register incrementing +1 three times in a loop from PC 0x00000008 back to 0x00000004.
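For readers who think better in software, here is a rough C sketch of what the Femto CPU does at the instruction level. It is not a translation of the Verilog above: it ignores the pipe_stall cycle entirely, and the names femto_t, femto_step, sext and FEMTO_ROM are made up for illustration. It also assumes the ROM holds the three words from the assembly language chapter at addresses 0x0, 0x4 and 0x8.

```c
#include <stdint.h>

/* Assumed ROM contents: the three-instruction counter program. */
static const uint32_t FEMTO_ROM[3] = {
    0x00000293u, /* mv   t0,zero  ( addi t0,zero,0 ) */
    0x00128293u, /* addi t0,t0,1                     */
    0xffdff06fu  /* j    -4                          */
};

/* Hypothetical CPU state: just the program counter and t0. */
typedef struct {
    uint32_t pc;
    uint32_t t0;
} femto_t;

/* Sign-extend the low `bits` bits of value to 32 bits. */
static int32_t sext(uint32_t value, int bits)
{
    uint32_t sign = 1u << (bits - 1);
    return (int32_t)((value ^ sign) - sign);
}

/* Execute one instruction. No pipeline, no stall cycle. */
static void femto_step(femto_t *cpu, const uint32_t *rom)
{
    uint32_t rdata  = rom[cpu->pc >> 2];
    uint32_t opcode = rdata & 0x7Fu;          /* rdata[6:0]   */
    uint32_t rd     = (rdata >> 7) & 0x1Fu;   /* rdata[11:7]  */
    uint32_t funct3 = (rdata >> 12) & 0x7u;   /* rdata[14:12] */
    uint32_t rs1    = (rdata >> 15) & 0x1Fu;  /* rdata[19:15] */
    uint32_t i_imm  = rdata >> 20;            /* rdata[31:20] */
    /* J-Type immediate: same bit shuffle as the j_imm assign */
    uint32_t j_imm  = (((rdata >> 31) & 0x1u)   << 20) |
                      (((rdata >> 12) & 0xFFu)  << 12) |
                      (((rdata >> 20) & 0x1u)   << 11) |
                      (((rdata >> 21) & 0x3FFu) << 1);
    /* rs1 mux: only x0 ( zero ) and x5 ( t0 ) exist */
    uint32_t rs1_muxd = (rs1 == 5u) ? cpu->t0 : 0u;
    uint32_t next_pc  = cpu->pc + 4u;         /* Default Behavior */

    if (opcode == 0x13u && funct3 == 0u) {            /* addi */
        if (rd == 5u)
            cpu->t0 = rs1_muxd + (uint32_t)sext(i_imm, 12);
    } else if (opcode == 0x37u) {                     /* lui  */
        if (rd == 5u)
            cpu->t0 = rdata & 0xFFFFF000u;
    } else if (opcode == 0x6Fu && rd == 0u) {         /* j    */
        next_pc = cpu->pc + (uint32_t)sext(j_imm, 21);
    }
    cpu->pc = next_pc;
}
```

Starting from pc = 0 and calling femto_step() seven times leaves t0 at 3 with pc back at 0x00000004, which is the same three increments the ModelSim run shows.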

Lab Assignment : Enhance the femto_cpu.v design to add the RISC-V t1 register. Once this is done, add on to the assembly language program to clear t1 to zero at startup and increment t1 by +=3 every loop cycle. Compile and simulate the program using the on-line simulator. Using the compiled machine code, modify the femto_rom.v Verilog file to add the two new t1 instructions. Simulate the design using ModelSim and observe that t0 is incrementing by +=1 and t1 by +=3 every loop cycle.

That’s a quick introduction to the Femto CPU core for executing three RISC-V assembly instructions. For the next chapter in the series, see “BML Designing RISC-V SoCs with FPGAs” introduction here.

http://blackmesalabs.wordpress.com/?p=3031


BML Designing RISC-V SoCs with FPGAs : Part-RISC-V Assembly Language

kevinhub88 Sep 1, 2025 Updated Sep 1, 2025

2025.09.01 : I’m Kevin Hubbard, Electrical Engineer. I’ve spent the majority of my 30+ year career designing digital logic circuits in ASICs and FPGAs. It’s been an amazing journey that I hope others will pursue. I’m giving back now in writing this “BML Designing RISC-V SoCs with FPGAs” series which starts here. This chapter explains […]


This chapter explains RISC-V assembly language. With Bare-Metal SoC development, a minimal amount of assembly language knowledge is required.

CPUs, for the most part, don’t execute software programming languages like C or even assembly language. CPUs execute machine code, the lowest-level form of software instructions—a sequence of binary or hexadecimal values that a computer’s CPU can execute directly, without any translation or interpretation. Each instruction corresponds to a specific operation, like loading data, performing arithmetic, or jumping to another memory address. Machine code is architecture-specific: the machine code for x86, ARM, or RISC-V will differ because each CPU has its own instruction set.

The simplest RISC-V machine code program that I can think of is one that zeros out the 32bit t0 temporary register within the RISC-V CPU and then loops forever incrementing t0 by one. It’s a forever loop counter that counts from 0 to 4 billion and then rolls over back to 0 (ignoring signed math for the moment). This program only requires three DWORDs and looks like this:

0x00000040 : 0x00000293 # Move zero into t0 register
0x00000044 : 0x00128293 # Add 1 to t0 register
0x00000048 : 0xffdff06f # Jump back 1 DWORD to address 0x00000044

The comments are clear (to me anyways, I wrote them after all), but the machine code itself is quite cryptic. For this reason, Assembly language was invented. Assembly language is a human-readable abstraction of machine code, using mnemonics like MV (Move), ADD (Add), and J (Jump). The exact same program written in RISC-V Assembly language looks like this:

mv t0,zero # Move zero into t0 register
loop: addi t0,t0,1 # Add 1 to t0 register
j loop # Jump back 1 DWORD to address of label “loop”

The “addi t0,t0,1” requires a little bit of explanation. It translates to “add immediate t0 = t0 + 1” where 1 is the “immediate” value, a 12bit signed integer stored along with the opcode in the 32bit instruction. The alternative “add” instruction adds a second register (t1 for example) instead of an immediate. RISC-V is a load/store architecture, so a value stored in memory must first be loaded into a register before it can be added. With the “addi” immediate, there’s no memory and only the t0 register involved.

Breaking down “0x00128293 : addi t0, t0, 1” bit by bit, it helps to view the DWORD in binary nibbles rather than hexadecimal:

              0x0  0x0  0x1  0x2  0x8  0x2  0x9  0x3
0x00128293 : 0000_0000_0001_0010_1000_0010_1001_0011

After that is done, separate the bits per the I-Type instruction format:

imm    : 000000000001
rs1    : 00101
funct3 : 000
rd     : 00101
opcode : 0010011

And then decode the bits. The base opcode 0010011 at D[6:0] means that this is 1 of 8 possible I-Type instructions. The opcode in RISC-V is the foundational 7-bit field that tells the processor what kind of instruction it’s dealing with. It’s the first clue in decoding any instruction. The funct3 of 000 further defines this instruction as the “addi”.

The addi instruction requires a source register (rs1), a destination register (rd), and an immediate value (imm). In addi t0, t0, 1 (i.e., t0 = t0 + 1), both rs1 and rd are set to 00101, which specifies the t0 register. The immediate value is 000000000001, representing 1. Together, these fields form a compact and efficient 32-bit machine code instruction.
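Going the other direction is a good way to check the decode. The following is a minimal C sketch that packs the five I-Type fields back into a 32bit DWORD. The function name encode_itype is my own invention for illustration; the bit positions are the ones listed above.

```c
#include <stdint.h>

/* Pack I-Type fields into one 32-bit instruction word.
 * imm    -> D[31:20]  ( 12-bit signed immediate )
 * rs1    -> D[19:15]  ( source register )
 * funct3 -> D[14:12]  ( function code )
 * rd     -> D[11:7]   ( destination register )
 * opcode -> D[6:0]    ( base opcode )
 */
static uint32_t encode_itype(int32_t imm, uint32_t rs1, uint32_t funct3,
                             uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xFFFu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) <<  7) |
            (opcode & 0x7Fu);
}
```

Calling encode_itype(1, 5, 0, 5, 0x13) returns 0x00128293, the “addi t0,t0,1” from the program, and encode_itype(0, 0, 0, 5, 0x13) returns 0x00000293, the “mv t0,zero”. Register t0 is x5, hence the 5s.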

This one instruction exemplifies the RISC concept. RISC processors are built around a small, highly optimized set of instructions that execute very quickly—often in a single clock cycle. Instead of complex, multi-step instructions (as in CISC: Complex Instruction Set Computer), RISC uses simple instructions that can be pipelined and parallelized more easily.

Lab Assignment : Use the https://cpulator.01xz.net/?sys=rv32 on-line RISC-V simulator to simulate the example assembly code. First compile and load the assembly into machine code and then step into the code, one instruction at a time. Observe the “pc” (Program Counter) register and “t0” change as the machine code executes.

That’s a quick introduction to RISC-V Assembly and Machine language. For the next chapter in the series, see “BML Designing RISC-V SoCs with FPGAs” introduction here.

http://blackmesalabs.wordpress.com/?p=3015


BML Designing RISC-V SoCs with FPGAs : Part-GNU Cross-Compiler

kevinhub88 Aug 31, 2025 Updated Sep 1, 2025

2025.08.31 : I’m Kevin Hubbard, Electrical Engineer. I’ve spent the majority of my 30+ year career designing digital logic circuits in ASICs and FPGAs. It’s been an amazing journey that I hope others will pursue. I’m giving back now in writing this “BML Designing RISC-V SoCs with FPGAs” series which starts here. This chapter explains […]


This chapter explains installing and using the GNU GCC Cross-Compiler tool-chain for compiling bare-metal C code into RISC-V machine code.

At the beginning of time, CPUs were programmed in assembly language. Assembly is fine for small projects, and was easy enough to learn back in the 8-bit 6502 and Z-80 days of my 1970s/1980s youth. Assembly language programming is very fast, but also doesn’t scale well to large software projects. For that, we need C.

C is a general-purpose computer programming language developed in the early 1970s by Dennis Ritchie at Bell Labs. It is known for its efficiency and close-to-hardware capabilities, making it a popular choice for system programming, operating systems, embedded systems, and various applications.

As much as I love CircuitPython on the RP2040, it requires a lot of memory – something that a small FPGA does not have much of. Rust might be an option, but I’ve been writing low-level code in C since 1990, so C it will be. In this chapter, I will demonstrate a simple C program compiled to RISC-V machine language which requires only 20 bytes of memory. You read that right, readable C source to just five DWORDs of machine code. How’s that for hardware efficiency?

To develop in C for the RISC-V CPU we need a compiler that takes in C and generates RISC-V assembly as an output. A cross-compiler is a compiler that runs on one CPU architecture but generates code for a completely different CPU architecture. I don’t know about you, but my desktop runs on an 80×86, not a RISC-V, so I definitely need an 80×86 to RISC-V cross-compiler. GCC is the obvious choice.

The GNU Compiler Collection (GCC) was created by Richard Stallman for the GNU Project and first released in 1987 as the “GNU C Compiler” to provide a free and portable C compiler. Over time, it evolved to support multiple languages, becoming the “GNU Compiler Collection,” with significant updates including C++ support in 1992.

There are two ways to install a GCC cross-compiler. Method-1 is to build it from Source. This involves downloading all of the source files and compiling them. This takes considerable time and effort. Method-2 is installing a Prebuilt Toolchain. I run Ubuntu 22.04 LTS, where installing is as simple as typing this from the command line:

sudo apt install gcc-riscv64-unknown-elf binutils-riscv64-unknown-elf

This process takes just a few minutes ( compared to a few hours for Method-1 ).

A quick test of the install is to ask for the version using the “--version” CLI flag:

khubbard@lambda:~/nas/blackmesa/c/riscv$ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc () 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Now is as good a time as any to explain what an “elf” is, and no, not the elf on a shelf. ELF stands for Executable and Linkable Format, a standard file format used for executables, object code, shared libraries, and core dumps on Unix-like systems. ELF files were introduced as part of System V Release 4 (SVR4) of the Unix operating system in 1989. This marked a major shift toward standardizing binary formats across Unix variants.

Time to compile some C code into RISC-V Assembly. This first program is a bit odd in that it uses a global volatile variable. The “volatile” keyword tells the compiler to not optimize it away ( since the variable is never actually read and used for anything, by default the compiler will strip away the code ). C has no “global” keyword; declaring “cnt” outside of any function, at file scope, is what makes it a global variable ( generally frowned upon unless you are writing BASIC and the year is 1978 ). I’m declaring it as a global as I want to avoid the Stack Pointer ( for now ). I’d prefer to just use a RISC-V temporary register (like “t0”), but GCC won’t let me.

[ counter.c ]
volatile int cnt = 0xAA;
int main()
{
  while (1)
  {
    cnt++;
  }
}

I also need a linker file, “link.ld”, which tells the compiler where to put things in RISC-V memory. I’m deliberately avoiding the complexity of having a mix of Read-Only (Flash) and RAM memory. Instead, all 64 Kbytes will be Read-Write RAM. Booting from ROM will be morning guy’s problem (for now).

[ link.ld ]
ENTRY(_start)

MEMORY {
  RAM (rwx) : ORIGIN = 0x00000040, LENGTH = 64K - 0x40
}

SECTIONS {
  .stack (NOLOAD) : 
    { 
      _stack_end = ORIGIN(RAM) + LENGTH(RAM); 
      _stack_start = _stack_end - 0x1000; /* 4 KB */
    } > RAM
  . = ORIGIN(RAM); /* Reset counter to start of RAM */
  .text :          { *(.text.init) *(.text) }                > RAM
  .rodata :        { *(.rodata) }                            > RAM
  .data :          { *(.data) }                              > RAM
  .bss (NOLOAD) :  { _bss_start = .; *(.bss) _bss_end = .; } > RAM
  .heap (NOLOAD) : { _heap_start = .; }                      > RAM
  PROVIDE(_sp = _stack_end);
}

What this seemingly cryptic linker file does is actually quite important: it defines how memory is allocated when your program is compiled. While a 32-bit CPU might theoretically access up to 4GB of memory, it’s highly unlikely that a $20 SoC FPGA will offer anything close to that. In my example design, I’ve allocated 64KB of RAM starting at address 0x0. The first 0x40 bytes are skipped, so the linker’s RAM region begins at 0x00000040 and runs to the top of the 64KB.
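The address arithmetic in link.ld is easy to double check by hand or in a few lines of C. This is only a sketch of the math the linker performs; the macro and function names are hypothetical, while the numbers come straight from the MEMORY and .stack entries above.

```c
#include <stdint.h>

#define RAM_ORIGIN  0x00000040u               /* ORIGIN = 0x00000040    */
#define RAM_LENGTH  (64u * 1024u - 0x40u)     /* LENGTH = 64K - 0x40    */
#define STACK_SIZE  0x1000u                   /* 4 KB reserved at top   */

/* _stack_end = ORIGIN(RAM) + LENGTH(RAM) */
static uint32_t stack_end(void)
{
    return RAM_ORIGIN + RAM_LENGTH;
}

/* _stack_start = _stack_end - 0x1000 */
static uint32_t stack_start(void)
{
    return stack_end() - STACK_SIZE;
}
```

stack_end() works out to 0x00010000 and stack_start() to 0x0000F000, which are the same _stack_end and _stack_start values that show up in the objdump symbol table later in this chapter.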

Based on the link.ld file, the linker will assign the lowest region of memory to code (machine instructions), followed by any read-only data (think: string literals in print statements), initialized variables, and finally zero-initialized variables. These four regions of .text, .rodata, .data, .bss are static. At compile time they will require a certain amount of memory and that memory requirement won’t change once the code is running.

Next comes “the heap”, the unused memory available for dynamic allocation via malloc. The compiler doesn’t know in advance how much .heap will be required. Finally, the topmost 4KB is statically reserved for “the stack.” The stack pointer starts at the top of RAM and the stack grows downward toward the heap. Why 4KB and not 1KB or 16KB for the stack? That depends entirely on your application’s needs and is definitely one of the challenges of bare-metal embedded development.

Static stack memory allocation comes with an important caveat: be cautious with recursion in bare-metal programming. In systems with only 4KB or 8KB of stack space, deep or uncontrolled recursion can quickly lead to a stack overflow. Unlike hosted environments, there’s no operating system to catch or recover from such a failure. Since the stack is statically allocated, a recursive function that consumes large stack frames or recurses too deeply can collide with the heap or other memory regions—and then it’s “Game Over, man.”
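One common way to stay out of trouble is to rewrite a recursive routine as a loop. The pair below is a made-up illustration (summing an array), not code from the example design. The recursive version needs one stack frame per element unless the compiler happens to optimize the recursion away, while the loop uses the same small amount of stack no matter how long the array is.

```c
#include <stdint.h>

/* Recursive sum: stack depth grows with count. Risky with a 4KB stack. */
static uint32_t sum_recursive(const uint32_t *data, uint32_t count)
{
    if (count == 0u) {
        return 0u;
    }
    return data[0] + sum_recursive(data + 1, count - 1u);
}

/* Iterative sum: constant stack use regardless of count. */
static uint32_t sum_iterative(const uint32_t *data, uint32_t count)
{
    uint32_t total = 0u;
    for (uint32_t i = 0u; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same answer. The difference only shows up when count gets large and the stack is small.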

Compiling to assembly is now just a single command ( details on CLI flag options can be found here ):

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -T link.ld -S counter.c -o counter.s

Which spits out the Assembly code “counter.s”. I’ve culled some overhead info for brevity. Comments added are mine:

[ counter.s ]
main:                 # Label main(), start of the program
 lui a4,%hi(cnt)      # Load MSB 20 of cnt addr to a4 
.L2:                  # Label L2
 lw a5,%lo(cnt)(a4)   # Load a5 with value at cnt's 20+12 addr
 addi a5,a5,1         # Increment a5 by 1
 sw a5,%lo(cnt)(a4)   # Store a5 into cnt's 20+12 addr
 j .L2                # Jump to Label L2
cnt:                  # Declare global variable cnt 
 .word 0x000000AA     # Initialize cnt to 0x000000AA

Note how RISC-V has a 20+12 bit addressing mechanism: LUI (Load Upper Immediate) supplies the upper 20 bits and the 12-bit immediate of a following instruction (here the lw/sw offset, or an ADDI) supplies the rest. LUI by itself produces an absolute address. The same 20/12 split paired with AUIPC (Add Upper Immediate to PC) is what enables PC-relative addressing, an essential feature for generating relocatable code.
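The %hi() and %lo() operators in counter.s split a 32bit address into those 20 and 12 bit pieces. Here is a small C sketch of the split; hi20 and lo12 are names I made up. The one subtlety is that the lower 12 bits are a signed value, so the upper part gets rounded up by 0x800 to compensate.

```c
#include <stdint.h>

/* Upper 20 bits, rounded so that the signed low part lands in -2048..2047. */
static uint32_t hi20(uint32_t addr)
{
    return (addr + 0x800u) >> 12;
}

/* Signed 12-bit remainder: addr - ( hi20 << 12 ). */
static int32_t lo12(uint32_t addr)
{
    return (int32_t)(addr - (hi20(addr) << 12));
}
```

For cnt at address 0x00000050 this gives hi20 = 0 and lo12 = 80, which matches the “lw a5,80(zero)” in the final disassembly. For a hypothetical address of 0x10000FFC the split is hi20 = 0x10001 and lo12 = -4, and ( hi20 << 12 ) + lo12 gets back to the original address.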

Relocatable code refers to machine instructions that can be loaded and executed from different memory addresses without requiring changes to the code itself. While not essential in bare-metal programming, it remains a foundational concept in systems programming, linking, and operating systems.

For example, when you launch a program like NOTEPAD.EXE, it doesn’t always execute from the same memory location. Instead, it loads into the next available region of memory. As a result, all memory references within the compiled NOTEPAD.EXE executable—such as subroutine jumps and variable storage—must be relative to the program counter at launch, rather than hardcoded to an absolute base address like 0x00000000. For bare-metal programming, relocatable code doesn’t matter so much. RISC-V runs Linux though, so it’s a very important feature to have built into the hardware.

All of that overhead info from counter.s that I culled for brevity? I better go ahead and explain it all since it matters.

[ counter.s ]
 .file "counter.c"                     # Identify source for debugger
 .option nopic                         # NOT Position Independent Code
 .attribute arch, "rv32i2p0"           # Target arch RISC-V
 .attribute unaligned_access, 0        # Unaligned Mem access not allowed
 .attribute stack_align, 16            # Stack 16 byte aligned (ABI compliant)
 .text                                 # Code Section start
 .section .text.startup,"ax",@progbits #
 .align 2                              # Align 4-byte (RV32 reqd)
 .globl main                           # main() is global symbol and func
 .type main, @function                 # 
.. assembly code section here ..          
 .size main, .-main                    # Calculate size of main()
 .globl cnt                            # Declare cnt global in .sdata
 .section .sdata,"aw"                  #
 .align 2                              # Align cnt in 4-byte boundaries
 .type cnt, @object                    # Declare cnt as 4-byte object
 .size cnt, 4                           #

Now that we have examined the GCC generated Assembly code and have confidence it knows what it is doing, the next step is to compile C straight to *.ELF.

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -nostartfiles -nostdlib -ffreestanding -T link.ld -o counter.elf counter.c

ELF files are binary files and have a lot of info in them and aren’t suitable for display here. What I’d like to share is the actual machine code in hex that will go into RISC-V memory for execution. Thankfully GCC makes it easy to convert from *.ELF to *.BIN ( binary file ).

riscv64-unknown-elf-objcopy -O binary counter.elf counter.bin
hexdump counter.bin > counter.hex
more counter.hex

Which then spits out counter.hex to STDOUT:

[ counter.hex ]
0000000 2783 0500 8793 0017 2823 04f0 f06f ff5f
0000010 00aa 0000

The utility “hexdump” which is included with Linux displays data in 16bit WORD chunks even though we live (mostly) in a 32bit DWORD world. Thankfully this counter.c is small enough that I can transpose things around in my head and create a DWORD memory map:

0000040 : 05002783
0000044 : 00178793
0000048 : 04f02823
000004C : ff5ff06f
0000050 : 000000aa
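For anything bigger than five DWORDs, the transposing is better left to software. A minimal C sketch, assuming the 16bit values are read left to right exactly as hexdump printed them (the function name words_to_dwords is hypothetical): each pair of 16bit WORDs becomes one DWORD with the second WORD in the upper half.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine pairs of 16-bit hexdump words into 32-bit DWORDs.
 * words[0] is the low half and words[1] the high half of dwords[0], etc.
 * Returns the number of DWORDs written; an odd trailing word is dropped. */
static size_t words_to_dwords(const uint16_t *words, size_t n_words,
                              uint32_t *dwords)
{
    size_t n_dwords = n_words / 2u;
    for (size_t i = 0u; i < n_dwords; i++) {
        dwords[i] = ((uint32_t)words[2u * i + 1u] << 16) | words[2u * i];
    }
    return n_dwords;
}
```

Feeding it the ten values from counter.hex ( 2783 0500 8793 0017 ... ) produces the same five DWORDs as the memory map above.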

And that’s the entire program. Four DWORDs of instructions (.text) and one DWORD for the variable “cnt” (.data). There’s no stack and no heap, so RAM footprint is tiny.

For a linker check, we can dump the symbols to STDOUT and confirm everything is where we expected it to be:

%riscv64-unknown-elf-objdump -t -d counter.elf
counter.elf:     file format elf32-littleriscv
SYMBOL TABLE:
00000040 l    d  .text.startup	00000000 .text.startup
00000050 l    d  .sdata	00000000 .sdata
00000000 l    d  .comment	00000000 .comment
00000000 l    d  .riscv.attributes	00000000 .riscv.attributes
00000000 l    df *ABS*	00000000 counter2.c
00000054 g       .sdata	00000000 _bss_start
00000050 g     O .sdata	00000004 cnt
00000054 g       .sdata	00000000 _bss_end
00000054 g       .sdata	00000000 _heap_start
00000000         *UND*	00000000 _start
00000040 g     F .text.startup	00000010 main
00010000 g       .text.startup	00000000 _stack_end
0000f000 g       .text.startup	00000000 _stack_start

Disassembly of section .text.startup:
00000040 <main>:
  40:	05002783          	lw	a5,80(zero) # 50 <cnt>
  44:	00178793          	addi	a5,a5,1
  48:	04f02823          	sw	a5,80(zero) # 50 <cnt>
  4c:	ff5ff06f          	j	40 <main>

As a final check, I will copy and paste the assembly code from “counter.s” into an online RISC-V simulator ( https://cpulator.01xz.net/?sys=rv32 )

And click “Compile and Load”. What is noteworthy is the code gets offset from 0x00000040 to 0x00000000. The RISC-V Hazard3 core I will be using in hardware has a reset pointer to 0x00000040. This on-line simulator wants to start at 0x00000000 for some reason. Not a big deal, I just need to keep track of the difference ( 0x40 ). I’m sure there’s a way to fix this, I will try to circle back later :

Clicking “Step Into” will then execute the 3 lines of assembly one after the other. Clicking the “Memory” tab will show the incremented “a5” register content of 0x000000ab getting stored into memory at address 0x00000010 just as anticipated. Cool huh?

Time for a dog backyard break. I will add more to this chapter later….. bml_khubbard 2025.08.31

I will close out this tool-chain chapter with a short example of make and Makefile. A Makefile is a build automation script used by the make utility to compile, link, and manage software projects, especially those with multiple source files and dependencies. It’s like a recipe book for building your code efficiently and intelligently. If you have 100 C source files in your design, make will only compile the files you have changed since your last compile instead of recompiling all 100. I’m not really a fan of make. I hate the fact that it requires <TAB> be used for the indent. In Makefile syntax, every command in a recipe must begin with a literal tab character. Not spaces, but an actual ASCII 0x09 TAB character. It’s maddening.

[ Makefile ]
RISCV_GCC     = riscv64-unknown-elf-gcc
RISCV_OBJDUMP = riscv64-unknown-elf-objdump
RISCV_OBJCOPY = riscv64-unknown-elf-objcopy

CFLAGS = -march=rv32im -mabi=ilp32 -O2 -nostartfiles -nostdlib -ffreestanding -Wall
LDFLAGS = -T link.ld

all: counter.elf

counter.elf: counter.c link.ld
	$(RISCV_GCC) $(CFLAGS) $(LDFLAGS) counter.c -o $@
	$(RISCV_OBJCOPY) -O binary $@ counter.bin

clean:
	rm -f counter.elf counter.bin

A “make clean” will erase a previous build. A “make all” will rebuild the design. That’s make in a nutshell. Personally, I prefer Linux shell scripts.

That ends this chapter on the GNU GCC cross-compiler tool-chain for RISC-V. For the next chapter in the series, see “BML Designing RISC-V SoCs with FPGAs” introduction here.

http://blackmesalabs.wordpress.com/?p=2947


BML Designing RISC-V SoCs with FPGAs : Part-Intro

kevinhub88 Aug 31, 2025 Updated Sep 14, 2025


Table of Contents:
Part-1 : What are SoCs?
Part-2 : History of Modern Computer Architecture
Part-3 : RISC-V Assembly Language
Part-4 : RISC-V and the Femto core
Part-5 : Femto-CPU Memory Access
Part-6 : Femto-CPU Blinky in C
Part-7 : GNU cross-compiler
Part-8 : RISC-V and the KianV core
Part-9 : RISC-V and the Hazard3 core
Part-10 : Segger JTAG Debugger
Part-11 : Example Design – LED Blinky
Part-12 : Example Design – UART Communications
Part-13 : Example Design – SPI Communications
Part-14 : Example Design – PWM Servo Controller
Part-15 : Example Design – VGA Graphics Controller
Part-16 : ? FreeRTOS ?

2025.08.31 : I’m Kevin Hubbard, BSEE. I’ve been designing with CPUs and digital logic for more than four decades. I got my start as a 1980s Radio Shack kid, scrounging dollar bills to buy the latest TTL logic chips in DIP packages—like the 7400 series: 7474, 74244, 74245, 74373, 74374, and others. I built little digital interfaces for my 8-bit 6502 Apple ][+ and Z80 TRS-80 Model I computers of that era. Everything back then ran at 1 MHz and 5V—very forgiving to breadboards, long wires, and missing bypass caps. Bypass caps? What were those for?

My hobbyist passion for electronics helped me survive high school and earn a BSEE degree in the early 1990s from the University of Washington. With zero engineers in the family, I just knew this was what I was meant to do. As a kid, my career ambitions were either electronics or becoming a professional LEGO builder. And get this—after graduation, companies actually paid me to design digital PCBs, FPGAs, and ASICs in the electronics industry. Still pinching myself over that. I still enjoy building with LEGOs too—just not professionally (yet).

More than 30 years later, my career isn’t quite sunsetting—but I’ve decided it’s a great time to start sharing some of the knowledge I’ve gathered along my journey in electronics. I spent a decade working on embedded designs with the HC11, 68K, and 80×86 platforms, doing both circuit board design and software development, before shifting to purely digital logic targeting ASICs and FPGAs. I’ve missed my embedded C roots from the 1990s.

Drawing on my experience, I’m planning a multi-part (1 of N) tutorial on RISC-V System-on-Chip (SoC) design. This open-source blog will be a quick and casual free-write—likely full of rough sentence structure and the occasional spelling mistake. My long-term goal is to refine the series into a 300-page manuscript and eventually publish it, just as I did with my 2024/2025 blog-to-book Mastering FPGA Chip Design: For Speed, Area, Power, and Reliability, which is available in both print and PDF formats.

Fingers crossed I make it that far on this RISC-V SoC journey. Everyone is more than welcome to follow along and provide feedback and chapter ideas. I may be reached @ bml (underbar) khubbard on X (Twitter).

http://blackmesalabs.wordpress.com/?p=2940


BML FPGA Design Tutorial Part-16ofN : Test Benches

kevinhub88 Sep 29, 2024 Updated May 27, 2025


2024.09.29 : I’m BSEE Kevin Hubbard from Seattle, WA. I design digital logic circuits. My journey started more than 40 years ago designing digital circuit boards. My first was a simple 5V TTL logic plug-in expansion board for my 8-bit Apple ][+ computer as a teenager in the 1980’s. In the 1990’s I eventually transitioned ( and got paid! ) to design chips – PALs, CPLDs, early sub-micron (350nm) FPGAs and eventually digital ASICs (250nm, 180nm ) during the LSI Logic heyday years of late 1990’s to early 2000’s.

I have circled back to FPGAs these days, but now deep sub-micron FPGAs. Digital logic gate arrays with billions of transistors just waiting for me to configure them. These modern FPGAs are easily 10x more complex than the sub-micron digital ASICs I designed decades ago. Quoting Mark Watney (Matt Damon) from “The Martian (2015)“, “I love what I do, and I’m really good at it.”

I would love to do this ( design digital logic chips ) forever, but we all have to die sometime. “I’ve seen things you people wouldn’t believe. All those moments will be lost in time, like tears in rain.” – Roy Batty. While I am still in the thick of it, I have decided to give back now with this FPGA design series. Every blog chapter is an early sneak preview from my planned 2025 book, “Mastering FPGA Chip Design : For Speed, Area, Power, and Reliability”. The (free) web blog series begins here.

The previous chapter was all about logic simulation. This chapter is all about test benches. So what is a “test bench” exactly?

A test bench is a controlled environment used to verify the correctness, performance, and reliability of a design or model, often in the context of electronic or software systems. It simulates real-world scenarios to evaluate how a system behaves before it is deployed.
In digital design, particularly with hardware description languages like Verilog or VHDL, a test bench is a piece of code that instantiates the device under test (DUT) and applies test vectors to it. This helps in verifying that the DUT functions as expected under various conditions.
ChatGPT-4

Not bad ChatGPT-4, not bad at all.

A test bench may be used to verify that an RTL module ( or HDL netlist ) actually does what it is supposed to do. Test benches are typically more advanced than a simple “do file”. A “do file” only provides stimulus while a test bench may provide stimulus while also checking output results against expected results. When I think of test benches, I put them into three different categories of increasing levels of complexity.

Test Bench Complexity Levels :

Stimulus only with VCD Output
Test Vector Stimulus and Capture
Stimulus with Output Analysis

Stimulus only

The simplest type of test bench is essentially a ModelSim “DO File” but written in your favorite HDL ( VHDL, Verilog, SystemVerilog ). As an example design, this chapter will use a simple 4-bit loadable binary counter. In Verilog:

[ counter.v ]
`timescale 1 ns/ 100 ps
`default_nettype none // Strictly enforce all nets to be declared

module counter
(
  input  wire       reset,
  input  wire       clk,
  input  wire       load,
  input  wire [3:0] din,
  output wire [3:0] dout
);// module counter

  reg  [3:0]  my_cnt = 4'd0;

always @ ( posedge clk ) begin
  if ( reset == 1 ) begin
    my_cnt <= 4'd0;
  end else if ( load == 1 ) begin
    my_cnt <= din[3:0];
  end else begin
    my_cnt <= my_cnt[3:0] + 1;
  end
end
  assign dout = my_cnt[3:0];

endmodule // counter.v
`default_nettype wire // enable Verilog default for any 3rd party IP needing it

and in VHDL:

[ counter.vhd ]
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity counter is
port
(
  reset  : in  std_logic;
  clk    : in  std_logic;
  load   : in  std_logic;
  din    : in  std_logic_vector(3 downto 0);
  dout   : out std_logic_vector(3 downto 0)
);
end counter;

architecture rtl of counter is

  signal my_cnt : std_logic_vector(3 downto 0) := X"0";

begin

process ( clk )
begin
 if ( clk'event and clk = '1' ) then
   if ( reset = '1' ) then
     my_cnt <= X"0";
   else
     if ( load = '1' ) then
       my_cnt <= din(3 downto 0);
     else
       my_cnt <= my_cnt(3 downto 0) + '1';
     end if;
   end if;
 end if;
end process;
  dout <= my_cnt(3 downto 0);

end rtl;

A ModelSim “DO File” to provide stimulus to the counter might look like this:

[ force_counter.do ]
force clk 0 5 ns, 1 10 ns -repeat 10 ns
force reset 1
force load  0
force din   16#0
run 10 ns
force reset 0
run 40 ns
force load 1; force din 16#A; run 10 ns;
force load 0; force din 16#0; run 10 ns;
run 40 ns

When the simulation runs, you get the above expected behavior from the provided stimulus.

The equivalent test bench in Verilog looks like this:

[ tb_counter.v ]
`default_nettype none // Strictly enforce all nets to be declared
`timescale 1 ns/ 100 ps

module tb_counter
(
); // module tb_counter

  reg        reset;
  reg        clk;
  reg        load;
  reg  [3:0] din;
  wire [3:0] dout;

initial
begin
  clk <= 0;
  #5 forever
    #5 clk <= ~clk;
end

initial
begin
  #1
  reset <= 1; load <= 0; din <= 4'h0; #10
  reset <= 0; #40
  load  <= 1; din <= 4'hA; #10
  load  <= 0; din <= 4'h0; #40
  $finish;
end

counter u_counter
(
  .reset  ( reset     ),   
  .clk    ( clk       ),   
  .load   ( load      ),   
  .din    ( din[3:0]  ),   
  .dout   ( dout[3:0] )    
);

endmodule // tb_counter
`default_nettype wire // enable Verilog default for any 3rd party IP needing it

If this Verilog test bench seems like a lot of typing compared to a ModelSim “DO File” – it is. Compared to the VHDL equivalent version – it’s short.

[ tb_counter.vhd ]
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity tb_counter is
end tb_counter;

architecture behav of tb_counter is

component counter
port
(
  reset : in  std_logic;
  clk   : in  std_logic;
  load  : in  std_logic;
  din   : in  std_logic_vector(3 downto 0);
  dout  : out std_logic_vector(3 downto 0)
);
end component;

  signal reset : std_logic;
  signal clk   : std_logic;
  signal load  : std_logic;
  signal din   : std_logic_vector(3 downto 0);
  signal dout  : std_logic_vector(3 downto 0);

begin

process
begin
  clk <= '1'; wait for 5 ns;
  clk <= '0'; wait for 5 ns;
end process;

process
begin
  wait for 1 ns;
  reset <= '1'; load <= '0'; din <= X"0"; wait for 10 ns;
  reset <= '0'; wait for 40 ns;
  load  <= '1'; din <= X"A"; wait for 10 ns;
  load  <= '0'; din <= X"0"; wait for 40 ns;
  assert ( FALSE )
    report ("Simulation Done" )
    severity failure;
end process;

u_counter : counter
port map
(
  reset => reset,
  clk   => clk,
  load  => load,
  din   => din(3 downto 0),
  dout  => dout(3 downto 0)
);

end behav;

The length of even just a simple test bench is one of the fundamental problems with test benches. They require a lot of typing and are also a language within a language. Behavioral Verilog and VHDL are nothing like RTL in HDLs. For the most part, behavioral test benches are sequential rather than concurrent in execution. They also introduce new constructs, things like transport delays, tasks and functions.

For these reasons, I primarily create ModelSim “DO” files for stimulating new modules under development. They are much faster to write and also to modify. Quick modifications can be essential for thoroughly testing RTL designs under development. As an example, for designs including FIFOs it is very important to sweep a stimulus from one clock edge to the next.

So why bother with test benches at all? Test benches can do so much more than just provide stimulus to an HDL design.

Test Vector Stimulus and Result Capture

The simplest add-on to a Verilog test bench is to generate a VCD output file. By adding just these four lines of Verilog, a VCD viewer like GTKWave may be used to observe the simulated waveforms. Now the test bench both provides stimulus AND captures results in the form of a VCD waveform capture.

initial begin
  $dumpfile("tb_counter.vcd");// VCD file for GTKwave
  $dumpvars(1, tb_counter.u_counter );// 1=this, 0=hier
end

Exporting a VCD is crucial for command line only simulators like IcarusVerilog. A VCD also provides for a design history record of the simulation. Unfortunately VHDL test benches can not generate VCD files ( it’s a Verilog / SystemVerilog specific thing ). As luck would have it, ModelSim makes it easy to generate a VCD from your simulation with just two easy commands. You specify the VCD file name and then you specify the names of the signals to add to the VCD file.

vsim tb_counter
vcd file tb_counter.vcd
vcd add tb_counter/u_counter/*
run 1 us
quit

VCD files are great for waveform viewing, but are quite cryptic for data analysis. Thankfully both Verilog and VHDL make it easy to store signal values as clear text files that are readable by both humans and software programs. An example in Verilog:

integer file_ptr;
initial begin
  file_ptr = $fopen("dump.txt", "w" );
end

always @ ( posedge clk ) begin
  $fdisplay(file_ptr,"%S = %01X", "DOUT", dout);
end

and in VHDL:

use ieee.std_logic_textio.all;
library std ;
use std.textio.all;

process
  file file_ptr : text is out "dump.txt";
  variable text_line : Line;
begin
  wait until ( clk'event and clk = '1' );
    text_line := null;
    write ( text_line, string'("DOUT = "));
    hwrite( text_line, dout );
    writeline ( file_ptr, text_line );
end process;

The text output looks identical for both:

[ dump.txt ]
DOUT = 0
DOUT = 0
DOUT = 1
DOUT = 2
DOUT = 3
DOUT = 4
DOUT = A
DOUT = B
DOUT = C
DOUT = D

Cool, huh? And of course the reverse is true. Stimulus may be provided in a clear text file instead of writing cryptic behavioral HDL code. An example stimulus.txt file:

[ stimulus.txt ]
1 0 0
0 0 0
0 0 0
0 0 0
0 1 A
0 0 0
0 0 0
0 0 0

These stimulus files are called test vectors.

A test vector is a set of inputs provided to a system to test its functionality and behavior. These inputs are used to verify that the system operates correctly under various conditions.
ChatGPT-4
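Test vector files like this are easy to generate from a script. Below is a minimal Python sketch that rebuilds the stimulus.txt example above, assuming the column order is reset, load, din as in that listing. The function name and its parameters are made up for illustration.

```python
# Sketch: build the fixed-width hex test vectors shown in stimulus.txt.
# Column order ( reset, load, din ) follows that example. The function
# and parameter names are hypothetical, chosen just for this sketch.

def make_vectors(idle_before_load=3, load_value=0xA, idle_after_load=3):
    """Return a list of 'reset load din' lines, one per clock cycle."""
    lines = ["1 0 0"]                              # one clock of reset
    lines += ["0 0 0"] * idle_before_load          # let the counter count
    lines.append("0 1 %X" % (load_value & 0xF))    # load a 4 bit value
    lines += ["0 0 0"] * idle_after_load           # count up from the load
    return lines

vectors = make_vectors()
print("\n".join(vectors))
```

Redirecting the printed output to stimulus.txt gives a file in the same format as the hand-typed one. Changing an argument or two is all it takes to produce a new variation.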

Writing Verilog or VHDL to parse text files is non-trivial, but possible. The trick I have discovered over the years is to keep the file format VERY simple. As an example, stuffing test vector files with hexadecimal data that is fixed width. Below are Verilog and VHDL examples for controlling the reset and load features of the 4 bit test counter given a hexadecimal stimulus file. An example in Verilog:

integer file_in_ptr;
initial begin
 file_in_ptr = $fopen("stimulus.txt", "r");
end

reg [8*80:0] txt_in_line;

always @(posedge clk) begin
  if ($feof(file_in_ptr)) begin
    $fclose(file_in_ptr);
    $finish; 
  end else begin
    $fgets(  txt_in_line , file_in_ptr );
    $sscanf( txt_in_line ,"%1x %1x %1x", reset, load, din );
  end
end

An example in VHDL:

process
  file file_in_ptr      : text is in "stimulus.txt";
  variable text_in_line : string(1 to 80);
  variable text2_line   : Line;
  variable line_len     : integer;
  variable parse_word   : string(1 to 80 );
  variable reset_var    : std_logic_vector(3 downto 0);
  variable load_var     : std_logic_vector(3 downto 0);
  variable din_var      : std_logic_vector(3 downto 0);
begin
 while not ( endfile( file_in_ptr ) ) loop
  read( file_in_ptr, text_in_line, line_len );
  wait until ( clk'event and clk = '1' );
    text2_line := null;
    parse_word(1 to 2) := text_in_line(1) & NUL;
    write( text2_line, parse_word );
    hread( text2_line, reset_var );

    text2_line := null;
    parse_word(1 to 2) := text_in_line(3) & NUL;
    write( text2_line, parse_word );
    hread( text2_line, load_var );

    text2_line := null;
    parse_word(1 to 2) := text_in_line(5) & NUL;
    write( text2_line, parse_word );
    hread( text2_line, din_var );

    reset <= reset_var(0);
    load  <= load_var(0);
    din   <= din_var(3 downto 0);
 end loop;
 assert ( FALSE )
   report ("Simulation Done" )
   severity failure;
end process;

What should be clear from the above is that VHDL is not that great at dealing with strings. Dreadful as it is, it is still possible. Both Verilog and VHDL test benches are able to read in and write out flat text files that interface directly with RTL signals under simulation. This ability is the key to the castle for having high level test benches written in software. High level software languages like Python can’t interact directly with RTL signals under simulation, but they can certainly read and write text files and use a behavioral HDL test bench as the go between.
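As a rough illustration of the software side of that go between, here is a minimal Python sketch that reads the "DOUT = X" lines from the dump.txt example and flags every sample where the counter did not advance by one. DUMP_TEXT is a copy of that listing; the function names are hypothetical and a real flow would read the file written by the simulator instead.

```python
# Sketch: post-process the "DOUT = X" lines captured by the test bench.
# DUMP_TEXT is a copy of the dump.txt listing. In a real flow the text
# would come from open("dump.txt").read() after the simulator exits.

DUMP_TEXT = """\
DOUT = 0
DOUT = 0
DOUT = 1
DOUT = 2
DOUT = 3
DOUT = 4
DOUT = A
DOUT = B
DOUT = C
DOUT = D
"""

def parse_dump(text):
    """Turn 'NAME = hex' lines into a list of integers."""
    values = []
    for line in text.splitlines():
        if "=" in line:
            values.append(int(line.split("=")[1].strip(), 16))
    return values

def find_breaks(values, width=4):
    """Return the sample indexes where the count did not advance by one."""
    mask = (1 << width) - 1
    return [i for i in range(1, len(values))
            if values[i] != ((values[i - 1] + 1) & mask)]

samples = parse_dump(DUMP_TEXT)
breaks = find_breaks(samples)
print("samples:", samples)
print("breaks at:", breaks)
```

For this capture the two flagged samples line up with the reset ( counter held at 0 ) and the load of 0xA, which are the only places the counter is expected to do something other than count.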

As an example, decades ago I wrote a Perl+VHDL test bench for testing a dual-clock FIFO for a PCI interface in an ASIC. The Perl program created a transfer payload and then split it up into random length segments in the form of a text file that pushed to the FIFO at random times with random lengths. Perl launched ModelSim with a simple VHDL test bench which just read in the text file and created an output text file of the FIFO pops. The Perl program then compared the results and iterated. The UNIX/Linux environment makes it easy to launch applications like ModelSim from within a program and wait for an outcome. This test setup would autonomously run for days looking for any potential problems.
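The core of a flow like that can be sketched in a few lines. The Python below is not the original Perl program, just an illustration of the idea: split a payload into random length segments, then compare what comes back. The simulator launch is left out and replaced by a stand-in, and every name here is hypothetical.

```python
import random

def split_payload(payload, max_len, rng):
    """Split payload into segments of random length 1..max_len."""
    segments = []
    pos = 0
    while pos < len(payload):
        n = rng.randint(1, max_len)
        segments.append(payload[pos:pos + n])
        pos += n
    return segments

def check_pops(payload, pops):
    """True when the data popped from the FIFO matches what was pushed."""
    return list(payload) == list(pops)

rng = random.Random(1234)          # fixed seed so a failing run can be replayed
payload = [rng.randrange(256) for _ in range(64)]
segments = split_payload(payload, max_len=8, rng=rng)

# Stand-in for the simulator run: a correct FIFO returns the data unchanged.
pops = [byte for seg in segments for byte in seg]
print(len(segments), "segments, match =", check_pops(payload, pops))
```

Seeding the random number generator is a deliberate choice. When a run that has been iterating for days finally trips over a bug, the seed is what lets you regenerate the exact same stimulus.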

One issue with test vectors is generation of extremely large files. A typical test vector might contain all the signal stimulus for a clock cycle on a single line. This is fine for simulating ten or a hundred clock cycles – but what about a million clock cycles or more? For this, adding a “wait” type command might be beneficial. Do a bunch of stimulus, then wait some pre-determined time before doing more. The below Verilog enhances the original design to support both a “wait n” command and also comment lines ( which are ignored completely ). Example in Verilog:

reg [8*80:0] txt_in_line;
reg [8*16:0] my_word;
reg [15:0]   wait_time;
integer j;

always @(posedge clk) begin
  if ($feof(file_in_ptr)) begin
    $fclose(file_in_ptr);
    $finish;
  end else begin
    $fgets(  txt_in_line , file_in_ptr ); // Grab entire Line
    $sscanf( txt_in_line ,"%s %d", my_word , wait_time );
    if ( my_word == "wait" ) begin
      $display("Waiting...%d", wait_time );
      for ( j = 0; j < wait_time; j=j+1 ) begin
        @( posedge clk );
      end // for j
    end else if ( my_word != "#" ) begin
      $sscanf( txt_in_line ,"%1x %1x %1x", reset, load, din );
    end
  end
end

Example stimulus.txt file:

[ stimulus.txt ]
# This is my stimulus file
1 0 0
0 0 0
0 0 0
0 0 0
# Waiting 10 clocks before load
wait 10
0 1 A
0 0 0
0 0 0
0 0 0
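Once a stimulus format grows commands and comments, it can be handy to sanity check a file in software before handing it to the simulator. The Python sketch below mirrors the same three-way decision as the Verilog ( comment, wait, or vector ) on a copy of the file above. The function name is made up, and unlike the Verilog it treats any line starting with "#" as a comment.

```python
STIMULUS_TEXT = """\
# This is my stimulus file
1 0 0
0 0 0
0 0 0
0 0 0
# Waiting 10 clocks before load
wait 10
0 1 A
0 0 0
0 0 0
0 0 0
"""

def parse_stimulus(text):
    """Return a list of ('wait', n) and ('vector', reset, load, din) tuples."""
    commands = []
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue                                  # blank or comment line
        if words[0] == "wait":
            commands.append(("wait", int(words[1])))  # decimal clock count
        else:
            reset, load, din = (int(w, 16) for w in words[:3])
            commands.append(("vector", reset, load, din))
    return commands

commands = parse_stimulus(STIMULUS_TEXT)
for cmd in commands:
    print(cmd)
```

A malformed line raises a Python exception right away, which is a lot friendlier than discovering it halfway into a long simulation.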

Stimulus with Output Analysis

For those that don’t want to outsource to a high level software programming language ( like Python ), HDL based test benches can also internally check for correct behavior of a module under test. A simple example is a Verilog test bench that checks for a certain value at a certain time and halts the simulation on a bad result. Example in Verilog:

always @ ( posedge clk ) begin
  if ( $time == 70 ) begin
    if ( dout == 4'hA ) begin
      $display("Load Successful!");
    end else begin
      $display("Load Failed!");
      $stop;
    end
  end
end

Test benches may also be split up into multiple files with functions and tasks. Below is a simple function example that takes in two inputs and returns a result.

[ tb_counter.v ]
`include "func_check_counting.v"
always @ ( posedge clk ) begin
  dout_p1 <= dout[3:0];
  if ( func_check_counting( dout, dout_p1 ) == 0 ) begin
    $display("Counter stopped counting at T = %d", $time );
  end
end

[ func_check_counting.v ]
function func_check_counting;
  input [3:0] data_new;
  input [3:0] data_old;
  if ( data_new == data_old + 4'd1 ) begin // 4'd1 keeps the add 4 bits wide so 0xF rolls over to 0x0
    func_check_counting = 1;
  end else begin
    func_check_counting = 0;
  end
endfunction

Functions can take in multiple inputs, but are only capable of returning a single output. Tasks are similar to functions, but may include time delays and perform sequential operations. The following task is an example of how tasks can incorporate delays and also access signals outside the scope of the task itself.

[ tb_counter.v ]
`include "task_check_counting.v"
always @ ( posedge clk ) begin
  dout_p1 <= dout[3:0];
  task_check_counting( dout, dout_p1 );
end

[ task_check_counting.v ]
task task_check_counting;
  input [3:0] data_new;
  input [3:0] data_old;
  if ( data_new != data_old + 4'd1 ) begin // 4'd1 keeps the add 4 bits wide so 0xF rolls over to 0x0
    reset <= 1;
    #10
    reset <= 0;
    #40
    ;
  end
endtask

My favorite and almost undocumented feature of Verilog test benches is the ability to `include external Verilog files inline, right in the middle of a procedural block. What does that mean? The compiler pastes the external file in at the point of the `include, so a test bench in Verilog can execute the contents of that file only when some signal condition is true. Here’s a simple example.

[ tb_counter.v ]
always @ ( posedge clk ) begin
  if ( dout[3:0] == 4'hA ) begin
    `include "inline_a.v"
  end
end

[ inline_a.v ]
begin
 $display("The time is %d", $time );
 $display("counter is at 0xA" );
end

This ability to pull external files inline into a running test bench might seem trivial, but it is incredibly powerful. A sequential Verilog test bench need not be one huge mess of a file, but can be broken down into multiple smaller files.

This short chapter on test benches only scratches the surface on the subject. Entire books can ( and should ) be written on test bench creation for digital chip designs. Writing test benches in Verilog and VHDL can be painful as they are very rudimentary software constructs that were added to otherwise fantastic RTL languages. Writing VHDL and Verilog for text file processing is especially bad; it reminds me of Fortran 77. I really like my flow of using a high level programming language ( like Python ) that creates test vectors, launches a simulation and then analyzes captured results.

If writing advanced test benches is your goal, I highly recommend looking into SystemVerilog ( standardized as IEEE 1800 ). SystemVerilog makes multiple enhancements to Verilog for writing more advanced test benches. If you are writing VHDL RTL, you might consider using a Verilog test bench around it, if for nothing else, the VCD export capability alone.

EOF

http://blackmesalabs.wordpress.com/?p=2893


BML FPGA Design Tutorial Part-15ofN : Digital Logic Simulation

kevinhub88 Sep 9, 2024 Updated Sep 17, 2024


2024.09.08 : I’m BSEE Kevin Hubbard from Seattle, WA. I design digital logic chips that are fast, small, low power, and reliable. That’s actually the working title for my upcoming book, “FPGA Chip Design for Speed, Area, Power, and Reliability”. I’ve been doing this ( digital chip design ) for more than 30 years and decided it’s high time that I give back some of my experience to the next generation of digital chip designers that are up and coming. This is Part-15 of my “Getting started with FPGAs” series which starts here. It is an early sneak preview for a chapter on Digital Logic Simulators that will be in my book in 2025.

My First Digital Logic Designs
My early digital logic designs weren’t chips, but circuit boards filled with discrete 7400 series latches and gates. The first design was an SRAM based ROM emulator for a medical blood gas analyzer. The second was this massive test fixture for an 80x86 motherboard to be used in emergency vehicles. The test fixture was rows and rows of 7400 DIP logic chips that looked a bit like Woz’s first Apple computer design.

Both boards were controlled over a simple Centronics parallel printer port designed in 1970. The heart of these boards was a custom “port expander” that I designed on paper and in my head. I worked out, “when this signal does this, the latch will do this and the 74138 will decode this to this”. It was all very time consuming and limiting in the number of gates I could use and hope to make work on my first PCB board pass. I asked myself, “What if I had a computer program that could simulate what these chips will do instead of doing it all in my head?”

Years later my wish was granted and I started using logic simulation tools.

What is an RTL Simulator?

ChatGPT-4 – “An RTL (Register-Transfer Level) simulator is a tool used in digital circuit design to simulate the behavior of a circuit described at the register-transfer level. RTL is a high-level abstraction used to model the flow of data between hardware registers and the logical operations performed on that data. This level of abstraction is commonly used in hardware description languages (HDLs) like Verilog and VHDL.”

ChatGPT-4 AI goes on to explain some key points:

Purpose: RTL simulation helps designers verify that their digital circuits function correctly before moving on to more detailed design stages or physical implementation. It allows for testing different scenarios and inputs to catch potential errors or design flaws early.
Speed: Since RTL simulation operates at a higher level of abstraction, it is generally faster than gate-level simulation, which deals with the detailed implementation of the circuit.

The entire EDA tool flow of RTL to FPGA bitstream ( Synthesis, Map, Place, and Route ) takes time, typically anywhere from minutes to multiple hours depending on the size of the design. RTL Simulators allow for testing early on at the RTL level to see what the final design should behave like. Simulating early at the RTL level, before actually taking the RTL all the way to implementation in hardware, is a tremendous productivity improver.

Simulators also support stimulating individual components of a design – versus the entire chip at once. Oftentimes component level simulations simplify the process of “code coverage”, meaning testing all features of the design.

Some simulators also support final gate level simulations. Gate level simulations come in two flavors. Ideal gate level simulations have no gate or routing delays. SDF ( Standard Delay Format ) simulations use final actual timing for gate and route delays across both best-case and worst-case PVT ( Process Voltage and Temperature ).

Gate level simulations are horrifically slow and are typically performed for the most rudimentary of tests. The major advantage of RTL simulations over Gate-Level simulations is speed. If you know synthesis works ( it does ) and that your mapped, placed and routed design makes timing ( via static timing analysis reports ) simulating your design at the higher RTL level easily provides a 1,000x performance boost over gate level simulations. Performing a gate level simulation on a billion transistors is a new 21st century definition of impractical.

Simulators are Slow

Even fast simulators are incredibly slow compared to the actual circuit running at speed. Why? Every flip-flop in the design must be analyzed in software for every simulated hardware clock cycle. The math on this is very simple to understand. If you have a 100 MHz FPGA design and are attempting to simulate it with a 1 GHz CPU, it is a losing battle once your design has more than 10 flip-flops in it. FPGAs at 14nm can have up to 10 million flip-flops. What this means is if you are trying to simulate a CPU and “boot” an OS on the simulation, it might take weeks to get to a command prompt, even on a really fast and well engineered operating system like Linux.
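The back-of-envelope math can be written out as a small Python sketch. It assumes a deliberately crude cost model of one CPU cycle per flip-flop per simulated clock, which is far more optimistic than any real simulator. The function name and the model itself are illustrative assumptions only.

```python
def sim_seconds_per_real_second(design_hz, flops, cpu_hz, cpu_cycles_per_flop=1):
    """Wall clock seconds needed to simulate one second of hardware time."""
    cpu_cycles_per_sim_clock = flops * cpu_cycles_per_flop
    sim_clocks_per_second = cpu_hz / cpu_cycles_per_sim_clock
    return design_hz / sim_clocks_per_second

# 100 MHz design on a 1 GHz CPU, as in the text above
break_even = sim_seconds_per_real_second(100e6, 10, 1e9)
big_fpga = sim_seconds_per_real_second(100e6, 10_000_000, 1e9)

print("10 flip-flops  :", break_even, "x slower than hardware")
print("10M flip-flops :", big_fpga, "x slower than hardware")
print("1 s of hardware:", big_fpga / 86400, "days of simulation")
```

Under this model 10 flip-flops is exactly the break-even point, and a 10 million flip-flop design needs on the order of a million seconds ( about 11.6 days ) of CPU time for a single second of hardware time.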

This chapter dives into four different RTL simulators, ModelSim, VivadoSim, IcarusVerilog, and Verilator. The first two are closed source and proprietary. The last two are free and fully open source.

The following simple 4 bit counter design in Verilog will be used to demonstrate all four Simulators. Note that IcarusVerilog and Verilator are Verilog only simulators. ModelSim and VivadoSim support both Verilog and VHDL.

[top.v]
`timescale 1 ns/ 100 ps

module top
(
  input  wire       reset,
  input  wire       clk,
  output wire [3:0] led
);// module top

  reg [3:0] my_cnt = 4'd0;

always @ ( posedge clk ) begin
  if ( reset == 1 ) begin
    my_cnt <= 4'd0;
  end else begin
    my_cnt <= my_cnt[3:0] + 1;
  end
end
  assign led = my_cnt[3:0];

endmodule // top.v

ModelSim

ModelSim simulator is the “Gold Standard” for simulating HDLs ( Verilog + VHDL ). I’ve been using ModelSim for 30 years now and it is my default simulator.

To get started with ModelSim use the “vlog” or “vcom” command to compile top.v (above) or a top.vhd RTL source file. On my Linux workstation it looks like this:

[linux_console]
[khubbard@GLaDOS part_simulators]$ vlog top.v
Model Technology ModelSim - Intel FPGA Edition vlog 2020.1 Compiler 2020.02 Feb 28 2020
Start time: 16:16:43 on Sep 07,2024
vlog top.v 
-- Compiling module top

Top level modules:
	top
End time: 16:16:43 on Sep 07,2024, Elapsed time: 0:00:00
Errors: 0, Warnings: 0
[khubbard@GLaDOS part_simulators]$

If there are any problems ( syntax errors, etc ) with your RTL, the console will report them as errors and you will have to fix them before continuing. Verilog is more forgiving than VHDL and some things like missing port connections will be reported as warnings, which you can still simulate with.

This RTL file gets compiled into executable software under a ./work subdirectory. Simulators do not compile RTL into gates – only Synthesis does that. The compiled files generated under ./work are a special ModelSim binary meant for CPU execution. I like to think of them as a ModelSim equivalent to Java bytecode. The vlog/vcom compiled Verilog/VHDL does not become actual x86 machine code, but instead a custom binary instruction set that the ModelSim simulator can execute via an internal interpreter. Super fast. Like interpreted Python fast. Not Verilator fast, but save that thought for later.

Now that the top.v Verilog file is compiled ( vlog’d ), launching the simulator is as easy as typing “vsim top”. Note that it is “top” – the module’s name and NOT “top.v”, the filename for the module. For the most part, you can substitute a Verilog module for a VHDL module within a design so long as the port mapping is the same and ModelSim will not care. The command line console will look rather boring, but the GUI should pop up.

[linux_console]
[khubbard@GLaDOS part_simulators]$ vsim top
Reading pref.tcl

When the GUI launches, the design “top” will show up under “sim/structure” window on the left. If this was a hierarchical design ( it isn’t ), an expandable / collapsible hierarchy tree would appear here. Selecting a particular module in the tree would then update the “Objects” window which displays signals ( internals, inputs and outputs ) for that selected module.

By right-clicking on a particular signal, you can “Add Wave” it to the “Waveform”. Other options are to go to the CLI “Transcript” window at the bottom and type “add wave signal_name” or “add wave *”. These commands are all TCL scriptable within “DO” files too of course. GUIs are typically intuitive, but also slow. The CLI interface along with “DO” script files are your friend to being that “10x Engineer”.

Once signals have been assigned to the “Waveform” window, simulating the design is as easy as providing stimulus. This 4-bit counter design is super simple and only requires a clock stimulus via:
“force clk 0 5 ns, 1 10 ns -repeat 10 ns; run 100 ns” which will stimulate the design with a 100 MHz (10ns) clock. You also need to place the design in reset and then take it out of reset with a “force reset 1” followed by a “force reset 0”.

This can all be done from the CLI ( Command Line Interface ) in the transcript window.

Alternatively ( and recommended ) is to write a short “do script” that does the same. That script may then be rapidly modified ( or cloned ) to slightly change the stimulus of the design. Test Benches are great later on in the design process as a design matures. Early on in development ( when all the new RTL is being written ), having a collection of dozens to hundreds of “do script” files for stimulating a module under varying input stimulus scenarios is highly advantageous for “code coverage”.

The difference between writing a “do script” and a test bench is like the difference between jotting notes down on an engineering pad versus desktop publishing something and printing it out at your local Kinko’s. Yes, I realize both of those things no longer exist.

I typically break my collection of ModelSim “do scripts” into two categories:

ModelSim DO Script Categories:

Wave Files : Define which signals to view.
Force Files : Provide Stimulus to a design.

For the example design, my do file collection would start out looking like the below. The “_01” is so that I can make multiple versions of the same file with stimulus varying ever so slightly:

[force_top_01.do]
# My Comment : Run the counter coming out of reset
force clk 0 5 ns, 1 10 ns -repeat 10 ns
force reset 1; run 10 ns
force reset 0; run 10 ns
run 100 ns

[wave_top_01.do]
add wave /top/reset
add wave /top/clk
add wave /top/my_cnt -radix hexadecimal
add wave /top/led    -radix unsigned
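Since the numbered variants differ only slightly, they can also be stamped out by a script. The Python sketch below generates the text of a few force files that vary only the reset length, using the same ModelSim commands as force_top_01.do above. The helper name and the chosen reset lengths are assumptions made for illustration.

```python
def make_force_do(reset_ns, run_ns=100):
    """Return the text of a force do-file with a chosen reset length."""
    return "\n".join([
        "# Generated variant : reset held for %d ns" % reset_ns,
        "force clk 0 5 ns, 1 10 ns -repeat 10 ns",
        "force reset 1; run %d ns" % reset_ns,
        "force reset 0; run 10 ns",
        "run %d ns" % run_ns,
    ]) + "\n"

variants = {}
for index, reset_ns in enumerate([10, 20, 50], start=1):
    name = "force_top_%02d.do" % index
    variants[name] = make_force_do(reset_ns)

for name, text in variants.items():
    print("[%s]" % name)
    print(text)
```

Writing each entry of "variants" out to a file of the same name gives a small collection of do scripts ready to run from the transcript window.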

Executing the scripts from the CLI in the transcript window looks like this:

[transcript window]
do wave_top_01.do
restart -f; do force_top_01.do

The “restart -f” tells ModelSim to start the simulation over from T=0. This is useful for sitting in an iteration loop of modifying the stimulus file and simming it again. Pressing the “up arrow” will bring the last command back up. The ModelSim CLI also inherits many useful UNIX CLI tools including bang-bang (“!!”) and bang-n (“!n”) for command history repeats.

Something to remember about the ModelSim GUI is it has many windows which are “undockable” from the main GUI window. When a window is “docked”, the pulldown menu tools for that window are only visible (and available) when that particular window has focus. It takes a bit of getting used to.

I always “undock” the waveform “Wave” window as I am typically wanting to observe hundreds of signals across hundreds of clock cycles and need the waveform to fill up every pixel on my screen. It’s easy to see the challenges of simulating a billion transistor FPGA design. It’s impossible to observe everything on the screen at once. Having a collection of “wave” type “do” files is highly recommended.

The ModelSim simulator is truly the “Gold Standard” for both RTL and Gate-Level simulations in both Verilog and VHDL. It really is THAT good. ModelSim’s powerful UNIX influenced command line interface is the icing on the cake.

VivadoSim

VivadoSim is the simulator that is included with the AMD/Xilinx Vivado EDA tool. It includes a built-in waveform viewer but I prefer to use the GTKWave VCD viewer as a back-end viewer for VivadoSim. Together they are a great combination for simulating AMD/Xilinx IP blocks ( SERDES modules, PCIe, DDR cores, etc. ). Written by Tony Bybell, GTKWave is free and open-source and really an excellent VCD waveform viewer.

VivadoSim is a highly capable RTL and gate level simulator for simulating AMD/Xilinx FPGA designs. Attempting to design an Altera FPGA using VivadoSim will likely be a frustrating experience. IP things like FIFOs – you just won’t be able to simulate. Always be cautious about the horse you hitch your wagon to. VivadoSim is a one-trick pony.

TIP: I recommend against getting locked into a vendor specific tool as it creates a huge barrier against changing vendors for other reasons ( better chip pricing, performance, features, etc ).

Just like with ModelSim, with VivadoSim the Verilog RTL needs to be compiled prior to simulation. The first step is to create a “project” file which points to one or more Verilog ( and/or VHDL ) design source files.

[top.prj]
verilog work "$XILINX_VIVADO/data/verilog/src/glbl.v"
verilog work "top.v"

Note the “glbl.v” file, it is a necessary evil for running Verilog simulations on both ModelSim and VivadoSim. This first line tells the simulator to compile “glbl.v” from the Vivado install path pointed to by the TCL variable $XILINX_VIVADO.

Now that we have a project file, it is time to create a compile script. This can be a UNIX shell script or an MS-DOS batch file, it makes no difference. Just change the file extension from *.sh to *.bat if you are still stuck in a 1980’s MS-DOS CLI world. Also, I am terribly sorry.

[compile.sh]
xvlog -prj top.prj

This script is rather boring. It is telling the OS ( Linux, in my case ) to launch the command line executable “xvlog” in the Vivado install path and pass it the name ( and location ) of the project file to use. It turns out that “xvlog” is just another shell script that calls another program that is actually a compiled binary. EDA tools can be like that, and it is one of the interesting evolutions of UNIX / Linux. It is quite practical to have a single monolithic binary and have multiple shell scripts call it while they pretend to be different tools from the command line. Gets the job done. I am not complaining.

[linux_console]
[khubbard@GLaDOS part_simulators]$ source compile.sh
INFO: [VRFC 10-2263] Analyzing Verilog file "/opt/Xilinx/Vivado/2022.2/data/verilog/src/glbl.v"..
                     into library work
INFO: [VRFC 10-311] analyzing module glbl
INFO: [VRFC 10-2263] Analyzing Verilog file "/home/khubbard/nas/blackmesa/xilinx/xilinx_artix7/..
                     digilent_basys3/part_simulators/top.v" into library work
INFO: [VRFC 10-311] analyzing module top
[khubbard@GLaDOS part_simulators]$

Compiling the RTL design is as simple as typing “source compile.sh”. If there are any syntax errors or warnings in “top.v” , they will be reported in STDOUT.

So what did compiling “top.v” and “glbl.v” accomplish? We have to dig down into the subdirectory ./xsim.dir/work to see that it generated 3 new files.

The file “work.rlx” is actually a text file. It appears to be a table of contents for what is in the work directory including full hard paths to the source files. In a pinch a person could work backwards to figure out where a design was compiled from. The two “*.sdb” files “glbl.sdb” and “top.sdb” are purely binaries. I think of them as pseudo Java Byte Code representations of the compiled Verilog.

Now that “top.v” and “glbl.v” are compiled, it is time to simulate them. We will create a “simulate.sh” command line script ( or “*.bat” for MS-DOS users ) that looks like this:

[simulate.sh]
xelab top glbl -prj top.prj -s snapshot -debug typical -initfile=$XILINX_VIVADO/data/xsim/ip/xsim_ip.ini -L work -L xpm -L unisims_ver -L unimacro_ver
xsim snapshot -tclbatch do_files/do.tcl

Wow, that certainly is a lot to take in. Let us digest it piece by piece.
The first line with “xelab” takes some explaining. With the “-s snapshot” option, it will generate a “snapshot.wdb” binary file that is a single simulation “executable” of the compiled Verilog ( of “top.v” and “glbl.v” ) as well as a long list of pre-compiled AMD/Xilinx library primitives that are all linked with the “-L” flag.

These primitives are for things like flip-flops, clock tree buffers, SERDES transceivers. What is important to know is that any user generated IP for things like FIFOs will require these libraries for simulation. It is a key advantage to using VivadoSim over other simulators as all of the primitives are already included and pre-compiled, ready to be simulated.

The second line with “xsim snapshot -tclbatch do_files/do.tcl” says to simulate the “snapshot.wdb” binary file and apply stimulus as specified in the “do.tcl” file. What is in “do.tcl” ? A whole bunch of stuff. Note that “do_files/” is just an optional subdirectory where I chose to dump all my stimulus files in one place. I could have named the directory “asparagus” instead or left everything flat at one directory level.

[do.tcl]
open_vcd
source do_files/wave.tcl
source do_files/force.tcl
close_vcd
quit

Line-by-line breakdown:

open_vcd : Instructs the simulator to create a new VCD file to store simulator results to.
source do_files/wave.tcl : Instructs the simulator to execute an external file.
source do_files/force.tcl : Instructs the simulator to execute another external file.
close_vcd : Instructs the simulator to close out the newly created VCD file.
quit : Instructs the simulator to stop and exit back to the OS command line.

[wave.tcl]
log_vcd [get_objects -r * ]
#log_vcd {clk_wr}
#log_vcd {/fifo_1024x36/clk_wr}

The “wave.tcl” file specifies which signals to put in the output VCD file. The 1st line ( that has no preceding comment ) says to capture ALL signals of the design. I included the next two commented out lines just to show how to capture only certain signals at certain hierarchy instance levels.

[force.tcl]
add_force clk {0 0ns} {1 5ns} -repeat_every 10ns
add_force reset 1
run 10ns
add_force reset 0
run 100ns

The “force.tcl” provides stimulus to the design. The “add_force clk” line specifies a repeating 100 MHz ( 10 ns ) clock of 50/50 duty cycle. The “run” commands tell the simulator the length of time to simulate between force commands. If you have wide inputs, you can specify a non-binary radix, for example “add_force my_byte AA -radix hex”.

After running “simulate.sh”, the simulator will spit out “dump.vcd”. VCD ( Value Change Dump ) files are clear text files that record a timestamp and a new value each time a signal changes, much like Run Length Encoding. They are not intended to be read by humans. That said, if you study them long enough, they are far easier to decipher than some Enigma encrypted proprietary binary file format.
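To show how approachable the format is, here is a minimal Python sketch of a VCD reader. It only handles the basics ( "$var" declarations, "#" timestamps, scalar and "b" vector value changes ) and the sample text is hand-written for this illustration, not the actual dump.vcd from the simulation above.

```python
# Hand-written sample in VCD syntax. "!" and "%" are the short identifier
# codes that the header assigns to the signals clk and led.
SAMPLE_VCD = """\
$timescale 100ps $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 4 % led [3:0] $end
$upscope $end
$enddefinitions $end
#0
0!
b0000 %
#50
1!
b0001 %
#100
0!
"""

def parse_vcd(text):
    """Return {signal_name: [(time, value_string), ...]} for a simple VCD."""
    names = {}        # VCD identifier code -> signal name
    changes = {}      # signal name -> list of (time, value)
    now = 0
    in_header = True
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if in_header:
            if words[0] == "$var":
                # $var <type> <width> <id code> <name> [range] $end
                names[words[3]] = words[4]
                changes[words[4]] = []
            elif words[0] == "$enddefinitions":
                in_header = False
        elif words[0].startswith("#"):
            now = int(words[0][1:])                                  # timestamp
        elif words[0][0] in "bB" and len(words) == 2:
            changes[names[words[1]]].append((now, words[0][1:]))     # vector
        elif words[0][0] in "01xXzZ":
            changes[names[words[0][1:]]].append((now, words[0][0]))  # scalar
    return changes

waves = parse_vcd(SAMPLE_VCD)
for name, events in waves.items():
    print(name, events)
```

A real VCD can contain more than this sketch handles ( real numbers, multiple scopes with repeated signal names ), so treat it as a starting point rather than a full parser.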

Viewing “dump.vcd” using GTKWave is simply a matter of opening the file and selecting the signals to view. For this simple design I clicked on “top” which then displayed all of the signal names on the left bottom pane. I selected the signals I wanted and then clicked Append, which added those and only those signals to the waveform display. GTKWave is available on GitHub and also as pre-compiled binaries for various platforms. GTKWave is an excellent viewer that I can not recommend enough. It is true that Vivado has an internal waveform viewer. I am just not interested in learning it as GTKWave works with any tool that can export a VCD file. Manual is available here https://gtkwave.sourceforge.net/gtkwave.pdf.

As a footnote, ModelSim will also happily display the output VCD file for you. You just need to convert from VCD to WLF format using the command line tool called (wait for it) ….. “vcd2wlf” that is included with ModelSim.
%vcd2wlf dump.vcd dump.wlf

So why not just launch vsim with a “-view dump.vcd” flag? That is a very good question. I suspect it is to encourage users to use the ModelSim built in viewer as it consumes a license while using GTKWave does not.

Using ModelSim to view VCD results from VivadoSim may seem silly, but I have actually used this path many times. Simulating hard or encrypted IP blocks from AMD/Xilinx can be difficult using ModelSim and they often just work with VivadoSim. I am not familiar with the VivadoSim GUI though. Running the simulation with VivadoSim and viewing the results with ModelSim is actually my preferred path. This isn’t implying that GTKWave isn’t great ( it is ). I am just used to using ModelSim and it is always my first choice.

IcarusVerilog
IcarusVerilog ( aka Iverilog ) is an open-source compiler and simulator developed by Stephen Williams. Iverilog is just a compiler and simulator. There is no integrated waveform viewer or even a fancy GUI interface. It’s a command line console tool that compiles and simulates Verilog and can output a VCD file for viewing with a tool like GTKWave. It’s also free.

Installation on my Ubuntu 22.04.3-desktop-amd64 was super simple.

[ubuntu_linux_console]
sudo apt install iverilog

Unfortunately Iverilog does not support force files, so designs must be simulated using a Verilog test bench. Test benches will be described in detail in another chapter. The following is a test bench that stimulates the counter design and generates an output VCD file.

[tb_top.v]
`timescale 1 ns/ 100 ps
`define CLK_PRD 10

module tb_top
(
); // module tb_top
  reg        clk;
  reg        reset;
  wire [3:0] led;

//--------------------------------------------------------------
// Startup Stuff
//--------------------------------------------------------------
initial
begin
  $display("Welcome to Iverilog Simulation");
  $dumpfile("tb_top.vcd");// VCD file for ModelSim or GTKwave
  $dumpvars(1, tb_top.u_top );// Dump only this level to VCD. 0=hier
end

//--------------------------------------------------------------
// 100 MHz Clock Oscillator
//--------------------------------------------------------------
initial
begin
  clk <= 0;
  #(`CLK_PRD/2) forever
   #(`CLK_PRD/2) clk <= ~ clk;
end

//--------------------------------------------------------------
// Reset Pulse
//--------------------------------------------------------------
initial
begin
  $display("In Reset        T=%t", $time);
  reset <= 1;
  #(`CLK_PRD)
  reset <= 0;
  $display("Out of Reset    T=%t", $time);
  #(`CLK_PRD*10)
  $display("End Simulation  T=%t", $time);
  $finish;
end

// -------------------------------------------------------------
// Instantiate the Unit Under Test
// -------------------------------------------------------------
top u_top
(
  .clk    ( clk      ),
  .reset  ( reset    ),
  .led    ( led[3:0] )
);

endmodule // tb_top

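The first step to simulation is to compile the design. That’s accomplished with the iverilog command pointing to a project file that lists all of the Verilog files. Note that by default the output ( “runme” here ) isn’t native 80×86 machine code. It is an intermediate file that Iverilog’s “vvp” runtime engine executes.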

[compile.sh]
iverilog -o runme -c top.prj

[top.prj]
tb_top.v 
top.v

[linux_console]
%./runme
Welcome to Iverilog Simulation
VCD info: dumpfile tb_top.vcd opened for output.
In Reset        T=                   0
Out of Reset    T=                 100
End Simulation  T=                1100

Once the simulation has finished, the output VCD file may be viewed with GTKWave as before.

[linux_console]
%gtkwave tb_top.vcd

Verilator
Like IcarusVerilog, Verilator is a free and open-source Verilog simulator written by Wilson Snyder. They are not the same though. IcarusVerilog is a traditional RTL simulator which supports multiple clocks and time delays. Verilator is a compiler that converts Verilog to C++ in order to produce extremely fast, multithreaded simulation models of a single clock Verilog design.

ChatGPT-4 – “Verilator is an open-source tool that converts Verilog into C++. Unlike traditional simulators, Verilator acts as a compiler, generating highly optimized, cycle-accurate models that can be used for simulation and verification purposes.”

So what is Verilator good for? Super fast simulations of Verilog IP blocks with single clock domains. Think digital filters, RISC-V CPU cores. Things like that. Is Verilator a chip simulator? No. It won’t simulate PLLs, FIFOs, IP blocks like PCIe and DDR memory interface controllers. Verilator does one very simple thing ( convert simple single clock Verilog to C++ ) and it does it very well.

To get started with Verilator we need to install it. On Ubuntu Linux it’s a very simple process.

[linux_console]
sudo apt install verilator

From there we need to create a test bench in C++. I will also create a “tb_top.v” Verilog file that isn’t really a test bench. Instead it is a simple wrapper around the “top.v” counter design. I do this only to provide an example of simulating a design with multiple Verilog files.

[tb_top.v]
`timescale 1 ns/ 100 ps

module tb_top
(
  input  wire       clk,
  input  wire       reset,
  output wire [3:0] led
); // module tb_top


// --------------------------------------------------------
// Instantiate the Unit Under Test
// --------------------------------------------------------
top u_top
(
  .clk    ( clk      ),
  .reset  ( reset    ),
  .led    ( led[3:0] )
);

endmodule // tb_top

The actual test bench stimulus must be written in C++ and looks like this:

[tb_top.cpp]
#include <stdlib.h>
#include <iostream>
#include <verilated.h>
#include <verilated_vcd_c.h>
#include "Vtb_top.h"

#define MAX_SIM_TIME (11*2) // 11 clock cycles, two half-cycles each
vluint64_t sim_time = 0;

int main(int argc, char** argv, char** env) {
  Vtb_top *dut = new Vtb_top;

  Verilated::traceEverOn(true);
  VerilatedVcdC *m_trace = new VerilatedVcdC;
  dut->trace(m_trace, 5);
  m_trace->open("tb_top.vcd");

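  while (sim_time < MAX_SIM_TIME)
  {
    // Hold reset high for the first full clock cycle ( two half-cycles )
    dut->reset = 0;
    if ( sim_time < 1*2 )
    {
      dut->reset = 1;
    }
    std::cout << "NOTE: "
              << "led = " << (int)( dut->led )
              << " simtime = " << sim_time << std::endl;
    dut->clk ^= 1;            // Toggle the clock every half-cycle
    dut->eval();              // Evaluate the model with the new inputs
    m_trace->dump(sim_time);  // Append this time step to the VCD file
    sim_time++;
  } // while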

  m_trace->close();
  delete dut;
  exit(EXIT_SUCCESS);
}

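Next step is to have Verilator convert the Verilog into C++ and then compile that C++ into an 80×86 executable. As of this writing, Verilator doesn’t have an Iverilog style project file flag, but its “-f” option will read command line arguments ( Verilog file names included ) from a text file, which gets close. Below I just list the files on the command line.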

[compile.sh]
verilator -Wall --trace -cc tb_top.v top.v --exe tb_top.cpp
make -C obj_dir -f Vtb_top.mk Vtb_top

Now that it is compiled, it is time to execute the 80×86 program. I like to call these “go” scripts.

[go.sh]
./obj_dir/Vtb_top

The output will be the VCD file “tb_top.vcd” which may then be viewed with ….. ( wait for it ) … GTKWave.

In closing, I like what Verilator is trying to do. It is a VERY fast Verilog to 80×86 converter, which would be great if I was designing a RISC-V core and needed to “execute” software on my Verilog design. That said, it isn’t a real digital logic simulator. ModelSim remains my favorite digital logic simulator. I just wish I’d had it back in 1993.

[EOF]

http://blackmesalabs.wordpress.com/?p=2833

BML FPGA Design Tutorial Part-14ofN : Clocking

kevinhub88 Aug 25, 2024 Updated Aug 25, 2024

2024.08.25 : I’m BSEE Kevin Hubbard from Seattle, WA. When a stranger asks what I do, I typically answer “I design computer chips” and they usually stare at me blankly. I don’t actually work for Nvidia, AMD or even Intel. I use the term “computer chips” generically for “large digital semiconductor chips” which is a mouthful that very few understand. For 30+ years I’ve been designing digital ASICs and FPGAs. I’m giving back now in writing this “Getting started with FPGAs” series which starts here. I hope others may learn a little bit from my experiences going back to the early 1990’s.

An entire chapter just on clocking? Yes! It’s a complicated subject, probably worthy of a book in itself. Every Flip-Flop needs a clock and for two or more Flip-Flops to communicate reliably and efficiently, those flops must get clocks that are both frequency locked and phase aligned.

So what is a clock?

ChatGPT-4 : A clock is a crystal that vibrates at a specific frequency when electricity is applied. The clock keeps everything in sync within the computer system, ensuring orderly execution of tasks and processes.

Think of a clock as the conductor that keeps the entire orchestra in sync with each other. A singular clock is an essential feature for digital logic design. It’s the conductor that says “Hey flip-flops, NOW is the time to sample your D input and latch it to your Q output”. Without a clock we might as well go back to building analog computers that are never quite right. ( NO, let’s NOT do THAT! )

What is Frequency Locked?

Frequency Locked means having a single crystal oscillator that is used (distributed) across an entire digital system. It’s easy to assume that two chips, each with its own “100 MHz” crystal oscillator, would be frequency locked. They aren’t. They’re very close, but not locked. Their frequencies will vary by a small fraction of a percent. Just enough to matter after a few million clock cycles.

Think of two conductors of two orchestras that start out watching each other in perfect synchronization but eventually continue on by themselves. They will eventually drift slightly apart – as will the two orchestras that they are leading. It’s a tiny amount, but the error accumulates over time. After enough time has passed the two orchestras are playing two different songs. Having two crystals in a system is exactly like that.
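To put a rough number on that drift, here is a minimal Python sketch. The ±50 ppm tolerance is purely an assumption for illustration ( it is not from any datasheet ), and the function name is made up for this example. It computes how many clock cycles pass before one “100 MHz” oscillator has gained one full cycle on the other.

```python
from fractions import Fraction

NOMINAL_HZ = 100_000_000  # both oscillators are nominally "100 MHz"
PPM = 50                  # ASSUMED tolerance, illustrative only

def cycles_until_slip(nominal_hz, ppm_a, ppm_b):
    """Cycles of oscillator A before it is one full cycle ahead of B."""
    f_a = Fraction(nominal_hz) * (1_000_000 + ppm_a) / 1_000_000
    f_b = Fraction(nominal_hz) * (1_000_000 + ppm_b) / 1_000_000
    # A gains abs(f_a - f_b) cycles on B every second.
    seconds_to_slip = 1 / abs(f_a - f_b)
    return f_a * seconds_to_slip

# Worst case: one part is fast by PPM, the other is slow by PPM.
worst_case = cycles_until_slip(NOMINAL_HZ, +PPM, -PPM)
print(f"One full cycle of slip after ~{float(worst_case):.1f} cycles")
```

Tighter oscillators stretch that number out, but it never becomes infinite unless both chips share the same clock source.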

In the very old days of the ancients clocking was simple. You would have a single system clock oscillator that might be divided down to some slower, early digital logic friendly frequency and distributed across a printed circuit board for all ICs with flip-flops to use. A good example is the original Apple ][ computer with a single 14.31818 MHz crystal oscillator.

On Page-24 of “The Apple II Circuit Description” (1983), Winston Gayler explains how a single 14.31818 MHz crystal is divided down to run the 8bit 6502 CPU at 1.023 MHz. Why that funny frequency? Using discrete digital logic chips at the time, it was easy to divide 14.31818 MHz by 14 to get 1.023 MHz, a frequency close to, but not above the original 6502’s Fmax.

Now get this, the original IBM PC used THE SAME 14.31818 MHz crystal but divided by 3 to run the 8088 CPU at 4.77 MHz. What about the 68000 CPU in the Commodore Amiga 1000? It ran at 7.15909 MHz which is ( wait for it ) HALF of 14.31818 MHz.
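The divide ratios above are easy to check. This is just a quick Python sketch of the arithmetic using the 14.31818 MHz figure quoted in the text. The dictionary names are my own.

```python
from fractions import Fraction

MASTER_HZ = Fraction(14_318_180)  # 14.31818 MHz master crystal

# Integer divide ratio used by each machine, as described above.
dividers = {
    "Apple II 6502": 14,
    "IBM PC 8088": 3,
    "Amiga 1000 68000": 2,
}

cpu_mhz = {name: float(MASTER_HZ / div) / 1e6 for name, div in dividers.items()}

for name, mhz in cpu_mhz.items():
    print(f"{name:18s} {mhz:.5f} MHz")
```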

Why were so many early computers using the same odd 14.31818 MHz master clock? That frequency just happens to be 4x of the 3.579545 MHz NTSC color burst standard.

Back in the 1970’s and 1980’s if you wanted a computer to have a color video output, that video interface had to be centered around the 3.579545 MHz color burst frequency from the 1953 NTSC color television standard. Early computers used CRTs designed for Color Television since they were made in high volume and available for lower cost. Custom computer specific CRTs with better resolution and custom timing ( EGA, VGA, etc ) came much later, not until computers became more popular.

The math of NTSC may seem strange but it all adds up ( or divides down as the case is ). It’s been triple-checked with slide rules. So why 3.579545 MHz exactly? Well it is 315/88 MHz. That frequency is 455/2 times the 15.7 kHz line rate, which in turn is 262.5 times the field rate of 59.94 Hz ( 29.97 fps interlaced ).
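For anyone without a slide rule handy, the same chain of ratios can be worked with exact fractions. This is only a sketch of the arithmetic in the paragraph above. The variable names are my own.

```python
from fractions import Fraction

color_burst_mhz = Fraction(315, 88)        # 3.579545... MHz color burst
master_mhz = color_burst_mhz * 4           # 14.31818... MHz master crystal

# Color burst is 455/2 times the horizontal line rate.
line_rate_hz = color_burst_mhz * 1_000_000 / Fraction(455, 2)

# Line rate is 262.5 times the field rate. Two fields make one frame.
field_rate_hz = line_rate_hz / Fraction(525, 2)
frame_rate_hz = field_rate_hz / 2

print(f"master crystal : {float(master_mhz):.5f} MHz")
print(f"line rate      : {float(line_rate_hz):.3f} Hz")
print(f"field rate     : {float(field_rate_hz):.4f} Hz")
print(f"frame rate     : {float(frame_rate_hz):.4f} Hz")
```

The field rate falls out as exactly 60000/1001 Hz, which is where the familiar 59.94 figure comes from.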

Early computers had video frame buffer memory that was shared with the CPU main memory. Running the CPU at a NTSC friendly ( integer dividable ) frequency was important. By NOT doing this, you’d end up with video jitters and computer users would likely upchuck their lunches.

The history of how original 1941 B&W NTSC was tweaked in 1953 to add color without obsoleting existing B&W TV’s is a fascinating read. Those ancient engineers were absolute giants in pulling that off. Read up on NTSC RS-170a technical details. As a footnote, the nearly obsolete technology of NTSC color TV lives on today in inexpensive automotive backup cameras.

In the 1990’s era of ASICs using a single system clock was still common. Wide 32bit parallel buses like PCI-33 would have a single 33 MHz oscillator that would fanout to expansion cards. All of the address and data bits would be launched and captured from the same clock source. As PCI rates increased from 33 MHz to 66 MHz and finally 133 MHz this job became more and more difficult.

The original point of a common clock is that one device ( say the CPU ) could clock data out and the receiving device ( say a peripheral card ) would clock the data in on the next clock edge using the same clock. This worked fine at 33 MHz, but as clocks got faster and faster the clock tree insertion delay added more and more skew.

ChatGPT-4 : Clock Insertion Delay (or Clock Latency) refers to the time it takes for the clock signal to travel from its source (the clock definition point) to the flip-flops (registers) that receive the signal.

In AMD/Xilinx parlance, a clock tree is a BUFG primitive. It is common to think of a clock tree as a wire, or maybe a single gate. In reality they tend to be a long series of inverter gates forming a wide fanout tree. That large tree has significant capacitance and gate propagation delay.

At 33 MHz ( 30 ns period ) a clock insertion delay of 5-10 ns for a large chip is manageable. At 133 MHz ( 7.5 ns period ) that same clock insertion delay would be disastrous.
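A small Python sketch makes the comparison concrete. It uses the 5 ns and 10 ns insertion delays from the sentence above and treats “33 MHz” and “133 MHz” as 33.33 MHz and 133.33 MHz so the periods come out to the 30 ns and 7.5 ns quoted. The function names are my own.

```python
from fractions import Fraction

def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return Fraction(1000) / Fraction(freq_mhz)

def delay_share(freq_mhz, insertion_delay_ns):
    """Fraction of one clock period consumed by the insertion delay."""
    return Fraction(insertion_delay_ns) / period_ns(freq_mhz)

PCI_33 = Fraction(100, 3)    # 33.33 MHz
PCI_133 = Fraction(400, 3)   # 133.33 MHz

for name, freq in (("33 MHz", PCI_33), ("133 MHz", PCI_133)):
    for delay_ns in (5, 10):
        share = delay_share(freq, delay_ns)
        print(f"{name}: period {float(period_ns(freq)):.1f} ns, "
              f"{delay_ns} ns delay uses {float(share) * 100:.0f}% of it")
```

At 133 MHz a 10 ns insertion delay is longer than the clock period itself, which is why the PLL tricks described next became necessary.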

Around the 2000’s digital chips ( including FPGAs ) began including PLLs, which are little oscillators that can “lock” onto a reference clock and do fancy things like phase shift a clock or even multiply a clock to faster frequencies. Initially these PLLs were often used to remove the clock tree insertion delay by advancing the phase of the PLL generated clock until the phase at the clock tree leaf node aligned with the input reference clock.

Source : Wikipedia

Back in the early days when PLLs were slow ( capable of only running at your crystal clock frequency ), a neat trick they could do was entirely remove the clock tree insertion delay. By providing the PLL with both the reference clock pin input and a leaf node of the clock tree as feedback, the PLL could advance its generated clock phase relative to the reference clock until the clock tree and reference clock were phase aligned. Early PLLs seemed to predict the future and it was magical.

Modern PCIe uses SERDES to transmit and receive data at rates from 2.5 Gb/s ( PCIe Gen-1.1 2005 ) to 64 Gb/s ( PCIe Gen-6.0 2021 ). Every PCIe slot, even the x1, still has a 100 MHz reference clock as part of a Common Clock architecture. It’s the equivalent of the 14.31818 MHz clock of the 1980s. The difference is that instead of dividing DOWN the master clock to get 1.023 MHz ( 6502 ) or 4.77 MHz ( 8088 ), PLLs now multiply the reference UP, from 100 MHz to PCIe Gen-1.0 rates of 1 GHz or greater.

Early FPGAs like the 350nm XC4036XL had a reasonable, but fairly limited ( eight ) number of BUFG global clock trees. The 28nm 7-Series from AMD/Xilinx supports up to 32. Don’t go wild. A good design practice is to use no more global clock trees than the design actually needs.

A very simple clock implementation in Verilog targeting 7-Series might look like this:

module top
(
  input  wire clk_100m   
);// module top

  wire        clk_100m_loc;
  wire        clk_100m_tree;
  reg  [3:0]  my_cnt = 4'd0;

  IBUF u0_ibuf ( .I( clk_100m     ), .O( clk_100m_loc  ) );
  BUFG u0_bufg ( .I( clk_100m_loc ), .O( clk_100m_tree ) );

always @ ( posedge clk_100m_tree ) begin
  my_cnt <= my_cnt[3:0] + 1;
end

endmodule // top

Note that both IBUF and BUFG are required. IBUF translates the external pin voltage ( 3.3V as an example ) to the internal core FPGA voltage ( 1.0V as an example ). BUFG is the actual clock tree that is large, slow ( large Tpd relative to the IBUF ) and capable of driving thousands of loads.

Distributing a clock across a PCB using LVDS is fairly common, which would require an IBUFDS instead of an IBUF.

module top
(
  input  wire clk_100m_p,
  input  wire clk_100m_n
);// module top

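  wire        clk_100m_loc;
  wire        clk_100m_tree;

  // Differential input buffer : note IBUFDS, not IBUF
  IBUFDS u0_ibufds (.I(clk_100m_p), .IB(clk_100m_n), .O(clk_100m_loc));
  BUFG   u0_bufg   (.I( clk_100m_loc), .O( clk_100m_tree ));

endmodule // top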

So what happens if you don’t instantiate the IBUF and/or the BUFG? Most modern synthesis tools will spot your mistake and go ahead and infer them both for you.

module top
(
  input  wire clk_100m
);// module top

  reg  [3:0]  my_cnt = 4'd0;

always @ ( posedge clk_100m ) begin
  my_cnt <= my_cnt[3:0] + 1;
end

endmodule // top

If you don’t believe me, build the above and export a structural netlist and look for IBUF and BUFG.

A basic rule of digital logic design is that a flip-flop can have but one clock. A BUFGMUX primitive can sort of break that rule by placing a MUX directly at the trunk of the BUFG clock tree. The tricky part is controlling that mux and cleanly making a transition between two clock domains. Reasons for doing this? Power savings or perhaps a design specific requirement like interfacing to an external ADC that must run at two different sample frequencies.

module top
(
  input  wire ck_sel,  
  input  wire clk_100m,
  input  wire clk_125m
);// module top

  wire        clk_100m_loc;
  wire        clk_125m_loc;
  wire        clk_muxd_tree;
  reg  [3:0]  my_cnt = 4'd0;

  IBUF u0_ibuf ( .I( clk_100m ), .O( clk_100m_loc ) );
  IBUF u1_ibuf ( .I( clk_125m ), .O( clk_125m_loc ) );

  BUFGMUX u0_bufgmux (  .S( ck_sel        ),
                       .I0( clk_100m_loc  ),
                       .I1( clk_125m_loc  ), 
                        .O( clk_muxd_tree ) );

always @ ( posedge clk_muxd_tree ) begin
  my_cnt <= my_cnt[3:0] + 1;
end

endmodule // top

It’s possible to synthesize your own clocks from a master clock using integer divides. This is rather “old school” but still works. Unfortunately it involves PIP routing for the net clk_25m_loc, so there will likely be varying skew between the two clock trees from build to build.

module top
(
  input  wire clk_100m
);// module top

  wire        clk_100m_loc;
  wire        clk_100m_tree;
  reg         clk_25m_loc = 1'b0; // reg, since it is assigned in an always block
  wire        clk_25m_tree;
  reg  [1:0]  div4_cnt = 2'd0;

  IBUF u0_ibuf ( .I( clk_100m     ), .O( clk_100m_loc  ) );
  BUFG u0_bufg ( .I( clk_100m_loc ), .O( clk_100m_tree ) );
  BUFG u1_bufg ( .I( clk_25m_loc  ), .O( clk_25m_tree  ) );

always @ ( posedge clk_100m_tree ) begin
  div4_cnt    <= div4_cnt[1:0] + 1;
  clk_25m_loc <= div4_cnt[1];
end

endmodule // top
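To see why bit 1 of a 2-bit counter gives a divide-by-4, here is a small behavioral sketch in Python. It is only a model of the always block above, not a replacement for simulating the Verilog. The function name is made up for this example, and both registers are assumed to start at zero.

```python
def simulate_div4(num_cycles):
    """Model the divide-by-4 always block for num_cycles rising edges.

    Returns the value of clk_25m_loc after each 100 MHz clock edge.
    """
    div4_cnt = 0   # 2-bit counter, assumed to start at 0
    clk_25m = 0    # divided clock register, assumed to start at 0
    trace = []
    for _ in range(num_cycles):
        # Non-blocking assignments: both right-hand sides see OLD values.
        next_cnt = (div4_cnt + 1) & 0x3
        next_clk = (div4_cnt >> 1) & 0x1
        div4_cnt, clk_25m = next_cnt, next_clk
        trace.append(clk_25m)
    return trace

print(simulate_div4(8))
```

The output stays low for two 100 MHz cycles and high for two, so one full period is four input clocks, or 25 MHz.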

A common practice these days is to use a PLL to generate a fast local PLL clock which is then divided down to generate multiple BUFG global clocks of FPGA friendly frequencies. For AMD/Xilinx 7-Series I recommend looking up MMCME2_ADV. It’s a bit much to instantiate one here, but an example design using an MMCME2_ADV with a wrapper (my_pll) might look like this.

module top
(
  input  wire reset,   
  input  wire clk_100m
);// module top

  IBUF u0_ibuf ( .I( clk_100m     ), .O( clk_100m_in   ) );
  BUFG u0_bufg ( .I( clk_100m_loc ), .O( clk_100m_tree ) );
  BUFG u1_bufg ( .I( clk_133m_loc ), .O( clk_133m_tree ) );
  BUFG u2_bufg ( .I( clk_160m_loc ), .O( clk_160m_tree ) );

my_pll u0_my_pll
(
 .reset    ( reset        ),
 .pll_lock ( pll_lock     ),
 .clk_ref  ( clk_100m_in  ), // x8 = 800 MHz
 .clk_100m ( clk_100m_loc ), // Div-8 = 100 MHz
 .clk_133m ( clk_133m_loc ), // Div-6 = 133 MHz
 .clk_160m ( clk_160m_loc )  // Div-5 = 160 MHz
);

endmodule // top
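The multiply and divide values in the comments above can be checked with a few lines of Python. This is only the arithmetic. The x8 multiplier and the 8, 6 and 5 dividers come from the comments on the my_pll wrapper, and the variable names are my own.

```python
from fractions import Fraction

REF_MHZ = 100        # reference clock into the PLL
MULTIPLY = 8         # x8 = 800 MHz fast internal PLL clock
pll_fast_mhz = REF_MHZ * MULTIPLY

# Each output is the fast clock divided by an integer.
output_dividers = {"clk_100m": 8, "clk_133m": 6, "clk_160m": 5}
outputs_mhz = {name: Fraction(pll_fast_mhz, div)
               for name, div in output_dividers.items()}

for name, mhz in outputs_mhz.items():
    print(f"{name} : {float(mhz):.2f} MHz")
```

Because every output is an integer divide of the same 800 MHz clock, all three stay frequency locked to the one 100 MHz reference.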

This type of clocking architecture supports maintaining a single crystal oscillator across a system while still supporting a multitude of different chip localized clocks for things like PCIe, DRAM interface, etc.

Another clocking topic is regional clock trees ( BUFRs ) which are useful for small and fast external non-SERDES chip-to-chip interfacing. A common interface technique is to generate a “source-synchronous” clock at the sending device which is forwarded along with data to a receiving device. Since they have the same time of flight and I/O standard, skew between them is minimal ( a few 100 ps typically ). At the receiving device the received clock goes into a BUFR and is used only to sample the data and push immediately into a dual-clock FIFO. The pop side of the FIFO is then a BUFG. Same frequency, but not phase aligned ( and because of the FIFO, it doesn’t matter ). Look up ISERDES and OSERDES in the 7-Series for more details.

Then there are BUFH ( Horizontal Clock Tree ) buffers – which I have yet to use, but they look interesting.

That’s it for clocking for now. I’ll add more later if I think of things I forgot.

EOF

http://blackmesalabs.wordpress.com/?p=2789

BML FPGA Design Tutorial Part-13ofN

kevinhub88 Jul 21, 2024 Updated Jul 22, 2024

FIFOs ( First-In, First-Out )

2024.07.21 : I’m BSEE Kevin Hubbard from Seattle, WA. Starting in 1995 with a 0.35µm (350nm) Xilinx 4036XL, I’ve spent the majority of my 30+ year career designing digital logic chips ( ASICs and FPGAs ) for embedded systems. It’s been amazing watching Moore’s Law enable FPGAs to become more and more powerful over the decades. In many industries they’ve overtaken ASICs for signal processing. It’s been a challenging career for sure and I’ve learned a lot. I’m giving back now in writing this “Getting started with FPGAs” series which starts here. It’s a memoir of my technical FPGA knowledge gained over the decades. I truly hope that others may learn from it.

Looking back, amongst my peers, my specialty is fast and small circuit designs – which isn’t to be confused with small chip designs. I actually work on really large chip designs. A chip is built up from many components with all of those components containing little digital circuits ( counters, state machines, etc ).

What I’ve discovered over the decades is that young engineers ( those under-50 kids who didn’t grow up in the era of the Ancients ) have an unchecked tendency to write bloated RTL circuit designs. Circuits that simulate fine and work just fine, but take up more flip-flops and LUTs than are absolutely necessary.

Always ask yourself – “Could this be done with fewer gates?” Never be afraid to throw away a working module and write it again. Refactoring your RTL is NOT a waste of time. It’s like what Oscar Goldman once said : “We can rebuild him. We have the technology. We can make him better than he was. Better, stronger, faster.” And yeah, Steve Austin below appears to be wearing a suicide-vest, WTF?

I don’t say this disparagingly at all. If you didn’t survive the 1942-1943 siege of Stalingrad, I really wouldn’t expect you to know how to make rat meat and wallpaper paste stew either. Same holds true for designing really small digital circuits if you didn’t grow up in the 1980’s and 1990’s era of 22V10 PALs and 7064 CPLDs. An absolutely ridiculous false equivalency comparison, I know, but it’s my blog. I’ve got the “Talking Pillow”.

With any luck, in the 2040’s or 2050’s I will be called out of retirement ( like today’s Cobol software engineers ) to make some change in a production FPGA so that the modified design both fits and makes timing. That’s assuming sub 1nm FPGAs are still built using CMOS transistors, interface via electrons and not photons and are still designed in RTL. I look forward to being one of those “Ancients” some day.

In Part-12 I explained internal FPGA RAMs and ROMs. In Part-13 I will explain FIFOs – which are special control blocks that use embedded RAMs for both transferring data ( reliably ) between dissimilar clock domains and temporarily storing data until a destination resource is available.

A FIFO is a LOT like a popular ride line at Disneyland. They both do in-order rate converting. A ride can only carry so many people at a time. As people slowly stroll into the line (push), they stack up in order. When the previous ride finishes, the queue then rapidly takes out ( pops ) N number of people. Now the people entering the line are at one pace ( slow, but relatively constant average ) while the people getting out of the line ( and into the ride ) are at a much different pace ( very fast, but bursty – only when the ride cars are emptied ).

Those two dissimilar rates are like two dissimilar clock domains. FIFOs are the electronic equivalent of a Disneyland line ( queue ) that connects the two without resulting in public mayhem of many Mouseketeer Ears getting broken.

For ASIC and FPGA designs, transferring information from one clock domain to another is a necessary and complicated evil. On the surface, it may seem trivial to transfer bits from say an external 100 MHz bus to an internal 133 MHz core fabric clock – but it isn’t. It’s complicated.

In the simplest scenario you can transfer a “false path” signal between domains using synchronizer flops that protect against metastability. This works fine and well for things like semaphore signaling where you are allowed multiple clock cycles to get a single bit of information across domains. This falls apart though when you need to send a new bit of information on every clock. For this, we need FIFOs.

A FIFO is a complicated bit of logic that uses a dual-clock RAM to safely transfer information at full-speed from one clock domain to another. The trick is to have a controller ( logic circuits ) that operates on both clock domains and ensures that data coming in ( “Write Port” ) never gets written to the same RAM location as data going out ( “Read Port” ).

The super tricky part is determining on both clock domains how much data is still in the FIFO. “Split an atom? I just need a really sharp chisel and a hammer.” Actually, Oppenheimer, it’s a bit more complicated than that.

I’m going to say something and I truly mean this: Don’t attempt to design your own dual-clock FIFO controller. It’s a complicated and already solved problem. It’s easy to get it right for 99.999% of the push-pop clock cycles. That last 0.001% cycle matters. Trust me on this. USE YOUR VENDOR’S FIFO IP!!

Now after saying all that, I’m going to design and share with you a really simple single-clock FIFO. I do this ONLY as a demonstration of what goes into a FIFO. What good is a single clock FIFO? They’re useful if you have a shared resource that requires arbitration. As an example, maybe two bus masters that both want to write simultaneously to a single device ( like a DRAM ) that has a single access port. Two FIFOs can be used to hold (store) write requests between the two masters until the single device is free.

My single clock FIFO is VERY simple, but also a bit long to post in this blog, so I will break it up into 5 parts which you may choose to concatenate together. Header, Write Port, Read Port, Full Counter and Flags. If you copy and paste these sections all together, you’ll get “fifo.v” which can be simulated.

fifo.v : Header : It declares the size of the FIFO ( 8 bits wide, 16 deep ) using constant parameters. This makes it easy to adjust in the future for different aspect ratios. There is a common asynchronous “reset” net and a “clk”. An “A-Port” is for writing (pushing) and a “B-Port” is for reading (popping) data in and out of the FIFO. A handful of flags indicate when the FIFO is full ( or almost ) or empty ( or almost ). The actual RAM is inferred.

`timescale 1 ns/ 100 ps
`default_nettype none // Strictly enforce all nets to be declared

module fifo #
(
  parameter depth_len  = 16,
  parameter depth_bits = 4,
  parameter width_bits = 8
)
(
  input  wire                  reset, 
  input  wire                  clk, 
  input  wire                  a_push_en,
  input  wire [width_bits-1:0] a_di,
  input  wire                  b_pop_en,
  output reg  [width_bits-1:0] b_do,
  output reg                   b_rdy,
  output reg                   flag_empty,
  output reg                   flag_almost_empty,
  output reg                   flag_almost_full,
  output reg                   flag_full
);


// ram_style : registers,distributed,block,ultra,mixed,auto
(* ram_style = "block" *) reg  [width_bits-1:0] ram_array[depth_len-1:0];
  reg  [depth_bits-1:0]   a_addr;
  reg  [depth_bits-1:0]   b_addr;
  reg  [depth_bits-1:0]   full_cnt;

fifo.v : Write Port : It’s just THIS simple. When a push comes in, the data is written to the RAM’s a_addr and the address is incremented by +1 ready for the next push. What happens when the address reaches the top? It just wraps around back to 0 – and that is okay. FIFOs are “circular” in their operation. The write and read pointers just go around and around in circles 0,1,2,3,…15,0,1,2,3..15 . Cool, huh?

//----------------------------------------------------------------
// Write Port of RAM
//----------------------------------------------------------------
always @( posedge clk or posedge reset )
begin
  if ( reset == 1 ) begin
    a_addr <= 0;
  end else begin
    if ( a_push_en == 1 ) begin
      ram_array[a_addr][width_bits-1:0] <= a_di[width_bits-1:0];
      a_addr                            <= a_addr[depth_bits-1:0]+1;
    end // if ( a_push_en == 1 )
  end
end // always

fifo.v : Read Port : It’s also just THIS simple. When a pop request comes in, the data is read from the RAM’s b_addr and the address is incremented by +1 ready for the next pop. It rolls around from max to min just like the write pointer.

//----------------------------------------------------------------
// Read Port of RAM
//----------------------------------------------------------------
always @( posedge clk or posedge reset )
begin
  if ( reset == 1 ) begin
    b_addr <= 0;
    b_do   <= 0;
    b_rdy  <= 0;
  end else begin
    b_rdy  <= 0;
    if ( b_pop_en == 1 ) begin
      b_rdy  <= 1;
      b_do   <= ram_array[b_addr];
      b_addr <= b_addr[depth_bits-1:0]+1;
    end // if ( b_pop_en == 1 )
  end
end // always

fifo.v : Full Counter : The full counter keeps track of how many things are in the FIFO. For a single-clock domain, it’s super simple. You either increment the count on a push, or decrement on a pop. You keep the count the same if there’s no push and no pop OR if there is a push AND a pop at the same time.

//----------------------------------------------------------------
// Full Counter
//----------------------------------------------------------------
always @( posedge clk or posedge reset )
begin
  if ( reset == 1 ) begin
    full_cnt   <= 0;
  end else begin
    if          ( a_push_en == 1 && b_pop_en == 0 ) begin
      full_cnt <= full_cnt + 1;
    end else if ( a_push_en == 0 && b_pop_en == 1 ) begin
      full_cnt <= full_cnt - 1;
    end else begin
      full_cnt <= full_cnt[depth_bits-1:0];
    end
  end
end // always

fifo.v : Flags : The flags look at the full counter and create status flags to indicate when the FIFO can accept more pushes or pops.

//----------------------------------------------------------------
// Flags
//----------------------------------------------------------------
always @( posedge clk )
begin
  flag_empty        <= 0;
  flag_full         <= 0;
  flag_almost_empty <= 0;
  flag_almost_full  <= 0;

  if ( full_cnt[depth_bits-1:0] == 0 ) begin
    flag_empty <= 1;
  end

  if ( full_cnt[depth_bits-1:0] == 1 ) begin
    flag_almost_empty <= 1;
  end

  if ( full_cnt[depth_bits-1:0] == depth_len-1) begin
    flag_full  <= 1;
  end

  if ( full_cnt[depth_bits-1:0] == depth_len-2) begin
    flag_almost_full <= 1;
  end
end // always


endmodule // fifo
`default_nettype wire // enable default for 3rd party IP needing it
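As a quick sanity check of the flag thresholds, here is a tiny Python sketch of the same comparisons. It is only a model of the decode ( the Verilog flags are registered, so they update one clock after full_cnt changes ), and the function name is made up for this example. Note that because full_cnt is only depth_bits ( 4 ) wide it can count 0 to 15, so flag_full is decoded at depth_len-1 = 15 entries.

```python
def fifo_flags(full_cnt, depth_len=16):
    """Decode the four status flags from the current fill count."""
    return {
        "empty": full_cnt == 0,
        "almost_empty": full_cnt == 1,
        "almost_full": full_cnt == depth_len - 2,
        "full": full_cnt == depth_len - 1,
    }

for count in (0, 1, 8, 14, 15):
    active = [name for name, on in fifo_flags(count).items() if on]
    print(f"full_cnt={count:2d} flags={active}")
```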

That’s it, that’s my simple single-clock FIFO. Notice I have no protection against over-pushing or over-popping. It’s up to the FIFO user to honor the flags. What if you can’t? Well then make the FIFO deeper. If even that doesn’t solve your rate conversion problem then you have a rate conversion problem that no FIFO can solve. The average input rate can’t exceed the average output rate. FIFOs are cool, but they can’t perform miracles.
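For readers who think better in software, here is a rough behavioral model of the same single-clock FIFO in Python. It is a sketch under assumptions, not a cycle-exact replacement for the Verilog: one call to clock() stands in for one rising clock edge, the class and method names are my own, and the flags and b_rdy are left out. Like the Verilog, it has no protection against over-pushing or over-popping.

```python
class FifoModel:
    """Behavioral sketch of the single-clock FIFO (hypothetical names)."""

    def __init__(self, depth_len=16):
        self.depth_len = depth_len
        self.ram = [0] * depth_len
        self.a_addr = 0    # write (push) pointer
        self.b_addr = 0    # read (pop) pointer
        self.full_cnt = 0  # number of entries currently stored

    def clock(self, push=False, di=0, pop=False):
        """One rising clock edge. Returns popped data, or None."""
        do = None
        if pop:
            # Read first so a same-cycle push cannot change the result.
            do = self.ram[self.b_addr]
            self.b_addr = (self.b_addr + 1) % self.depth_len
        if push:
            self.ram[self.a_addr] = di
            self.a_addr = (self.a_addr + 1) % self.depth_len
        # Count changes only when exactly one of push or pop is active.
        if push and not pop:
            self.full_cnt = (self.full_cnt + 1) % self.depth_len
        elif pop and not push:
            self.full_cnt = (self.full_cnt - 1) % self.depth_len
        return do


fifo = FifoModel()
popped = []

# Push 1..4 with one idle cycle, then push 5..8 while popping.
stimulus = [
    dict(push=True, di=1), dict(push=True, di=2), dict(),
    dict(push=True, di=3), dict(push=True, di=4),
    dict(push=True, di=5, pop=True), dict(push=True, di=6, pop=True),
    dict(push=True, di=7, pop=True), dict(push=True, di=8, pop=True),
    dict(pop=True), dict(),
    dict(pop=True), dict(pop=True), dict(pop=True), dict(),
]
for step in stimulus:
    data = fifo.clock(**step)
    if data is not None:
        popped.append(data)

print("popped :", popped)
print("left in FIFO :", fifo.full_cnt)
```

The stimulus list loosely mirrors a push 1 to 8, pop 8 sequence. The values come back out in the same order they went in, which is the whole point of a FIFO.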

In closing, I will demonstrate a simple simulation of this FIFO in action. It pushes 1,2,3,4,5,6,7,8 into the FIFO and pops those same values back out in order, with the first pops overlapping the last few pushes.

force_fifo.do :

force reset 1
force clk 0 5 ns, 1 10 ns -repeat 10 ns
force a_push_en 0
force b_pop_en 0
force a_di 16#0
run 10 ns
force reset 0
force a_push_en 1; force a_di 16#1; run 10 ns
force a_push_en 1; force a_di 16#2; run 10 ns
force a_push_en 0; run 10 ns;
force a_push_en 1; force a_di 16#3; run 10 ns
force a_push_en 1; force a_di 16#4; run 10 ns
force b_pop_en 1;
force a_push_en 1; force a_di 16#5; run 10 ns
force a_push_en 1; force a_di 16#6; run 10 ns
force a_push_en 1; force a_di 16#7; run 10 ns
force a_push_en 1; force a_di 16#8; run 10 ns
force a_push_en 0; force a_di 16#0; run 10 ns
force b_pop_en 0; run 10 ns;
force b_pop_en 1; run 30 ns;
force b_pop_en 0; run 10 ns;
run 10 ns;

The ModelSim simulation of fifo.v using this force file looks like this:

That’s it for Part-13, a simple introduction to FIFOs. I was hoping to get to generating and using an AMD/Xilinx FIFO primitive, but I think this right here above is enough for a chapter. Without getting too long, I think it’s a good introduction that both shows and demonstrates the innards of a simple single-clock FIFO. Again, dual-clock FIFOs are MUCH more complicated ( on the insides ), so don’t try to implement one yourself. It’s a solved problem which I will show how to use next time inside Vivado. Cheers.

EOF

http://blackmesalabs.wordpress.com/?p=2744
