8 ways to create a shift register in VHDL

There are many ways to create a shift register in VHDL, though not all of them are equal. You can dramatically reduce the number of consumed resources by choosing the right shift register implementation for your needs and FPGA architecture.

A shift register implements a FIFO of fixed length. Every time a new element enters the queue, it shifts the existing ones one place further away from the input. To understand the basics of the shift register, I recommend viewing the VHDL tutorial about the std_logic_vector.

This article will only consider the shift register, even though there exist data structures that use fewer resources for larger FIFOs. Read about how to create a ring buffer FIFO in block RAM to learn about such a solution that’s not a shift register.

While any shift register is suitable for creating generic, smaller buffers, there are methods of efficiently creating larger ones. Many FPGAs have logic elements that can double as specialized shift register primitives. You can reduce resource usage by orders of magnitude by being mindful of how you write your VHDL code.

The 1-bit shift register with generic depth

Let’s first have a look at different methods of creating a one-bit shift register. The input and output of this shift register is a single bit, a std_logic value. The depth is configurable through a generic constant.

We will use the entity declaration shown below for all of the following examples involving one-bit shift registers. To keep it simple, we’re going to use the same entity for multiple architectures, even though the rst and enable inputs are unused by some of them.

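library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all; -- Needed by the architectures that use shift_left

entity shift_reg_1_width is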
  generic (
    sr_depth : integer
  );
  port (
    clk : in std_logic;
 
    rst : in std_logic; -- Optional
    enable : in std_logic; -- Optional
 
    sr_in : in std_logic;
    sr_out : out std_logic
  );
end;

The three implementations that follow will synthesize into the same logic. These are the resources consumed for Xilinx, Intel (Altera), and Lattice FPGAs with the sr_depth generic set to 128.

Xilinx Vivado (Zynq)	4 LUTs (LUTRAM) + 2 FFs
Intel Quartus II (Cyclone V)	11 ALMs + 1 BRAM
Lattice iCEcube2 (iCE40)	128 FFs

Even though the shift register should require 128 flip-flops (FFs), we see that the resource usage reported by Vivado and Quartus is far less. Instead of using expensive FFs, the synthesis tools have used special built-in features of the logic blocks.

Lookup tables (LUTs) used in such a way are counted as “LUTRAM” in the Vivado resource usage report. In the Intel FPGA, adaptive logic modules (ALMs) and one block RAM (BRAM) are used instead of flip-flops. Intel uses a technology that they call ALTSHIFT_TAPS for implementing RAM-based shift registers.

The Lattice iCE40 FPGA, which I used in this example, doesn’t have an alternative option for packing a 1-bit shift register, so it’s implemented entirely in 128 FFs. However, we shall see later in this article that the Lattice device can pack wider shift registers into block RAM.

Slicing the vector

The most straightforward way to create a shift register is to use vector slicing. Insert the new element at one end of the vector, while simultaneously shifting all of the others one place closer to the output side. Put the code in a clocked process and tap the last bit in the vector, and you have your shift register.

architecture slicing of shift_reg_1_width is
 
  signal sr : std_logic_vector(sr_depth - 2 downto 0);
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
 
      sr <= sr(sr'high - 1 downto sr'low) & sr_in;
      sr_out <= sr(sr'high);
 
    end if;
  end process;
 
end architecture slicing;
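
To check that the depth generic behaves as expected, you can send a single pulse through the shift register and count the clock edges until it comes out. The testbench below is a minimal sketch of that idea, not part of the original project. The testbench name, the clock period, and the depth of 8 are my assumptions, and it presumes that the entity and the slicing architecture above are compiled into the work library with a context clause that makes std_logic visible.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench name, not part of the original project
entity shift_reg_1_width_tb is
end entity;

architecture sim of shift_reg_1_width_tb is

  -- Assumed values, chosen only to keep the simulation short
  constant clk_period : time := 10 ns;
  constant depth : integer := 8;

  signal clk : std_logic := '0';
  signal sr_in : std_logic := '0';
  signal sr_out : std_logic;
  signal done : boolean := false;

begin

  -- Free-running clock that stops when the test is done
  clk_proc : process
  begin
    while not done loop
      clk <= '0';
      wait for clk_period / 2;
      clk <= '1';
      wait for clk_period / 2;
    end loop;
    wait;
  end process;

  dut : entity work.shift_reg_1_width(slicing)
    generic map (sr_depth => depth)
    port map (
      clk => clk,
      rst => '0',    -- Unused by this architecture
      enable => '1', -- Unused by this architecture
      sr_in => sr_in,
      sr_out => sr_out
    );

  stim : process
  begin
    -- Shift in zeros to flush out the uninitialized ('U') content
    for i in 1 to depth loop
      wait until rising_edge(clk);
    end loop;

    -- Present a '1' for exactly one rising clock edge
    sr_in <= '1';
    wait until rising_edge(clk);
    sr_in <= '0';

    -- The pulse needs depth - 1 more edges to reach the output register
    for i in 1 to depth - 1 loop
      wait until falling_edge(clk);
      assert sr_out = '0' report "Pulse arrived too early" severity failure;
      wait until rising_edge(clk);
    end loop;

    wait until falling_edge(clk);
    assert sr_out = '1' report "Pulse did not arrive on time" severity failure;

    wait until falling_edge(clk);
    assert sr_out = '0' report "Pulse lasted too long" severity failure;

    report "Test passed";
    done <= true;
    wait;
  end process;

end architecture;
```

The checks happen on the falling edge so that the output has settled after each rising edge. Swapping slicing for one of the other architecture names in the instantiation should let you reuse the same testbench for them.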

Using a for-loop

Just like vector slicing, the for-loop also works in all revisions of the VHDL language. This approach requires an additional line of code to assign the input to the vector. Remember that every iteration of the for-loop is executed in zero time in RTL code because there’s no wait-statement inside of it. Therefore, this code is logically equivalent to the previous example.

architecture for_loop of shift_reg_1_width is
 
  signal sr : std_logic_vector(sr_depth - 2 downto 0);
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
 
      for i in sr'high downto sr'low + 1 loop
        sr(i) <= sr(i - 1);
      end loop;
 
      sr(sr'low) <= sr_in;
      sr_out <= sr(sr'high);
 
    end if;
  end process;
 
end architecture for_loop;

Using the shift_left function

While the previous examples work for vectors as well as arrays of any kind, the shift_left function only works with bit vectors. The shift_left function and its complementary shift_right function are defined in the ieee.numeric_std package. It requires an unsigned or signed vector as the first parameter, and that’s why it’s not suitable for arrays of arbitrary data types.

architecture ieee_shift_left of shift_reg_1_width is
 
  signal sr : unsigned(sr_depth - 2 downto 0);
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
 
      sr <= shift_left(sr, 1);
      sr(sr'low) <= sr_in;
      sr_out <= sr(sr'high);
 
    end if;
  end process;
 
end architecture ieee_shift_left;
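
The complementary shift_right function can be used in the same way if you prefer to insert at the high end of the vector and tap the low end. The code below is a minimal sketch of that mirrored variant. The entity name and the default depth are hypothetical, and I haven’t listed any resource figures for it, so check the utilization report to see if your synthesis tool packs it like the shift_left version.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity name, used only for this sketch
entity shift_reg_right_sketch is
  generic (
    sr_depth : integer := 128
  );
  port (
    clk : in std_logic;
    sr_in : in std_logic;
    sr_out : out std_logic
  );
end entity;

architecture ieee_shift_right of shift_reg_right_sketch is

  signal sr : unsigned(sr_depth - 2 downto 0);

begin

  process(clk)
  begin
    if rising_edge(clk) then

      -- Move every bit one place towards index 0
      sr <= shift_right(sr, 1);

      -- The last assignment wins for this one bit: insert at the high end
      sr(sr'high) <= sr_in;

      -- The oldest bit sits at the low end
      sr_out <= sr(sr'low);

    end if;
  end process;

end architecture;
```

The total delay from sr_in to sr_out is still sr_depth clock cycles because the output register counts as the last stage, just like in the examples above.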

Enable input

Figure: Xilinx FDCE flip-flop primitive

Most FPGA architectures have flip-flops with an optional enable (E) or clock enable (CE) input. When a flip-flop is part of the shift register, no other logic can make use of its enable input, so it would otherwise sit idle. Thus, the additional enable input won’t consume extra resources.

Wrap the code that’s responsible for shifting with an if enable = '1' then statement. Then, input and output from the shift register will still occur on the rising edge of the clock, but only when the enable input is asserted.

The code below shows the previous example with the enable input added to the implementation.

architecture with_enable of shift_reg_1_width is
 
  signal sr : unsigned(sr_depth - 2 downto 0);
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
 
      if enable = '1' then
        sr <= shift_left(sr, 1);
        sr(sr'low) <= sr_in;
        sr_out <= sr(sr'high);
      end if;
 
    end if;
  end process;
 
end architecture with_enable;

The drawback of having a reset value

You should be cautious about adding reset values to the shift register vector or output. The problem is that it prevents the synthesis tool from packing the shift register into LUTs or BRAM. Consider the example below, which is the same as the first one in this article, but with synchronous reset added.

architecture slicing_with_rst of shift_reg_1_width is
 
  signal sr : std_logic_vector(sr_depth - 2 downto 0);
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        sr <= (others => '0');
        sr_out <= '0';
 
      else
        sr <= sr(sr'high - 1 downto sr'low) & sr_in;
        sr_out <= sr(sr'high);
 
      end if;
    end if;
  end process;
 
end architecture slicing_with_rst;

Now, the resource usage for both Xilinx and Intel FPGAs has jumped to 128 FFs. For Intel FPGAs, it makes the most sense to measure resource usage in ALMs, but the 33 ALMs listed below contain the same number of FFs as the other FPGAs use.

Xilinx Vivado (Zynq)	128 FFs
Intel Quartus II (Cyclone V)	33 ALMs
Lattice iCEcube2 (iCE40)	128 FFs

The synchronous reset has forced the synthesis tool to implement the shift register entirely in FFs. Therefore, you should ask yourself if you need to be able to reset the entire shift register at once.

Using a counter to reset the output

After reset, everything that’s in the shift register is invalidated. Usually, you don’t care what the invalid data is; the purpose of the reset is to avoid passing it on to downstream modules. An alternative to resetting the entire shift register is to monitor where the valid data starts inside of it.

While invalid data is at the output, you forward your reset value. Then, when valid data reaches the output, you start sampling the real output from the shift register. As long as you don’t tap the shift register other than at the output, the behavior will be indistinguishable from true synchronous reset.

The code below uses a counter signal to implement synchronous reset of the shift register output.

architecture rst_counter of shift_reg_1_width is
 
  signal sr : std_logic_vector(sr_depth - 2 downto 0);
  signal rst_counter : integer range 0 to sr_depth - 1;
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        rst_counter <= 0;
        sr_out <= '0';
 
      else
 
        sr <= sr(sr'high - 1 downto sr'low) & sr_in;
 
        if rst_counter = sr_depth - 1 then
          sr_out <= sr(sr'high);
        else
          rst_counter <= rst_counter + 1;
          sr_out <= '0';
        end if;
 
      end if;
    end if;
  end process;
 
end architecture rst_counter;

As we can see from the resource usage below, Xilinx has gone back to using LUTRAM and Intel to ALMs and a single BRAM primitive. The implementation uses a bit more logic for the new counter, but overall the saving is substantial. The exception is the Lattice device, which consumes more than before because it doesn’t have primitives for optimizing this kind of shift register.

Xilinx Vivado (Zynq)	12 LUTs (4 LUTRAM) + 9 FFs
Intel Quartus II (Cyclone V)	17 ALMs + 1 BRAM
Lattice iCEcube2 (iCE40)	18 LUTs + 135 FFs
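
If you want to convince yourself that the counter-based reset behaves like the full synchronous reset at the output, you can run both architectures side by side in a simulator and compare them on every clock cycle. The testbench below is a minimal sketch of such a comparison. The testbench name, the depth of 8, and the stimulus pattern are my assumptions, and it presumes that the slicing_with_rst and rst_counter architectures above are compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench name, not part of the original project
entity rst_equivalence_tb is
end entity;

architecture sim of rst_equivalence_tb is

  -- Assumed values, chosen only to keep the simulation short
  constant clk_period : time := 10 ns;
  constant depth : integer := 8;

  signal clk : std_logic := '0';
  signal rst : std_logic := '1';
  signal sr_in : std_logic := '0';
  signal out_full_rst : std_logic;
  signal out_counter : std_logic;
  signal done : boolean := false;

begin

  -- Free-running clock that stops when the test is done
  clk_proc : process
  begin
    while not done loop
      clk <= '0';
      wait for clk_period / 2;
      clk <= '1';
      wait for clk_period / 2;
    end loop;
    wait;
  end process;

  dut_full_rst : entity work.shift_reg_1_width(slicing_with_rst)
    generic map (sr_depth => depth)
    port map (clk => clk, rst => rst, enable => '1',
      sr_in => sr_in, sr_out => out_full_rst);

  dut_counter : entity work.shift_reg_1_width(rst_counter)
    generic map (sr_depth => depth)
    port map (clk => clk, rst => rst, enable => '1',
      sr_in => sr_in, sr_out => out_counter);

  stim : process
  begin
    -- rst starts at '1', so the first rising edge resets both DUTs
    wait until rising_edge(clk);
    wait until falling_edge(clk);

    for i in 0 to 4 * depth loop

      -- Arbitrary bit pattern used as input data
      if (i mod 3 = 0) or (i mod 5 = 0) then
        sr_in <= '1';
      else
        sr_in <= '0';
      end if;

      -- Reset once more in the middle of the data stream
      if i = 2 * depth then
        rst <= '1';
      else
        rst <= '0';
      end if;

      -- Let the DUTs sample the inputs, then compare the settled outputs
      wait until rising_edge(clk);
      wait until falling_edge(clk);
      assert out_full_rst = out_counter
        report "Outputs differ at iteration " & integer'image(i)
        severity failure;

    end loop;

    report "Test passed";
    done <= true;
    wait;
  end process;

end architecture;
```

The comparison only starts after the first reset because the two architectures aren’t expected to agree on the uninitialized values before that.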

Shift register with generic depth and width

Let’s proceed to look at an implementation of a shift register with configurable data width as well. The code below shows the entity used in the following examples. It has two generic inputs, one for setting the depth and one for the width of the data elements.

library ieee;
use ieee.std_logic_1164.all;

entity shift_reg_generic_width is
  generic (
    sr_depth : integer;
    sr_width : integer
  );
  port (
    clk : in std_logic;
    rst : in std_logic; -- Optional
 
    sr_in : in std_logic_vector(sr_width - 1 downto 0);
    sr_out : out std_logic_vector(sr_width - 1 downto 0)
  );
end;

In the examples below, we will assign 128 to the sr_depth generic and 16 to the sr_width generic.

Without reset

In the code below, we have converted the slicing example without reset to have configurable width as well as depth. We synthesize it with an input width of 16, meaning that it can store 16 times as many bits as the first example in this article. Let’s see if the resource usage is multiplied by 16 too.

architecture slicing of shift_reg_generic_width is
 
  type sr_type is array (sr_depth - 2 downto 0)
    of std_logic_vector(sr_width - 1 downto 0);
  signal sr : sr_type;
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
 
      sr <= sr(sr'high - 1 downto sr'low) & sr_in;
      sr_out <= sr(sr'high);
 
    end if;
  end process;
 
end architecture slicing;
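
To synthesize the module with a depth of 128 and a width of 16, the generics have to be assigned somewhere. One way to do that is with a thin wrapper like the one sketched below. The wrapper name is hypothetical, and it assumes that the entity and the slicing architecture above are compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top-level wrapper, used only for this sketch
entity shift_reg_128x16_top is
  port (
    clk : in std_logic;
    sr_in : in std_logic_vector(15 downto 0);
    sr_out : out std_logic_vector(15 downto 0)
  );
end entity;

architecture str of shift_reg_128x16_top is
begin

  sr_inst : entity work.shift_reg_generic_width(slicing)
    generic map (
      sr_depth => 128,
      sr_width => 16
    )
    port map (
      clk => clk,
      rst => '0', -- Unused by the slicing architecture
      sr_in => sr_in,
      sr_out => sr_out
    );

end architecture;
```

Most synthesis tools also let you override top-level generics from the project settings or the command line, in which case you wouldn’t need the wrapper.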

The resource utilization is listed below. Xilinx has gone from 4 to 64 LUTs and from 2 to 32 FFs, exactly 16 times the resources. Intel, on the other hand, has risen from 11 ALMs to 20 ALMs, still using one BRAM. The increase is modest because the BRAM can accommodate far more data than we are asking for in the first place. It just needs a bit more control logic.

Xilinx Vivado (Zynq)	64 LUTs (LUTRAM) + 32 FFs
Intel Quartus II (Cyclone V)	20 ALMs + 1 BRAM
Lattice iCEcube2 (iCE40)	18 LUTs + 1 BRAM + 25 FFs

Finally, the Lattice FPGA has managed to pack the wider shift register into block RAM. Its resource usage is now on par with Xilinx and Intel.

With a smart reset counter

The final example in this article is a shift register with generic width and depth, using synchronous reset. The code below shows the implementation which uses the reset counter that we discussed earlier in this article.

architecture counter_rst of shift_reg_generic_width is
 
  type sr_type is array (sr_depth - 2 downto 0)
    of std_logic_vector(sr_width - 1 downto 0);
  signal sr : sr_type;
 
  signal rst_counter : integer range 0 to sr_depth - 1;
 
begin
 
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        rst_counter <= 0;
        sr_out <= (others => '0');
 
      else
 
        sr <= sr(sr'high - 1 downto sr'low) & sr_in;
 
        if rst_counter = sr_depth - 1 then
          sr_out <= sr(sr'high);
        else
          rst_counter <= rst_counter + 1;
          sr_out <= (others => '0');
        end if;
 
      end if;
    end if;
  end process;
 
end architecture counter_rst;

We can see from the listing below that the Xilinx FPGA needs eight additional regular LUTs and seven FFs for implementing the counter reset. Intel Quartus II still somehow reports the same resource usage as without reset. The Lattice FPGA consumes 24 more LUTs and 31 more FFs for implementing the counter, but the shift register still fits in one BRAM.

Xilinx Vivado (Zynq)	72 LUTs (64 LUTRAM) + 39 FFs
Intel Quartus II (Cyclone V)	20 ALMs + 1 BRAM
Lattice iCEcube2 (iCE40)	42 LUTs + 1 BRAM + 56 FFs


Controlling the RAM usage with synthesis attributes

The synthesis tools will try to use the FPGA primitives that make the most efficient use of the device. The cheapest resource is BRAM, then comes distributed RAM or LUTRAM, and finally FFs, which are the most valuable.

But you can override the automatic choice by using a synthesis attribute, also known as a pragma or compiler directive. The different FPGA vendors have their own sets of VHDL attributes. To specify a desired primitive type, you define the attribute in the architecture region of the VHDL file, referencing your shift register array or vector by name.

Xilinx Vivado

The Xilinx UG901 user guide lists all synthesis attributes that are recognized by Vivado. The shreg_extract attribute and the srl_style attribute are the ones that control shift register synthesis.

Setting the shreg_extract attribute to "no" disables all shift register optimization. This setting acts like a master switch, overriding other SRL synthesis settings. You can also assign "yes" to shreg_extract, but this is the default setting anyway.

attribute shreg_extract : string;
-- attribute shreg_extract of sr : signal is "yes";
attribute shreg_extract of sr : signal is "no";

Remember to replace sr with the name of your shift register signal.
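
To show where these lines go, here is a minimal sketch of a complete design unit with the attribute placed in the declarative region of the architecture, after the signal that it refers to. The entity name and the default depth are hypothetical. I would expect Vivado to build this version from flip-flops only, but check the utilization report to confirm what your tool version does.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity name, used only for this sketch
entity shift_reg_attr_sketch is
  generic (
    sr_depth : integer := 128
  );
  port (
    clk : in std_logic;
    sr_in : in std_logic;
    sr_out : out std_logic
  );
end entity;

architecture no_srl of shift_reg_attr_sketch is

  signal sr : std_logic_vector(sr_depth - 2 downto 0);

  -- Declare the attribute once, then attach it to the signal by name
  attribute shreg_extract : string;
  attribute shreg_extract of sr : signal is "no";

begin

  process(clk)
  begin
    if rising_edge(clk) then

      sr <= sr(sr'high - 1 downto sr'low) & sr_in;
      sr_out <= sr(sr'high);

    end if;
  end process;

end architecture;
```

The attribute specification has to come after the declaration of sr because VHDL requires a name to be declared before it’s referenced.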

The next attribute of interest is srl_style. It’s a request to the synthesis tool to implement the shift register in a specific type of primitive. Note that this is not a magic pill. For example, you can’t force the synthesis tool to implement the shift register in block RAM when you insist on having reset values. It’s not possible.

attribute srl_style : string;
--attribute srl_style of sr : signal is "register";
--attribute srl_style of sr : signal is "srl";
--attribute srl_style of sr : signal is "srl_reg";
--attribute srl_style of sr : signal is "reg_srl";
--attribute srl_style of sr : signal is "reg_srl_reg";
attribute srl_style of sr : signal is "block";

The possible values are:

`register`	Only use registers (aka flip-flops)
`srl`	Use only SRL structures
`srl_reg`	Use an SRL structure with one trailing register
`reg_srl`	Use an SRL structure with one preceding register
`reg_srl_reg`	SRL structure with preceding and trailing registers
`block`	Use block RAM

Intel Quartus II

Shift register optimization in Intel FPGAs can be turned on or off by setting one of the following synthesis attributes. The default value is auto.

attribute altera_attribute : string;
-- attribute altera_attribute of sr :
--   signal is "-name AUTO_SHIFT_REGISTER_RECOGNITION always";
-- attribute altera_attribute of sr :
--   signal is "-name AUTO_SHIFT_REGISTER_RECOGNITION off";
attribute altera_attribute of sr :
  signal is "-name AUTO_SHIFT_REGISTER_RECOGNITION auto";

Refer to the Quartus Prime Settings File Reference Manual for a more detailed explanation of this and other synthesis attributes.

Lattice iCEcube2 / Synplify Pro

Lattice iCEcube2 uses Synopsys Synplify Pro as its synthesis engine. Therefore, these attributes will also work with other vendors’ tools that use Synplify Pro.

attribute syn_srlstyle : string;
-- attribute syn_srlstyle of sr : signal is "registers";
-- attribute syn_srlstyle of sr : signal is "distributed";
attribute syn_srlstyle of sr : signal is "block_ram";

The attribute lets you select between block RAM, registers, and distributed RAM (if available on the chip).

Final remarks

Shift registers are convenient for implementing small FIFOs, among other things. However, when creating large FIFOs, you should consider using a different data structure like a ring buffer or an AXI FIFO, which is more suitable for block RAM.

I should add that it’s difficult to compare resource usage between vendors. The architecture is different, LUTs have different sizes, and the reports from the place and route (PAR) tools come in various formats.

You need to take into consideration the target architecture when creating a shift register. That’s what you should take away from this article, not that one FPGA is better than the other.

Jonas Julian Jensen

I’m from Norway, but I live in Bangkok, Thailand. Before I started VHDLwhiz, I worked as an FPGA engineer in the defense industry. I earned my master’s degree in informatics at the University of Oslo.


14 Comments

Franki Lee says:

May 7, 2021 at 12:29

Hello.
You discussed only logic type of data. If an input vector is integer data type, we can’t use the same operations, I believe. Can I shift an integer vector ( without using a conversion into logic vector)?

Thank you in advance!

FL

1. Jonas Julian Jensen says:
  
  May 7, 2021 at 14:24
  You can shift an array of integers in a similar fashion. But instead of referencing the signal itself to get the 'high and 'low indexes, you must use the type here.
  
  Here’s the vector slicing example modified to use an integer array instead of SLV:
```
  signal arr_in : integer := 42;
  signal arr_out : integer;

  type arr_type is array (0 to 7) of integer;
  signal arr : arr_type := (others => 0);

begin

  process(clk)
  begin
    if rising_edge(clk) then
  
      arr <= arr(arr_type'low + 1 to arr_type'high) & arr_in;
      arr_out <= arr(arr'low);
  
    end if;
  end process;
```
  1. Fl says:
    
    May 10, 2021 at 06:49
    
    in your example, you have used one input ( arr_in =42), it means the arr_out is one value (=42), isnt?
    
    1. Jonas Julian Jensen says:
      
      May 10, 2021 at 07:09
      
      Yes. 42 value is just a random constant value that I picked. You should assign to arr_in whatever you want. That’s your input to the shift register.
      
      1. Fl says:
        
        May 10, 2021 at 09:18
        
        why have you taken the low bound (“arr_out <= arr(arr'low);")?
      2. FL says:
        
        May 10, 2021 at 09:45
        
        I have tested your suggestion:
        
        entity main_integer is
          Port (
            data_in : in integer;
            data_out : out integer
          );
        end main_integer;

        architecture Behavioral of main_integer is
          type arr_type is array (0 to 11) of integer;
          signal arr : arr_type := (others => 0);
        begin
          arr <= arr(arr_type'low + 1 to arr_type'high) & data_in;
          data_out <= arr(arr'low);
        end Behavioral;
        
        and create a test bench. I have gotten the input data as output.
      3. Jonas Julian Jensen says:
        
        May 10, 2021 at 17:00
        
        That’s great! Seems like you have it under control then.
        
        And to answer your questions: We are assigning the lowest index arr(arr'low) to the output because that’s the oldest element in the shift register. In this example, we assign new values to the high side.
Starlrd says:

October 6, 2021 at 06:58

Shift register using jk flip-flop
Vdhl code

1. Jonas Julian Jensen says:
  
  October 6, 2021 at 08:26
  
  Adding reset to your shift-register process will force the synthesizer tool to implement it using FFs. See the “THE DRAWBACK OF HAVING A RESET VALUE” section in the article above.
  
  1. Starlrd says:
    
    October 7, 2021 at 08:07
    
    LIBRARY ieee;
    USE ieee.std_logic_1164.all;
    Entity reg8 IS
    PORT( J :IN STD_LOGIC_VECTOR(7 DOWNTO 0);
    Resetn, Clock :IN STD_LOGIC;
    K :IN STD_LOGIC_VECTOR(7 DOWNTO 0);
    Q :OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
    END reg8;
    ARCHITECTURE behavioral OF reg8 IS
    BEGIN
    PROCESS ( Resetn, Clock)
    BEGIN
    IF Resetn = '0' THEN
    Q<="00000000";
    ELSEIF rising_edge(Clock) THEN
    Q<=J;
    END IF;
    END PROCESS;
    END behavioral;
    Is it correct
    
Elsa says:

April 20, 2022 at 21:35
If I don’t do any change on the input data and just let the input go out through the output,
I can use your example;
```
s16_register <= s16_register(14 downto 0) &input;
 output <= s16_register (15);
```
I was wondering how you implement the modification of the data in a register which data matches a certain value. then send this changed data out with the rest input through output .I tried this way,
```
if rising_edge (clk)
    if (s16_register = x"0088")then
            s16_register <=  s16_register + x"0100";
    else
            s16_register <= s16_register(14 downto 0) & input;
    end if;
end if;
 output <= s16_register (15);
```
The simulation shows, the code just ignores my modification. Any suggestions? Thank you in advance.
1. Jonas Julian Jensen says:
  
  April 22, 2022 at 04:51
  
  Hello, Elsa. It looks like that should work, but I don’t know the context of this code, so perhaps the error is elsewhere.
  
  Try to set a breakpoint in your VHDL simulator and see if line 3 of your last example is indeed hit.
  
Salvatore Cardamone says:

October 2, 2022 at 21:55

Thanks for the great content! Was just curious about the `rst_counter` architecture of the `shift_reg_1_width` entity. You have the ranged integer `rst_counter` that’s simply incremented in the process without any explicit resetting to zero; is there anything in the VHDL standard that guarantees a ranged integer will wrap?

When I attempt to run this with GHDL, for example, I get a bound-check failure at this line during simulation. I was under the impression that ranged integers were used to bound things like generics and make sure the user didn’t try to use a block improperly.

Many thanks!

1. Jonas Julian Jensen says:
  
  December 3, 2022 at 04:48
  Hello, Salvatore. Thanks for commenting!
  
  I’m unsure of why you are getting the bound-check failure in GHDL. It’s a pretty common design pattern to size integers based on generic constant values.
  
  Integers do not wrap in VHDL. That will be a run-time error in simulation and undefined behavior in synthesis. Although they may wrap if you are lucky with the implementation. But you should definitely not rely on it.
  
  You have to make sure never to assign out-of-bounds values to VHDL integers as I did in these code lines:
```
if rst_counter = sr_depth - 1 then -- If at the upper bound
   sr_out <= sr(sr'high);
else
   rst_counter <= rst_counter + 1;
   sr_out <= (others => '0');
end if;
```