VHDL module: Unicode (UTF-8) decoder and encoder

UTF-8 decoder and encoder modules that let your text-processing VHDL project operate on 21-bit Unicode code points instead of variable-length byte streams.

Description

In addition to the UTF-8 decoder and encoder, the project contains testbenches for the modules and a demo loopback FPGA implementation that sends text read from UART through the decoder and encoder and back to the computer.


What happens when your VHDL design meets an emoji or a non-English character?

If you’re going to make a professional VHDL design that operates on text, you will certainly run into this problem:

To represent every language and symbol, you need to support Unicode. UTF-8, the most widely used Unicode encoding, represents each character as a variable-length byte sequence.

It uses a single byte for ASCII characters, which cover the English alphabet, and two, three, or four bytes for everything else. Accented Latin letters take two bytes, most Chinese characters take three, and most emojis take four.

That’s great because it saves space, but it creates headaches for FPGA designers. How do you make a text-processing system that works on variable-length symbols?

VHDL UTF-8 decoder and encoder

The decoder and encoder modules solve this problem by converting between UTF-8 and fixed-length 21-bit words, which uniquely represent any Unicode character.
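
To make the relationship between code points and byte counts concrete, here is a minimal sketch of a helper function that returns how many UTF-8 bytes a given 21-bit code point occupies. The package and function names (utf8_len_pkg, utf8_byte_count) are hypothetical and not part of the project; the ranges follow the UTF-8 definition.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package for illustration only
package utf8_len_pkg is
  -- Returns the number of UTF-8 bytes needed for a code point
  function utf8_byte_count(
    codepoint : std_logic_vector(20 downto 0)
  ) return positive;
end package;

package body utf8_len_pkg is
  function utf8_byte_count(
    codepoint : std_logic_vector(20 downto 0)
  ) return positive is
    variable cp : natural;
  begin
    cp := to_integer(unsigned(codepoint));

    if cp <= 16#7F# then
      return 1; -- 0xxxxxxx (ASCII)
    elsif cp <= 16#7FF# then
      return 2; -- 110xxxxx 10xxxxxx
    elsif cp <= 16#FFFF# then
      return 3; -- 1110xxxx 10xxxxxx 10xxxxxx
    else
      -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      -- Note: values above U+10FFFF fit in 21 bits but are not
      -- valid Unicode; this sketch does not check for them.
      return 4;
    end if;
  end function;
end package body;
```

For example, U+0041 (A) needs one byte, U+00E9 (é) needs two, U+20AC (€) needs three, and U+1F600 (😀) needs four.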

The project contains the RTL code, two testbenches, a loopback demo FPGA implementation, and a PDF user guide.

And as usual, I’ve made a video usage guide to make it easy for you to understand how the modules work and how to use them.

This project is only available in the VHDLwhiz Membership.

The membership subscription gives you access to this and many other VHDL resources and courses.

You pay monthly to access the membership and can cancel the automatic renewal anytime. There is no lock-in period or hidden fees.

Entity of utf8_decoder VHDL module

(You get the complete VHDL module in the downloadable Zip)

library ieee;
use ieee.std_logic_1164.all;

entity utf8_decoder is
  port (
    clk : in std_logic;
    rst : in std_logic;
    
    -- The next byte in the UTF-8 text
    in_byte : in std_logic_vector(7 downto 0);
    in_valid : in std_logic;
    in_ready : out std_logic;
    
    -- Decoded code point pointing to a unique Unicode char
    out_codepoint : out std_logic_vector(20 downto 0);
    out_valid : out std_logic;
    out_ready : in std_logic;

    -- Flags assert if the prev byte caused a Unicode decode error
    err_unexp_lead_byte : out std_logic; -- Lead byte while expecting continuation
    err_unexp_cont_byte : out std_logic; -- Continuation byte while expecting lead
    err_utf16_surrogate : out std_logic; -- Invalid surrogate code point (U+D800-U+DFFF)
    err_invalid_byte_val : out std_logic -- Byte value that never appears in UTF-8
  );
end utf8_decoder;
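
As a rough illustration of how the decoder ports might be driven, here is a minimal testbench sketch that feeds the three UTF-8 bytes of the euro sign (E2 82 AC) into the module and checks for code point U+20AC. It is not the testbench from the Zip. It assumes that a byte is transferred on a rising clock edge where in_valid and in_ready are both '1', that rst is active-high, and that utf8_decoder is compiled into the work library. The testbench name and its internal signals are made up for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity decoder_sketch_tb is
end decoder_sketch_tb;

architecture sim of decoder_sketch_tb is
  signal clk : std_logic := '0';
  signal rst : std_logic := '1';
  signal in_byte : std_logic_vector(7 downto 0) := (others => '0');
  signal in_valid : std_logic := '0';
  signal in_ready : std_logic;
  signal out_codepoint : std_logic_vector(20 downto 0);
  signal out_valid : std_logic;
  signal done : boolean := false;
begin

  -- 100 MHz clock that stops once the check has completed
  clk <= not clk after 5 ns when not done else '0';

  DUT : entity work.utf8_decoder
    port map (
      clk => clk,
      rst => rst,
      in_byte => in_byte,
      in_valid => in_valid,
      in_ready => in_ready,
      out_codepoint => out_codepoint,
      out_valid => out_valid,
      out_ready => '1', -- Always ready to accept a code point
      err_unexp_lead_byte => open,
      err_unexp_cont_byte => open,
      err_utf16_surrogate => open,
      err_invalid_byte_val => open
    );

  STIMULUS : process
    type byte_array is array (natural range <>)
      of std_logic_vector(7 downto 0);
    -- UTF-8 encoding of U+20AC (euro sign)
    constant euro : byte_array(0 to 2) := (x"E2", x"82", x"AC");
  begin
    -- Assumption: active-high reset held for a few clock cycles
    wait for 20 ns;
    wait until rising_edge(clk);
    rst <= '0';

    for i in euro'range loop
      in_byte <= euro(i);
      in_valid <= '1';
      -- Hold the byte until the decoder accepts it
      wait until rising_edge(clk) and in_ready = '1';
    end loop;

    in_valid <= '0';
    wait;
  end process;

  CHECK : process
  begin
    -- Waits forever if the decoder never asserts out_valid
    wait until rising_edge(clk) and out_valid = '1';
    assert out_codepoint = std_logic_vector(to_unsigned(16#20AC#, 21))
      report "Expected code point U+20AC"
      severity error;
    done <= true;
    wait;
  end process;

end architecture;
```

The error flags are left open here to keep the sketch short. In a real design you would probably want to monitor them, since the entity comments say they flag decode errors caused by the previous byte.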

Entity of utf8_encoder VHDL module

library ieee;
use ieee.std_logic_1164.all;

entity utf8_encoder is
  port (
    clk : in std_logic;
    rst : in std_logic;

    -- Code point representing a unique Unicode char
    in_codepoint : in std_logic_vector(20 downto 0);
    in_valid : in std_logic;
    in_ready : out std_logic;
    
    -- The code point translated into 1-4 UTF-8 bytes
    out_byte : out std_logic_vector(7 downto 0);
    out_valid : out std_logic;
    out_ready : in std_logic
  );
end utf8_encoder;
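
Because the decoder's output ports mirror the encoder's input ports, the two modules can be chained back to back. The following is a minimal sketch of such a wrapper, not the top.vhd from the loopback demo. The entity name utf8_chain and the internal signal names are hypothetical, and it assumes both modules are compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: UTF-8 bytes in, UTF-8 bytes out
entity utf8_chain is
  port (
    clk : in std_logic;
    rst : in std_logic;

    in_byte : in std_logic_vector(7 downto 0);
    in_valid : in std_logic;
    in_ready : out std_logic;

    out_byte : out std_logic_vector(7 downto 0);
    out_valid : out std_logic;
    out_ready : in std_logic
  );
end utf8_chain;

architecture rtl of utf8_chain is
  -- Code point stream between the decoder and the encoder
  signal codepoint : std_logic_vector(20 downto 0);
  signal codepoint_valid : std_logic;
  signal codepoint_ready : std_logic;
begin

  DECODER : entity work.utf8_decoder
    port map (
      clk => clk,
      rst => rst,
      in_byte => in_byte,
      in_valid => in_valid,
      in_ready => in_ready,
      out_codepoint => codepoint,
      out_valid => codepoint_valid,
      out_ready => codepoint_ready,
      -- Error flags are ignored in this sketch
      err_unexp_lead_byte => open,
      err_unexp_cont_byte => open,
      err_utf16_surrogate => open,
      err_invalid_byte_val => open
    );

  ENCODER : entity work.utf8_encoder
    port map (
      clk => clk,
      rst => rst,
      in_codepoint => codepoint,
      in_valid => codepoint_valid,
      in_ready => codepoint_ready,
      out_byte => out_byte,
      out_valid => out_valid,
      out_ready => out_ready
    );

end architecture;
```

The three codepoint signals are the place where a text-processing module operating on fixed-length code points would be inserted.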

Zip content

Here’s the list of files included in the project (plus the usage guide video):

unicode/
├── LICENSE.txt
├── VHDLwhiz Unicode - User Manual.pdf
├── decoder
│   ├── run.do
│   ├── utf8_decoder.vhd
│   ├── utf8_decoder_tb.vhd
│   └── wave.do
├── encoder
│   ├── input_codepoints.txt
│   ├── output_UTF-8.txt
│   ├── run.do
│   ├── utf8_encoder.vhd
│   ├── utf8_encoder_tb.vhd
│   └── wave.do
├── loopback_demo
│   ├── Cmod-A7.xdc
│   ├── axi_fifo.vhd
│   ├── reset_sync.vhd
│   ├── top.vhd
│   ├── uart_buffered.vhd
│   ├── uart_rx.vhd
│   └── uart_tx.vhd
├── questa_proj
│   └── unicode.mpf
├── txt_files
│   ├── ASCII-128-chars.txt
│   ├── UTF-8-demo.txt
│   └── UTF-8-test.txt
├── vivado_proj
│   └── create_vivado_proj.tcl
└── vw_print_pkg.vhd

How to get started

Join the VHDLwhiz Membership now to play with the new project and see if you can create a simple text-processing module to insert between the decoder and encoder.

Hint: Start with a module that converts any character, including non-ASCII ones, to uppercase. It can be purely combinational, so you don’t need to worry about flow control.
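
As a starting point for the hint above, here is a minimal sketch of a purely combinational uppercase converter working on 21-bit code points. The entity name to_upper is hypothetical, and the sketch only covers three ranges where the uppercase letter sits exactly 0x20 below the lowercase one: ASCII a-z, Latin-1 à-þ (skipping the division sign U+00F7), and basic Cyrillic а-я. Full Unicode case mapping has many more ranges and special cases.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example module, not part of the project
entity to_upper is
  port (
    in_codepoint : in std_logic_vector(20 downto 0);
    out_codepoint : out std_logic_vector(20 downto 0)
  );
end to_upper;

architecture rtl of to_upper is
begin

  process(in_codepoint)
    variable cp : unsigned(20 downto 0);
  begin
    cp := unsigned(in_codepoint);

    if (cp >= 16#61# and cp <= 16#7A#) -- ASCII a-z
      or (cp >= 16#E0# and cp <= 16#FE# and cp /= 16#F7#) -- Latin-1
      or (cp >= 16#430# and cp <= 16#44F#) -- Cyrillic а-я
    then
      -- In these ranges, uppercase = lowercase - 0x20
      cp := cp - 16#20#;
    end if;

    out_codepoint <= std_logic_vector(cp);
  end process;

end architecture;
```

Since the module has no registers, the valid and ready signals can be wired straight from the decoder to the encoder, with only the code point passing through the converter.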
