AES RTL Implementation - Nathan Nguyen

Overview

This project provides a fully functional, synthesizable AES-128 encryption/decryption core written in SystemVerilog. The implementation follows the FIPS-197 standard.

The algorithm consists of:

Initial round: AddRoundKey only
Rounds 1-9: SubBytes → ShiftRows → MixColumns → AddRoundKey
Final round: SubBytes → ShiftRows → AddRoundKey

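Each round uses a distinct round key that the key scheduler derives from the cipher key. The cipher itself is deterministic: identical plaintext blocks encrypted under the same key pass through identical intermediate states and produce the same ciphertext.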

Quick Start

Prerequisites

ModelSim/QuestaSim
Python 3.7+
PyCryptodome

pip install pycryptodome

Running Tests

Generate Test Vectors

python generate_vectors.py

Run Test Suite

chmod +x run_tests.sh
./run_tests.sh

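Run Individual Tests

The commands below use PowerShell syntax to remove the old work library. In a POSIX shell, replace rm work -Recurse with rm -rf work.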

Sbox/Inverse Sbox Test

rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_aes_sbox; run -all; quit"
rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_inv_aes_sbox; run -all; quit"

MixColumns/Inverse MixColumns test

rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_aes_mixw; run -all; quit"
rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_inv_aes_mixw; run -all; quit"

Key scheduling/Inverse KS test

rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_aes_key_scheduling; run -all; quit"
rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_inv_aes_key_scheduling; run -all; quit"

Full AES/Inverse AES test

rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_aes; run -all; quit"
rm work -Recurse; vsim -c -do "vlib work; vlog -sv *.sv; vsim work.tb_inv_aes; run -all; quit"

Architecture Details

State Machine

My implementation uses a 4-bit finite state machine whose state value doubles as the round counter, so it maps directly onto the AES round structure:

State	Value	Description
IDLE	4'h0	Waiting for input data
Round 1-9	4'h1-4'h9	Middle rounds (full transformations)
FINAL	4'hA	Final round (no MixColumns)
DONE	4'hB	Output result

localparam round_t IDLE = 4'h0;   // Waiting for input data
localparam round_t FINAL = 4'hA;  // Round 10 - last transformation
localparam round_t DONE = 4'hB;   // Output result and reset

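// FSM control logic
// Advance while a block is in flight (state != IDLE) or when a new block arrives
assign fsm_en = (|fsm_q) | data_valid_in;
// DONE is 4'b1011: bits 3, 1 and 0 are all set
assign finished_v = fsm_q[3] & fsm_q[1] & fsm_q[0];
// Wrap back to IDLE after DONE, otherwise count up; the carry bit is discarded
assign {unused_fsm_sum_msb, fsm_next} = finished_v ? 5'b00000 : fsm_q + 4'b0001;
// Matches 4'b101x, i.e. FINAL (and DONE): selects the path without MixColumns
assign last_iter_v = fsm_q[3] & fsm_q[1];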

This design uses combinational logic for all transformations, achieving single-cycle execution per round.
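
As a rough cross-check of the control logic above, the round counter can be modelled in a few lines of Python. This is only a sketch: it assumes fsm_q is a register that loads fsm_next whenever fsm_en is high and that data_valid_in pulses for one cycle; fsm_step and trace_one_block are illustrative names, not part of the RTL.

```python
IDLE, FINAL, DONE = 0x0, 0xA, 0xB

def fsm_step(fsm_q, data_valid_in):
    """One clock edge of the round counter, mirroring the assigns above."""
    fsm_en = (fsm_q != 0) or data_valid_in       # (|fsm_q) | data_valid_in
    finished_v = (fsm_q & 0b1011) == 0b1011      # bits 3, 1 and 0 all set
    fsm_next = 0 if finished_v else (fsm_q + 1) & 0xF
    # Assumption: the state register only updates when fsm_en is high.
    return fsm_next if fsm_en else fsm_q

def trace_one_block():
    """List the states visited while one block moves through the core."""
    state, visited = IDLE, []
    valid = True                                 # single-cycle data_valid_in pulse
    while True:
        visited.append(state)
        state = fsm_step(state, valid)
        valid = False
        if state == IDLE:
            return visited

states = trace_one_block()
print([hex(s) for s in states])
print(len(states), "cycles per block")
```

Under these assumptions the counter visits IDLE, rounds 1 to 9, FINAL and DONE once each before returning to IDLE.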

Data Flow

Initial Round: AddRoundKey only
Rounds 1-9: SubBytes → ShiftRows → MixColumns → AddRoundKey
Final Round: SubBytes → ShiftRows → AddRoundKey
Output: Result available with valid signal
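
The data flow above can be sketched as a small byte-oriented software model. This is not the RTL and not the project's test code; it is a minimal reference written from FIPS-197, with hypothetical function names, that derives each round key on the fly the way a hardware key scheduler would.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def rotl8(v, n):
    return ((v << n) | (v >> (8 - n))) & 0xFF

def build_sbox():
    """S-box from the GF(2^8) inverse (x^254) followed by the affine map."""
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table

SBOX = build_sbox()

def sub_bytes(s):
    return [SBOX[b] for b in s]

def shift_rows(s):
    # Column-major state: byte index = 4 * column + row
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]

def mix_columns(s):
    out = []
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, col[r]) ^ gf_mul(3, col[(r + 1) % 4])
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return out

def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]

def next_key(k, rcon):
    """One key-schedule step: RotWord, SubWord, Rcon XOR, then the XOR chain."""
    temp = [SBOX[k[13]] ^ rcon, SBOX[k[14]], SBOX[k[15]], SBOX[k[12]]]
    out = []
    for i in range(16):
        out.append(k[i] ^ (temp[i] if i < 4 else out[i - 4]))
    return out

def encrypt_block(plaintext, key):
    round_key, rcon = list(key), 0x01
    state = add_round_key(list(plaintext), round_key)    # initial round
    for rnd in range(1, 11):
        round_key = next_key(round_key, rcon)
        rcon = gf_mul(rcon, 2)
        state = shift_rows(sub_bytes(state))
        if rnd != 10:                                     # final round skips MixColumns
            state = mix_columns(state)
        state = add_round_key(state, round_key)
    return bytes(state)

pt = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
print(encrypt_block(pt, key).hex())
```

The plaintext and key in the usage lines are the worked example from FIPS-197 Appendix B, which makes the model easy to compare against the standard.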

Core Transformations

SubBytes: S-box Implementation

Rather than using lookup tables, I implemented the S-box using optimized combinational logic based on Boyar & Peralta's 2009 research. This approach reduces area and eliminates memory dependencies:

generate
    for (genvar i = 0; i < 16; i++) begin : loop_gen_sb_i
        // One combinational S-box per state byte
        aes_sbox sbox(
            .data_in(data_q[(i*8)+7:(i*8)]),
            .data_out(sub_bytes[(i*8)+7:(i*8)])
        );
    end
endgenerate

The S-box module contains 67 internal logic terms implementing the mathematical transformation in GF(2^8), resulting in a compact design.
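
For reference, the mathematical definition that any S-box circuit has to match can be written directly in Python: take the multiplicative inverse in GF(2^8), then apply the FIPS-197 affine transformation. The sketch below is a plain software definition, not the Boyar-Peralta gate network, and the function names are illustrative. It could serve as a golden model when checking aes_sbox over all 256 inputs.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0 by convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox(byte):
    """AES S-box: field inverse followed by the affine transformation."""
    b = gf_inv(byte)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

def inv_sbox(byte):
    """Inverse S-box by exhaustive search (fine for a reference model)."""
    return next(x for x in range(256) if sbox(x) == byte)

print(hex(sbox(0x53)))   # FIPS-197 worked example: {53} -> {ed}
```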

ShiftRows: Byte Permutation

ShiftRows is implemented as a pure wire remapping, making it zero-cost in hardware:

// First row (r = 0) is not shifted
assign shift_row[127:120] = sub_bytes[127:120];  // byte 0
assign shift_row[95:88]   = sub_bytes[95:88];    // byte 4

// Second row (r = 1) is cyclically left shifted by 1
assign shift_row[119:112] = sub_bytes[87:80];    // byte 1 <- byte 5
assign shift_row[87:80]   = sub_bytes[55:48];    // byte 5 <- byte 9
assign shift_row[55:48]   = sub_bytes[23:16];    // byte 9 <- byte 13
assign shift_row[23:16]   = sub_bytes[119:112];  // byte 13 <- byte 1
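
The same wire remapping can be expressed as an index formula. The sketch below treats the state as a 128-bit integer with byte 0 at bits [127:120], matching the slices above, and assumes the usual column-major layout (byte index = 4 * column + row). The helper names are illustrative.

```python
def get_byte(state, i):
    """Byte i of a 128-bit state, with byte 0 at bits [127:120]."""
    return (state >> (120 - 8 * i)) & 0xFF

def shift_rows(state):
    """Rotate row r of the column-major state left by r positions."""
    out = 0
    for c in range(4):
        for r in range(4):
            src = 4 * ((c + r) % 4) + r      # source byte feeding byte 4*c + r
            out = (out << 8) | get_byte(state, src)
    return out

# Label each byte with its own index so the permutation is easy to read.
state = int.from_bytes(bytes(range(16)), "big")
print(list(shift_rows(state).to_bytes(16, "big")))
```

With the labelled input, byte 1 receives byte 5, byte 5 receives byte 9, and so on, which is the pattern listed in the assigns for row 1.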

MixColumns: Galois Field Arithmetic

MixColumns multiplies each state column by a fixed matrix over GF(2^8). The key insight is that every product can be built from XOR and multiplication by 2 (xtime), because 3*b = xtime(b) ^ b:

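module aes_mixw (
    input  logic [31:0] w_i,
    output logic [31:0] mixw_o
);
    logic [7:0] b0, b1, b2, b3;
    logic [7:0] gm2_b0, gm2_b1, gm2_b2, gm2_b3;
    logic [7:0] gm3_b0, gm3_b1, gm3_b2, gm3_b3;
    logic [7:0] mb0, mb1, mb2, mb3;

    // Split the column into bytes (assumed order: b3 is the most significant byte)
    assign {b3, b2, b1, b0} = w_i;

    // GF(2^8) multiplication by 2 (xtime): shift left, XOR 0x1b if bit 7 was set
    assign gm2_b0 = b0[7] ? ({b0[6:0], 1'b0} ^ 8'h1b) : {b0[6:0], 1'b0};
    assign gm2_b1 = b1[7] ? ({b1[6:0], 1'b0} ^ 8'h1b) : {b1[6:0], 1'b0};
    assign gm2_b2 = b2[7] ? ({b2[6:0], 1'b0} ^ 8'h1b) : {b2[6:0], 1'b0};
    assign gm2_b3 = b3[7] ? ({b3[6:0], 1'b0} ^ 8'h1b) : {b3[6:0], 1'b0};

    // GF(2^8) multiplication by 3: (2 * b) ^ b
    assign gm3_b0 = gm2_b0 ^ b0;
    assign gm3_b1 = gm2_b1 ^ b1;
    assign gm3_b2 = gm2_b2 ^ b2;
    assign gm3_b3 = gm2_b3 ^ b3;

    // Matrix multiplication
    assign mb3 = gm2_b3 ^ gm3_b2 ^ b1 ^ b0;
    assign mb2 = b3 ^ gm2_b2 ^ gm3_b1 ^ b0;
    assign mb1 = b3 ^ b2 ^ gm2_b1 ^ gm3_b0;
    assign mb0 = gm3_b3 ^ b2 ^ b1 ^ gm2_b0;

    assign mixw_o = {mb3, mb2, mb1, mb0};
endmodule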

Four parallel MixColumns units process all columns simultaneously, maintaining the single-cycle-per-round performance.
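
A word-level Python model of the same equations is handy for generating expected values for tb_aes_mixw. The sketch below assumes b3 is the most significant byte of the 32-bit word; the Python names mirror the RTL signals but are otherwise illustrative.

```python
def xtime(b):
    """GF(2^8) multiplication by 2: shift left, reduce by 0x11b on overflow."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b

def mix_word(w):
    """MixColumns on one 32-bit column, using only XOR and xtime."""
    # Assumption: b3 is the most significant byte, b0 the least significant.
    b3, b2, b1, b0 = (w >> 24) & 0xFF, (w >> 16) & 0xFF, (w >> 8) & 0xFF, w & 0xFF
    gm2 = {name: xtime(v) for name, v in (("b0", b0), ("b1", b1), ("b2", b2), ("b3", b3))}
    gm3 = {"b0": gm2["b0"] ^ b0, "b1": gm2["b1"] ^ b1,
           "b2": gm2["b2"] ^ b2, "b3": gm2["b3"] ^ b3}
    mb3 = gm2["b3"] ^ gm3["b2"] ^ b1 ^ b0
    mb2 = b3 ^ gm2["b2"] ^ gm3["b1"] ^ b0
    mb1 = b3 ^ b2 ^ gm2["b1"] ^ gm3["b0"]
    mb0 = gm3["b3"] ^ b2 ^ b1 ^ gm2["b0"]
    return (mb3 << 24) | (mb2 << 16) | (mb1 << 8) | mb0

print(hex(mix_word(0xDB135345)))   # widely used MixColumns test column
```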

AddRoundKey

The simplest transformation is just XOR with the round key:

assign round_key_result = data_valid_in ? data_in : (last_iter_v ? shift_row : mix_columns);
assign data_next = round_key_result ^ key_current;

The multiplexer logic handles the different data paths for initial input, middle rounds, and final round.

Key Scheduler Implementation

The key scheduler generates round keys using three operations: RotWord, SubWord, and Rcon XOR:

module aes_key_scheduling(
    input logic [127:0] key_in,
    input logic [7:0] key_rcon_in,
    output logic [127:0] key_next_out,
    output logic [7:0] key_rcon_out
);

// RotWord: {a,b,c,d} -> {b,c,d,a}
assign temp_rotated = {w3[23:16], w3[15:8], w3[7:0], w3[31:24]};

// SubWord using S-box
aes_sbox sbox_0 (.data_in(temp_rotated[31:24]), .data_out(sbox_out_0));
aes_sbox sbox_1 (.data_in(temp_rotated[23:16]), .data_out(sbox_out_1));
aes_sbox sbox_2 (.data_in(temp_rotated[15:8]),  .data_out(sbox_out_2));
aes_sbox sbox_3 (.data_in(temp_rotated[7:0]),   .data_out(sbox_out_3));

assign temp_subbed = {sbox_out_0, sbox_out_1, sbox_out_2, sbox_out_3};
assign temp = temp_subbed ^ rcon_word;

// Generate new key words
assign w4 = w0 ^ temp;
assign w5 = w1 ^ w4;
assign w6 = w2 ^ w5;
assign w7 = w3 ^ w6;

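The Rcon value starts at 0x01 and doubles in GF(2^8) every round, giving 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36 for the ten AES-128 rounds. It is implemented as a simple lookup function.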
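
One key-schedule step can be modelled in Python with the same interface as the module above: a 128-bit key and an 8-bit Rcon go in, and the next key and next Rcon come out. This is a sketch under two assumptions that the excerpt does not confirm: w0 is the most significant word of key_in, and rcon_word places the Rcon byte in the top byte of the word. It also derives the next Rcon by doubling in GF(2^8) rather than by lookup.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def sbox(x):
    """AES S-box: inverse in GF(2^8) (x^254), then the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

def key_schedule_step(key, rcon):
    """Return (next 128-bit round key, next Rcon) for AES-128."""
    # Assumption: w0 is the most significant 32-bit word of the key.
    w0, w1, w2, w3 = [(key >> s) & 0xFFFFFFFF for s in (96, 64, 32, 0)]
    rotated = ((w3 << 8) | (w3 >> 24)) & 0xFFFFFFFF           # RotWord
    subbed = 0
    for s in (24, 16, 8, 0):
        subbed |= sbox((rotated >> s) & 0xFF) << s            # SubWord
    temp = subbed ^ (rcon << 24)                              # Rcon in the top byte
    w4 = w0 ^ temp
    w5 = w1 ^ w4
    w6 = w2 ^ w5
    w7 = w3 ^ w6
    return (w4 << 96) | (w5 << 64) | (w6 << 32) | w7, gf_mul(rcon, 2)

key, rcon = 0x2B7E151628AED2A6ABF7158809CF4F3C, 0x01   # FIPS-197 Appendix A.1 key
round_keys, rcons = [], []
for _ in range(10):
    rcons.append(rcon)
    key, rcon = key_schedule_step(key, rcon)
    round_keys.append(key)

print(f"{round_keys[0]:032x}")
print([hex(r) for r in rcons])
```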

Decryption

Decryption uses inverse transformations in reverse order. The key insight is that the inverse operations can reuse much of the forward encryption logic:

InvSubBytes: Uses inverse S-box lookup table
InvShiftRows: Right shifts instead of left shifts
InvMixColumns: Matrix multiplication with inverse matrix
AddRoundKey: Same operation (XOR is self-inverse)
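
To illustrate the InvMixColumns entry, both the forward and inverse matrices are circulant, so a single helper parameterised by the first row covers both directions. The sketch below is a software model only. It does not reflect how the RTL shares logic, and circulant_mul is a hypothetical name.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def circulant_mul(first_row, column):
    """Multiply a 4-byte column by the circulant matrix with this first row."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            # Row i of a circulant matrix is the first row rotated right by i.
            acc ^= gf_mul(first_row[(j - i) % 4], column[j])
        out.append(acc)
    return out

MIX = (0x02, 0x03, 0x01, 0x01)        # MixColumns first row
INV_MIX = (0x0E, 0x0B, 0x0D, 0x09)    # InvMixColumns first row

column = [0xDB, 0x13, 0x53, 0x45]
mixed = circulant_mul(MIX, column)
restored = circulant_mul(INV_MIX, mixed)
print([hex(b) for b in mixed], [hex(b) for b in restored])
```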

The inverse key scheduler works backward from the final round key to generate previous round keys:

// Inverse key expansion: calculate previous round key
assign w_prev3 = w2 ^ w3;
assign w_prev2 = w1 ^ w2;  
assign w_prev1 = w0 ^ w1;
assign w_prev0 = w0 ^ temp;  // temp derived from w_prev3
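
The reason this works is purely algebraic: the forward step is a chain of XORs, so it can be undone word by word no matter what the nonlinear function is. The sketch below demonstrates that with a deliberately simplified stand-in g. It is not the real SubWord(RotWord(w)) ^ Rcon and must not be used as a key schedule. All names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def g(word, rcon):
    """Stand-in for SubWord(RotWord(word)) ^ Rcon. NOT the real AES function."""
    rotated = ((word << 8) | (word >> 24)) & MASK32
    return rotated ^ (rcon << 24)

def forward_step(words, rcon):
    """Next round key from the current one (same XOR chain as the scheduler)."""
    w0, w1, w2, w3 = words
    w4 = w0 ^ g(w3, rcon)
    w5 = w1 ^ w4
    w6 = w2 ^ w5
    w7 = w3 ^ w6
    return (w4, w5, w6, w7)

def inverse_step(words, rcon):
    """Previous round key from the current one."""
    w0, w1, w2, w3 = words
    w_prev3 = w2 ^ w3
    w_prev2 = w1 ^ w2
    w_prev1 = w0 ^ w1
    w_prev0 = w0 ^ g(w_prev3, rcon)   # temp is derived from w_prev3
    return (w_prev0, w_prev1, w_prev2, w_prev3)

start = (0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C)
nxt = forward_step(start, 0x01)
print(inverse_step(nxt, 0x01) == start)
```

Note that w_prev3 has to be recovered first, since the temp value for w_prev0 depends on it.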

Verification and Testing

I created a Python script using PyCryptodome as a golden model to generate a small set of test vectors:

from Crypto.Cipher import AES  # provided by PyCryptodome

# Excerpt: `files` (the open output handles) and hex_to_binary_string
# are not shown here.
def write_test_vectors(test_vectors, output_dir='test_vec'):
    for i, tv in enumerate(test_vectors):
        plaintext = tv['plaintext']
        key = tv['key']

        # Encrypt a single block with the golden model (ECB, no chaining)
        cipher = AES.new(key, AES.MODE_ECB)
        ciphertext = cipher.encrypt(plaintext)

        # Write both binary and hex formats (only the binary writes are shown)
        files['data_bin'].write(hex_to_binary_string(plaintext) + '\n')
        files['key_bin'].write(hex_to_binary_string(key) + '\n')
        files['res_bin'].write(hex_to_binary_string(ciphertext) + '\n')

Test Suite Architecture

The verification environment includes:

Component-level tests: Individual S-box, MixColumns, and key scheduler verification
System-level tests: Full encryption/decryption with NIST test vectors
Corner case testing: All zeros, all ones, and custom patterns
Roundtrip verification: Encrypt then decrypt to verify data integrity

Performance

Timing and Throughput

Latency: 12 clock cycles (initial AddRoundKey, 10 encryption rounds, and the DONE output cycle)
Throughput: 1.067 Gbps @ 100 MHz (one 128-bit block every 12 cycles)
Efficiency: 10.67 bits per clock cycle
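
These figures follow directly from the block size and the cycle count. The short calculation below reproduces them, assuming a single block is in flight at a time (no pipelining) and a 100 MHz clock.

```python
BLOCK_BITS = 128          # AES block size
CYCLES_PER_BLOCK = 12     # latency quoted above
CLOCK_HZ = 100e6          # 100 MHz

bits_per_cycle = BLOCK_BITS / CYCLES_PER_BLOCK
throughput_gbps = bits_per_cycle * CLOCK_HZ / 1e9

print(f"{bits_per_cycle:.2f} bits/cycle, {throughput_gbps:.3f} Gbps")
```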

Resource Utilization

The design optimizes for balanced area/performance:

Logic elements: Moderate usage due to combinational S-box
Memory: Zero block RAM usage (no lookup tables)
DSP blocks: Not required
Routing: Clean due to regular structure

Resources

⚠️ Disclaimer: This implementation is for educational and research purposes. Production cryptographic systems require additional security analysis and validation.