Optimizing SPI communication on STM32 MCUs: a comprehensive guide to high-frequency communication

STM32 MCUs come with various peripherals, one of them is SPI (Serial Peripheral Interface) which is a simple serial bus interface commonly used for short-distance communication between various devices. SPI is one of the interfaces used by TPM chips for communication with PC motherboard. SPI uses 4 lines for communication: MOSI, MISO, SCK, SS, which are described down below.

The device must implement the TPM protocol to work as a TPM module. TPM protocol works by transmitting variable-length frames over SPI. The 4-byte TPM header contains fields describing the length of the data payload, the address of the target TPM register, and the transfer direction (read or write). TPM protocol has its own means of handling flow control (as there isn’t a standard flow control mechanism on SPI) and for doing bus aborts.

TPM must be able to operate at frequencies from 10 MHz to 24 MHz to comply with the TCG PTP specification (see section 7.4.1 and TwPM Documentation). Getting SPI right on such high frequencies is a significant challenge, especially when operating as a slave.TPM-specific features complicate things further. Some platforms require TPMs to support higher frequencies. PTP specification encourages support for the 33-66 MHz range in addition to the required range of 10-24 MHz, and future versions of the specification may mandate higher frequencies, so the platform should be capable of handling them.

Limitations of STM32L476

STM32L476 has SPI capable of frequencies up to a (theoretical) limit of 40 MHz, which is half of the maximum clock that can be provided to Cortex-M and AHB/APB buses. SPI capabilities are described in the STM32 Programming Manual (RM0351) section 42.2.

Another limiting factor is maximum GPIO speed, which depends on operating conditions such as the voltage provided to the MCU, ambient temperature, and parameters of cables used to connect the master and slave. GPIO limitations are described in the STM32L476RG datasheet in section 6.4. Table 72 describes the maximum frequency of GPIO outputs.

More problematic may be DMA limitations. The hard limit of DMA transfer speed would be 80 Mbits/s as 80 MHz is the maximum frequency that can be provided to the AHB bus and the MCU. The actual transfer speed may be lower due to AHB and APB protocol overhead, bus contention, etc. Unfortunately, the datasheet does not provide information about DMA transfer limitations.

Last but not least, the performance of the firmware itself is significant. Achieving target frequency may require extensive optimizations.

Creating SPI slave on Zephyr

Zephyr is our platform of choice primarily due to its portability (we will target non-STM32 platforms too). I will briefly describe the outcome of my early tests done on Zephyr and why it was terrible.

The application just transmits a static sequence of bytes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
const struct device *spi_dev = NULL;
struct spi_config spi_cfg = {
    .frequency = 25000000,
    .operation = SPI_WORD_SET(8) | SPI_TRANSFER_MSB | SPI_OP_MODE_SLAVE,
};

const uint8_t testdata[32] = {
    0x3d, 0x7f, 0x85, 0xc3, 0x86, 0x2f, 0x14, 0x2a, 0xa2, 0x67, 0x1d, 0xd7, 0xfa, 0xa8, 0x3a, 0x42,
    0xf8, 0x12, 0xd2, 0xa1, 0x04, 0xcc, 0xe2, 0xc6, 0x78, 0x73, 0x09, 0xe6, 0xd8, 0xc5, 0x0e, 0xba,
};

void main(void) {
    spi_dev = device_get_binding(DEVICE_DT_NAME(DT_NODELABEL(spi2)));
    if (!device_is_ready(spi_dev)) {
        printk("SPI controller is not ready!\n");
        return;
    }

    struct spi_buf spi_tx_buf = {
        .buf = (void *)testdata,
        .len = sizeof testdata,
    };

    struct spi_buf_set spi_tx_buf_set = {
        .buffers = &spi_tx_buf,
        .count = 1,
    };

    while (true) {
        int ret = spi_write(spi_dev, &spi_cfg, &spi_tx_buf_set);
        if (ret < 0)
            printk("spi_write failed: %d\n", ret);
        else
            printk("spi transfer complete\n");
    }
}

It is necessary to enable some configs:

1
2
3
4
CONFIG_GPIO=y
CONFIG_SPI=y
CONFIG_SPI_SLAVE=y
CONFIG_SPI_STM32_DMA=y

And configure DMA channels:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
&dma1 {
    status = "okay";
};

&spi1 {
    status = "disabled";
};

&spi2 {
    /*
     * See https://docs.zephyrproject.org/3.0.0/reference/devicetree/bindings/dma/st%2Cstm32-dma-v2.html
     */
    dmas = <&dma1 5 1 0x20440>,
           <&dma1 4 1 0x20480>;
    dma-names = "tx", "rx";
};

&spi3 {
    status = "disabled";
};

If we run this code we will see SPI errors running down the console:

1
2
3
4
5
spi_write failed: -11
spi_write failed: -11
spi_write failed: -11
spi_write failed: -11
spi_write failed: -11

This is the first problem with Zephyr’s SPI driver - each transfer has a one-second timeout. While this may be desirable behavior for SPI master (it could be used for error recovery, for example, to power cycle the slave if it doesn’t respond), it breaks SPI slave. The slave must be ready to give a response when the transfer commences - appropriate data must already be loaded in FIFO. Here we get stuck in an endless loop, queuing the transfer, aborting it, and queuing it again.

The problem can be worked around by patching the wait_dma_rx_tx_done function in spi_ll_stm32.c. The original function looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
static int wait_dma_rx_tx_done(const struct device *dev)
{
    struct spi_stm32_data *data = dev->data;
    int res = -1;

    while (1) {
        res = k_sem_take(&data->status_sem, K_MSEC(1000));
        if (res != 0) {
            return res;
        }
...

Just replace K_MSEC(1000) with K_FOREVER.

Now running spitest at 100 KHz yields the following result:

STM32 SPI slave at 100 KHz

The transfer works properly at 100 KHz. At 10 MHz the transfer sometimes works, sometimes does not:

STM32 SPI slave at 10 MHz

At 24 MHz transfer is completely corrupted. We have been looking for a solution in Zephyr Issues and Pull Requests but found nothing useful.

Looking at Zephyr’s SPI driver code, we discovered that every call to spi_write causes many things to happen. Among others, the SPI controller is reconfigured every single time. During this process, the SPI controller is disabled and re-enabled, which is quite suspicious.

Reading STM32 documentation

I’ve been searching through STM32 documentation for information about high-speed SPI. The most helpful were the STM32L4 series programming manual and AN5543. Section 4.2 of AN5543 describes various aspects of handling high-speed communication, and section 4.2.2 describes what happens when SPI is disabled.

The main problem here is that Zephyr (as well as STM32 HAL) re-configures SPI before each transaction, doing configure-enable-transmit-disable cycle on each SPI session. While this is ok for master, slave must respect timings imposed by master, so SPI disabling should be avoided if not needed.

The problem becomes even more evident when we want to implement TPM protocol as we don’t know size (and direction) of data payload. Each TPM frame starts with a 4 byte header which tells us what is the size of transfer and what is the direction (read from or write to a register):

img

After we read the header, we disable SPI, causing a few things:

  • MISO is left floating (we have SPI v1.3 on STM32L4)
  • we introduce additional delay by re-configuring SPI

Fixing SPI

We decided to continue the tests using only HAL and STM32CubeIDE (we plan to port the solution back to Zephyr). From earlier tests, we already know that HAL also does not work correctly, but it is easier to roll out a custom solution.

So, I created a new STM32CubeMX project and set up the SPI2 controller through the graphical configuration manager. Basic settings involve configuring SPI as a Full-Duplex Slave, configuring NSS (Chip Select) pin as input, setting 8-bit frame length (as required by TPM spec), and setting up DMA channels. All other settings are left at their defaults.

img

img

STM32CubeMX generates code that performs hardware initialization, and we are ready to do SPI transactions using the HAL_SPI_TransmitReceive_DMA function. Let’s look at the implementation:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
HAL_StatusTypeDef HAL_SPI_TransmitReceive_DMA(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size)
{
  ...

  /* Reset the threshold bit */
  CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX | SPI_CR2_LDMARX);

  /* The packing mode management is enabled by the DMA settings according the spi data size */
  if (hspi->Init.DataSize > SPI_DATASIZE_8BIT)
  {
    /* Set fiforxthreshold according the reception data length: 16bit */
    CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
  }
  else
  {
    /* Set RX Fifo threshold according the reception data length: 8bit */
    SET_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

    if (hspi->hdmatx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
    {
      if ((hspi->TxXferSize & 0x1U) == 0x0U)
      {
        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
        hspi->TxXferCount = hspi->TxXferCount >> 1U;
      }
      else
      {
        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
        hspi->TxXferCount = (hspi->TxXferCount >> 1U) + 1U;
      }
    }

    if (hspi->hdmarx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
    {
      /* Set RX Fifo threshold according the reception data length: 16bit */
      CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

      if ((hspi->RxXferCount & 0x1U) == 0x0U)
      {
        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
        hspi->RxXferCount = hspi->RxXferCount >> 1U;
      }
      else
      {
        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
        hspi->RxXferCount = (hspi->RxXferCount >> 1U) + 1U;
      }
    }
  }

  /* Check if we are in Rx only or in Rx/Tx Mode and configure the DMA transfer complete callback */
  if (hspi->State == HAL_SPI_STATE_BUSY_RX)
  {
    /* Set the SPI Rx DMA Half transfer complete callback */
    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfReceiveCplt;
    hspi->hdmarx->XferCpltCallback     = SPI_DMAReceiveCplt;
  }
  else
  {
    /* Set the SPI Tx/Rx DMA Half transfer complete callback */
    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfTransmitReceiveCplt;
    hspi->hdmarx->XferCpltCallback     = SPI_DMATransmitReceiveCplt;
  }

  /* Set the DMA error callback */
  hspi->hdmarx->XferErrorCallback = SPI_DMAError;

  /* Set the DMA AbortCpltCallback */
  hspi->hdmarx->XferAbortCallback = NULL;

  /* Enable the Rx DMA Stream/Channel  */
  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)hspi->pRxBuffPtr,
                                 hspi->RxXferCount))
  {
    /* Update SPI error code */
    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
    errorcode = HAL_ERROR;

    hspi->State = HAL_SPI_STATE_READY;
    goto error;
  }

  /* Enable Rx DMA Request */
  SET_BIT(hspi->Instance->CR2, SPI_CR2_RXDMAEN);

  /* Set the SPI Tx DMA transfer complete callback as NULL because the communication closing
  is performed in DMA reception complete callback  */
  hspi->hdmatx->XferHalfCpltCallback = NULL;
  hspi->hdmatx->XferCpltCallback     = NULL;
  hspi->hdmatx->XferErrorCallback    = NULL;
  hspi->hdmatx->XferAbortCallback    = NULL;

  /* Enable the Tx DMA Stream/Channel  */
  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)hspi->pTxBuffPtr, (uint32_t)&hspi->Instance->DR,
                                 hspi->TxXferCount))
  {
    /* Update SPI error code */__HAL_SPI_ENABLE
    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
    errorcode = HAL_ERROR;

    hspi->State = HAL_SPI_STATE_READY;
    goto error;
  }

  /* Check if the SPI is already enabled */
  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
  {
    /* Enable SPI peripheral */
    __HAL_SPI_ENABLE(hspi);
  }
  /* Enable the SPI Error Interrupt Bit */
  __HAL_SPI_ENABLE_IT(hspi, (SPI_IT_ERR));

  /* Enable Tx DMA Request */
  SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);

  ...
}

What this code does:

  • Initialize callbacks (like transfer complete callbacks)
  • Configure SPI registers
  • Initialize DMA channels and enable DMA on SPI controller (RXDMAEN and TXDMAEN bits)
  • Enable SPI interrupts
  • Enable SPI controller

Many of these things could be done only once and never changed. Doing this every time introduces additional overhead. Moreover, SPI is re-enabled before each transaction and disabled after the transaction. This worsens the overhead and causes other problems described in AN5543:

SPI versions 1.x.x: the peripheral takes no control of the associated GPIOs when it is disabled. The SPI signals float if they are not supported by external resistor and if they are not reconfigured and they are kept at alternate function configuration.

At principle, the SPI must not be disabled before the communication is fully completed and it should be as short as possible at slave, especially between sessions, to avoid missing any communication.

On Nucleo L476RG we use, we have SPI v1.3, which does not drive MISO when disabled. We have observed MISO line changing unexpectedly during SPI idle periods, presumably caused by this.

HAL_SPI_TransmitReceive_DMA setups interrupt callbacks which handle error detection and the end-of-transaction condition (SPI_EndRxTxTransaction), which involves waiting for the master to stop sending data and the SPI bus to become idle. This causes more unnecessary overhead, as we don’t have to wait for SPI idle. We can process data as soon as RX DMA completes and queue more data as soon as TX DMA completes.

A transaction in the TPM protocol consists of three steps: TPM header transmission, flow control, and data payload transmission. After receiving the header, we know the size of the entire transaction, removing the need for end-of-transaction checking.

I created a stripped-down version of HAL_SPI_TransmitReceive_DMA:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
static uint8_t tx_buf[4] = {0xaa, 0xe0, 0xbb, 0xa5};
static uint8_t rx_buf[4] = {0};

static void rxdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
    HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)rx_buf, 4);
}

static void txdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)tx_buf, (uint32_t)&hspi->Instance->DR, 4);
}

void app_main() {
    SPI_HandleTypeDef *hspi = &hspi2;

    // Initialize callbacks
    hspi->hdmatx->XferCpltCallback = txdma_complete;
    hspi->hdmarx->XferCpltCallback = rxdma_complete;

    // One-time SPI configuration
    // Clear SPI_RXFIFO_THRESHOLD to trigger DMA on each byte available.
    CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

    // Start the transfer
    SET_BIT(hspi->Instance->CR2, SPI_CR2_RXDMAEN);
    HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)rx_buf, 4);

    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)tx_buf, (uint32_t)&hspi->Instance->DR, 4);
    SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);

    __HAL_SPI_ENABLE(hspi);
}

The code size is reduced to almost a minimum - still, some optimizations could be done in HAL_DMA_Start_IT. Currently, we transmit only 4 bytes of static data to test whether MCU can handle this before going further.

I’m using a bit different initialization sequence than HAL: HAL enables RXDMAEN after programming the channel and TXDMAEN after enabling SPI. Our code follows the sequence described in the STM32 Programming Manual (rm0351).

img

For testing purposes, I’m using Raspberry PI 3B as SPI host. Configuration is pretty straightforward, you can enable spidev by uncommenting dtoverlay=spi0-1cs in /boot/config.txt and rebooting. For communicating with spidev I’m using a custom Python script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
from spidev import SpiDev

class Spi:
    def __init__(self):
        self.device = SpiDev()
        self.device.open(0, 0)
        self.device.bits_per_word = 8
        self.device.mode = 0b00
        self.device.max_speed_hz = 24000000
        self.freq = 24000000

    def get_frequency(self):
        return self.freq

    def set_frequency(self, freq):
        self.freq = freq

    def xfer(self, data: bytes) -> bytes:
        return self.device.xfer(data, self.freq)


def test(func):
    def wrapper():
        freq = spi.freq
        iteration = 0

        try:
            for i in range(10):
                iteration = i
                func()
            print(f'OK: {func.__name__} @ {freq} Hz')
        except AssertionError:
            print(f'FAIL: {func.__name__} @ {freq} Hz (iteration {iteration})')
    return wrapper


@test
def test_read_constant_4byte():
    expected_data = [0xaa, 0xe0, 0xbb, 0xa5]
    result = spi.xfer([0xff] * len(expected_data))
    print('result = {:x} {:x} {:x} {:x}'.format(result[0], result[1], result[2], result[3]))
    assert result == expected_data


def main():
    global spi

    try:
        spi = Spi()
        spi.set_frequency(24000000)
        test_read_constant_4byte()
    finally:
        spi.device.close()

if __name__ == '__main__':
    main()

After running the test code, I saw the transmitted data was correct through the logic analyzer, but Raspberry PI didn’t receive the right data. This was a problem with the connection between Raspberry PI and Nucleo. I could achieve stable transmission at frequencies up to 18 MHz. After changing cable connections, I got stable transmission at 22 MHz. Before, I was using two 20 cm male-to-female jumper wires for each SPI line. The cables and the logic analyzer probes were connected to a breadboard. Now, I have a direct connection between Nucleo and Raspberry using a single 20 cm female jumper wire for each line. Nucleo pins stretch into two sides of the board, so I can attach the probes directly on the backside of Nucleo.

The work continues - implementing TPM protocol

While 22 MHz is not the frequency we aim for, I continued tests on the highest frequency I could afford for now (in the meantime planning to replace the cables with better ones). I extended the code to speak over the TPM protocol

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
#define STATE_WAIT_HEADER 0
#define STATE_WAIT_STATE 1
#define STATE_WAIT_STATE_LAST 2
#define STATE_PAYLOAD_TRANSFER 3

static uint8_t ff_buffer[64];
static uint8_t waitstate_insert[4] = {0xff, 0xff, 0xff, 0xfe};
static uint8_t waitstate_hold[1] = {0x00};

static uint8_t waitstate_cancel[1] = {0x01};
static uint8_t trash[1] = {0};
static uint8_t header[4] = {0};
static bool b_txdma_complete = false;
static bool b_rxdma_complete = false;
static uint8_t state = STATE_WAIT_HEADER;

static uint8_t scratch_buffer[64] = {0};

static uint8_t transfer_is_read = 0;
static uint32_t transfer_length = 0;

static void update_state(SPI_HandleTypeDef *hspi)
{
    b_rxdma_complete = false;
    b_txdma_complete = false;

    switch (state) {
    case STATE_WAIT_HEADER:
        // We don't care what host sends during wait state, but we start DMA anyway to avoid overrun errors.
        HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)trash, sizeof trash);
        // Wait state got inserted while reading header.
        HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_cancel, (uint32_t)&hspi->Instance->DR, sizeof waitstate_cancel);

    // This follows a real TPM protocol, except we ignore addr currently.
        transfer_is_read = !!(header[0] & (1 << 7));
        transfer_length = (header[0] & 0x3f) + 1;
    // Remaining bytes contain TPM register offset. Currently we have only one
    // "register" so we just ignore that.

        state = STATE_WAIT_STATE_LAST;
        break;

    case STATE_WAIT_STATE_LAST:
        if (transfer_is_read) {
            HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)scratch_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
            HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)ff_buffer, transfer_length);
        } else {
            HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)ff_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
            HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)scratch_buffer, transfer_length);
        }

        state = STATE_PAYLOAD_TRANSFER;
        break;

    case STATE_PAYLOAD_TRANSFER:
        HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)header, sizeof header);
        HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_insert, (uint32_t)&hspi->Instance->DR, sizeof waitstate_insert);

        state = STATE_WAIT_HEADER;
        break;
    }
}

static void rxdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
    if (b_txdma_complete)
        update_state(hspi);
    else
        b_rxdma_complete = true;
}

static void txdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
    if (b_rxdma_complete)
        update_state(hspi);
    else
        b_txdma_complete = true;
}

On Python side I introduce a new function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
def tpm_read(size: int) -> bytes:
    assert size > 0 and size <= 64

    header = bytes([
        (1 << 7) | (size - 1),
        0, 0, 0
    ])

    waitstate = spi.xfer(header)
    assert waitstate == [0xff, 0xff, 0xff, 0xfe]

    while True:
        status = spi.xfer([0])
        if status[0] == 1:
            break
        assert status[0] == 0

    empty = [0] * size
    return spi.xfer(empty)

And update the test:

1
2
3
4
@test
def test_read():
    x = tpm_read(0, 8)
    assert x == [0] * 8

After running the test code, I immediately got an error, the logic analyzer showing:

img

There are two problems here. The first problem is that the CS pin goes high between the header, wait states, and payload. This was my oversight, but fixing it is not critical as it currently does not affect communication - deasserting the CS pin should abort the transaction, but we don’t handle this yet. Linux’s spidev drivers can be instructed not to deassert CS, but this is not supported by the bindings I’m using, so let’s just postpone the fix.

The other problem is with the transmission itself - Nucleo transmits wrong data (0xff) instead of 0x01 during the wait state.

To solve the problem, I went a step back. I hardcoded a few data patterns to replicate the transfer sequence:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
struct pattern {
    uint8_t *data;
    uint8_t len;
};

#define ARRAY_SIZE(x) (sizeof((x)) / sizeof((x)[0]))

static uint8_t pattern_0[] = {0xff, 0xff, 0xff, 0xfe};
static uint8_t pattern_1[] = {1};
static uint8_t pattern_2[] = {
    0x3e, 0x60, 0xc3, 0x4f, 0x35, 0x2e, 0xa6, 0xaa, 0xa6, 0x61, 0x64, 0xcb,
    0x10, 0xd7, 0x45, 0x35, 0x82, 0xc9, 0x91, 0xbc, 0x35, 0x43, 0xbb, 0xe1,
    0xea, 0x08, 0xdf, 0xdd, 0x4d, 0xd8, 0xd5, 0x94, 0x71, 0x75, 0xfd, 0x23,
    0x24, 0xf8, 0x95, 0x85, 0x7b, 0x11, 0xf9, 0xdd, 0xa0, 0xaa, 0x60, 0xc5,
    0xd2, 0x07, 0x6b, 0x3a, 0xd4, 0xd2, 0xac, 0xac, 0x1b, 0x54, 0xfe, 0x2f,
    0xa2
};

static struct pattern patterns[] = {
    { .data = pattern_0, .len = sizeof pattern_0 },
    { .data = pattern_1, .len = sizeof pattern_1 },
    { .data = pattern_2, .len = sizeof pattern_2 },
};
int current_pattern = 1;

static void txdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)patterns[current_pattern].data, (uint32_t)&hspi->Instance->DR, patterns[current_pattern].len);

    if (++current_pattern == ARRAY_SIZE(patterns))
        current_pattern = 0;
}

void app_main() {
    hspi->hdmatx->XferCpltCallback = txdma_complete;

    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)pattern_0, (uint32_t)&hspi->Instance->DR, sizeof pattern_0);
    SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);

    __HAL_SPI_ENABLE(hspi);
}

And on Python side:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
patterns = [
    [0xff, 0xff, 0xff, 0xfe],
    [0x01],
    [
        0x3e, 0x60, 0xc3, 0x4f, 0x35, 0x2e, 0xa6, 0xaa, 0xa6, 0x61, 0x64, 0xcb,
        0x10, 0xd7, 0x45, 0x35, 0x82, 0xc9, 0x91, 0xbc, 0x35, 0x43, 0xbb, 0xe1,
        0xea, 0x08, 0xdf, 0xdd, 0x4d, 0xd8, 0xd5, 0x94, 0x71, 0x75, 0xfd, 0x23,
        0x24, 0xf8, 0x95, 0x85, 0x7b, 0x11, 0xf9, 0xdd, 0xa0, 0xaa, 0x60, 0xc5,
        0xd2, 0x07, 0x6b, 0x3a, 0xd4, 0xd2, 0xac, 0xac, 0x1b, 0x54, 0xfe, 0x2f,
        0xa2
    ]
]

spi = Spi()
spi.set_frequency(22000000)

for i in range(100000):
    for pattern in patterns:
        data = spi.xfer([0] * len(pattern))
        print(f'iter {i} test {data} == {pattern}')
        assert data == pattern

The code works just fine (100k iterations)

1
2
3
4
5
...
iter 99999 test [255, 255, 255, 254] == [255, 255, 255, 254]
iter 99999 test [1] == [1]
iter 99999 test [62, 96, 195, 79, 53, 46, 166, 170, 166, 97, 100, 203, 16, 215, 69, 53, 130, 201, 145, 188, 53, 67, 187, 225, 234, 8, 223, 221, 77, 216, 213, 148, 113, 117, 253, 35, 36, 248, 149, 133, 123, 17, 249, 221,
160, 170, 96, 197, 210, 7, 107, 58, 212, 210, 172, 172, 27, 84, 254, 47, 162] == [62, 96, 195, 79, 53, 46, 166, 170, 166, 97, 100, 203, 16, 215, 69, 53, 130, 201, 145, 188, 53, 67, 187, 225, 234, 8, 223, 221, 77, 216, 213, 148, 113, 117, 253, 35, 36, 248, 149, 133, 123, 17, 249, 221, 160, 170, 96, 197, 210, 7, 107, 58, 212, 210, 172, 172, 27, 84, 254, 47, 162]

The main difference is that the full code performs reading and writing, contrary to only writing. Currently, we wait for both TX and RX DMA to complete before re-programming DMA channels and updating the state machine. TX and RX are always the same size, so they should complete in a similar time. So, instead of using interrupts for both channels, I changed the code so that interrupts are used for TX and polling for RX (tests showed that TX DMA usually completes first).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
static void txdma_complete(DMA_HandleTypeDef *hdma)
{
    SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);

    switch (state) {
    case STATE_WAIT_HEADER:
        // Wait state got inserted while reading header.
        HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_cancel, (uint32_t)&hspi->Instance->DR, sizeof waitstate_cancel);

        // We don't care what host sends during wait state, but we start DMA anyway to avoid overrun errors.
        HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
        HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)trash, sizeof trash);

        transfer_is_read = !!(header[0] & (1 << 7));
        transfer_length = (header[0] & 0x3f) + 1;

        state = STATE_WAIT_STATE_LAST;
        break;

    case STATE_WAIT_STATE_LAST:
        if (transfer_is_read) {
            HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)scratch_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
            HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
            HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)ff_buffer, transfer_length);
        } else {
            HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)ff_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
            HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
            HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)scratch_buffer, transfer_length);
        }

        state = STATE_PAYLOAD_TRANSFER;
        break;

    case STATE_PAYLOAD_TRANSFER:
        HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_insert, (uint32_t)&hspi->Instance->DR, sizeof waitstate_insert);
        HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
        HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)header, sizeof header);

        state = STATE_WAIT_HEADER;
        break;
    }
}

I start the TX transfer first, then poll for RX DMA completion before re-programming the DMA channel. Now, the test succeeds.

Extending tests

I have basic code that can read and write data over SPI, but I have tested only read of a zeroed register. Now, it is time to extend the tests so that we write random data of random lengths, then read the data back and check whether it is as expected. I already got shorter cables - 10 cm instead of 20 cm, and I have stable communication at 24 MHz:

img

I started with something simple

1
2
3
tpm_write(0, bytes([1,2,3,4,5,6,7,8]))
x = tpm_read(0, 8)
assert x == [1,2,3,4,5,6,7,8]

and failed. The first transfer succeeded, but the second did not:

img

I hooked the debugger and saw that app was still polling for RX DMA completion. Looking again at the original code, I found that I incorrectly cleared SPI_RXFIFO_THRESHOLD bit - it should be clear for 16-bit frame length and set for 8-bit frame length.

Changing

1
CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

to

1
SET_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

solved the problem, however I got another one.

img

The wait state is properly inserted and terminated, but the payload is invalid. I split the test into two to pause the app between write and read from the register. Peeking at the scratch_buffer reveals that DMA went wrong, as the first three bytes were lost entirely.

img

Moreover, we are again stuck polling for DMA completion (DMA is still waiting for the remaining three bytes). The issue could be caused by too high delays between restarting of DMA transfers, so I lowered the SPI frequency to 100 KHz, but to my surprise, the result was exactly the same. I tested different data sizes, and the result was always the same (3 bytes lost). So, the SPI_RXFIFO_THRESHOLD fix only moved the problem a bit further. The outcome is still the same.

Summary

That’s all for this blog post. I got SPI working at 24 MHz when writing, but reading is broken. This is a significant improvement. I can’t tell whether SPI could work at 24 MHz on that platform - even though it works for write, it doesn’t mean it would work for reads and writes simultaneously. Further work could include fixing RX on the current platform, but we could also try using different platforms. We could try newer STM32 CPUs with a more recent SPI version (and possibly a higher clock frequency), such as the STM32L5 series or the latest STM32U5 series.

Further work will surely include implementing missing features, such as SPI bus aborts, SPI synchronization (using CS pin), and error recovery. Also, some patches may be needed for Zephyr due to suboptimal handling of SPI transactions. Possibly, on a faster CPU, we could achieve 24 MHz without any problems, but we could run into similar issues trying to work at the optional 66 MHz frequency. Also, Zephyr’s SPI API currently doesn’t support transmission of variable-length frames. There is an open RFC issue that covers API changes and optimized SPI handling to enable the usage of protocols requiring them, so our future work could also include working on API improvements. Lastly, we plan to upstream all Zephyr patches (if any).

The code for STM32 application is available here.


Artur Kowalski
Junior embedded developer at 3mdeb. Interested in low-level development ranging from microcontroller programming to hypervisor and kernel development. In free time working on various personal open-source projects.