Optimizing SPI communication on STM32 MCUs: a comprehensive guide to high-frequency communication
STM32 MCUs come with various peripherals. One of them is SPI (Serial Peripheral
Interface), a simple serial bus commonly used for short-distance communication
between devices. SPI is also one of the interfaces that TPM chips use to
communicate with a PC motherboard. SPI uses 4 lines for communication: MOSI
(master out, slave in), MISO (master in, slave out), SCK (clock), and SS (slave
select, also called chip select).
The device must implement the TPM protocol to work as a TPM module. The TPM
protocol works by transmitting variable-length frames over SPI. Each frame
starts with a 4-byte TPM header, which contains fields describing the length of
the data payload, the address of the target TPM register, and the transfer
direction (read or write). The TPM protocol has its own means of handling flow
control (as there isn’t a standard flow control mechanism on SPI) and of
aborting bus transfers.
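To make the header layout concrete, here is a minimal Python sketch that decodes it. The direction bit and the length field match the firmware code shown later in this post. The 24-bit big-endian address in bytes 1 to 3 is my reading of the PTP specification and should be treated as an assumption, as are the names `TpmSpiHeader` and `parse_tpm_header`.

```python
from typing import NamedTuple


class TpmSpiHeader(NamedTuple):
    is_read: bool  # True when the host reads from a TPM register
    length: int    # payload size in bytes, 1..64
    address: int   # register address (assumed to be 24 bits wide)


def parse_tpm_header(header: bytes) -> TpmSpiHeader:
    """Decode the 4-byte header that starts every TPM SPI frame."""
    if len(header) != 4:
        raise ValueError("TPM SPI header must be exactly 4 bytes")
    is_read = bool(header[0] & 0x80)  # bit 7: 1 = read, 0 = write
    length = (header[0] & 0x3F) + 1   # bits 5..0 hold (size - 1)
    # Assumption: bytes 1..3 carry the address, most significant byte first.
    address = int.from_bytes(header[1:4], "big")
    return TpmSpiHeader(is_read, length, address)


hdr = parse_tpm_header(bytes([0x87, 0xD4, 0x00, 0x00]))
print(hdr)
```

Because the length field stores size minus one in six bits, a single frame carries between 1 and 64 bytes of payload.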
A TPM must be able to operate at frequencies from 10 MHz to 24 MHz to comply
with the TCG PTP specification
(see section 7.4.1 and the TwPM Documentation).
Getting SPI right at such high frequencies is a significant challenge,
especially when operating as a slave. TPM-specific features complicate things
further. Some platforms require TPMs to support higher frequencies. The PTP
specification encourages support for the 33-66 MHz range in addition to the
required range of 10-24 MHz, and future versions of the specification may
mandate higher frequencies, so the platform should be capable of handling them.
Limitations of STM32L476
The STM32L476 has an SPI controller capable of frequencies up to a (theoretical)
limit of 40 MHz, which is half of the maximum clock that can be provided to the
Cortex-M core and the AHB/APB buses. SPI capabilities are described in the STM32
reference manual (RM0351), section 42.2.
Another limiting factor is the maximum GPIO speed, which depends on operating
conditions such as the voltage provided to the MCU, the ambient temperature, and
the parameters of the cables used to connect the master and the slave. GPIO
limitations are described in section 6.4 of the STM32L476RG datasheet. Table 72
lists the maximum frequency of GPIO outputs.
More problematic may be DMA limitations. The hard limit of DMA transfer speed
would be 80 Mbits/s as 80 MHz is the maximum frequency that can be provided to
the AHB bus and the MCU. The actual transfer speed may be lower due to AHB and
APB protocol overhead, bus contention, etc. Unfortunately, the datasheet does
not provide information about DMA transfer limitations.
Last but not least, the performance of the firmware itself is significant.
Achieving target frequency may require extensive optimizations.
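To put the firmware requirement in numbers, here is a rough back-of-the-envelope calculation of how many CPU cycles pass while a single byte is on the wire. It assumes the core runs at the 80 MHz maximum mentioned above and ignores interrupt latency and bus wait states, so the real budget is smaller. The function name `cycles_per_byte` is made up for this sketch.

```python
CPU_HZ = 80_000_000  # maximum core clock of the STM32L476


def cycles_per_byte(spi_hz: int, cpu_hz: int = CPU_HZ) -> float:
    """CPU cycles that elapse while one 8-bit SPI frame is transferred."""
    return cpu_hz * 8 / spi_hz


for spi_hz in (10_000_000, 24_000_000, 40_000_000):
    ns_per_byte = 8 / spi_hz * 1e9
    print(f"{spi_hz / 1e6:>4.0f} MHz: {ns_per_byte:6.1f} ns per byte, "
          f"{cycles_per_byte(spi_hz):5.1f} CPU cycles per byte")
```

At 24 MHz this leaves fewer than 27 cycles per byte, which is why any per-transfer reconfiguration has to be finished before the master starts clocking rather than during the transfer.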
Creating SPI slave on Zephyr
Zephyr is our platform of choice primarily due to its portability (we will
target non-STM32 platforms too). I will briefly describe the outcome of my early
tests done on Zephyr and why it was terrible.
The application just transmits a static sequence of bytes:
This is the first problem with Zephyr’s SPI driver - each transfer has a
one-second timeout. While this may be desirable behavior for SPI master (it
could be used for error recovery, for example, to power cycle the slave if it
doesn’t respond), it breaks SPI slave. The slave must be ready to give a
response when the transfer commences - appropriate data must already be loaded
in FIFO. Here we get stuck in an endless loop, queuing the transfer, aborting
it, and queuing it again.
The problem can be worked around by patching the wait_dma_rx_tx_done function
in spi_ll_stm32.c. The original function looks like this:
static int wait_dma_rx_tx_done(const struct device *dev)
{
    struct spi_stm32_data *data = dev->data;
    int res = -1;

    while (1) {
        res = k_sem_take(&data->status_sem, K_MSEC(1000));
        if (res != 0) {
            return res;
        }
        ...
Just replace K_MSEC(1000) with K_FOREVER.
Now running spitest at 100 kHz yields the following result:
The transfer works properly at 100 kHz. At 10 MHz the transfer sometimes works
and sometimes does not:
At 24 MHz the transfer is completely corrupted. We looked for a solution in
Zephyr Issues and Pull Requests but found nothing useful.
Looking at Zephyr’s SPI driver code, we discovered that every call to spi_write
causes many things to happen. Among others, the SPI controller is reconfigured
every single time. During this process, the SPI controller is disabled and
re-enabled, which is quite suspicious.
Reading STM32 documentation
I searched through the STM32 documentation for information about high-speed
SPI. The most helpful documents were the STM32L4 series reference manual
(RM0351) and AN5543. Section 4.2 of AN5543 describes various aspects of handling
high-speed communication, and section 4.2.2 describes what happens when SPI is
disabled.
The main problem here is that Zephyr (as well as the STM32 HAL) re-configures
SPI before each transaction, doing a configure-enable-transmit-disable cycle on
each SPI session. While this is fine for a master, a slave must respect the
timings imposed by the master, so disabling SPI should be avoided unless it is
really needed.
The problem becomes even more evident when we want to implement the TPM
protocol, as we don’t know the size (or the direction) of the data payload in
advance. Each TPM frame starts with a 4-byte header that tells us the size of
the transfer and its direction (read from or write to a register):
After we read the header, we disable SPI, which causes a few things:
- MISO is left floating (we have SPI v1.3 on STM32L4)
- we introduce additional delay by re-configuring SPI
Fixing SPI
We decided to continue the tests using only HAL and STM32CubeIDE (we plan to
port the solution back to Zephyr). From earlier tests, we already know that HAL
also does not work correctly, but it is easier to roll out a custom solution.
So, I created a new STM32CubeMX project and set up the SPI2 controller through
the graphical configuration manager. Basic settings involve configuring SPI as a
Full-Duplex Slave, configuring NSS (Chip Select) pin as input, setting 8-bit
frame length (as required by TPM spec), and setting up DMA channels. All other
settings are left at their defaults.
STM32CubeMX generates code that performs hardware initialization, and we are
ready to do SPI transactions using the HAL_SPI_TransmitReceive_DMA function.
Let’s look at the implementation:
HAL_StatusTypeDef HAL_SPI_TransmitReceive_DMA(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size)
{
  ...
  /* Reset the threshold bit */
  CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX | SPI_CR2_LDMARX);

  /* The packing mode management is enabled by the DMA settings according the spi data size */
  if (hspi->Init.DataSize > SPI_DATASIZE_8BIT)
  {
    /* Set fiforxthreshold according the reception data length: 16bit */
    CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
  }
  else
  {
    /* Set RX Fifo threshold according the reception data length: 8bit */
    SET_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

    if (hspi->hdmatx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
    {
      if ((hspi->TxXferSize & 0x1U) == 0x0U)
      {
        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
        hspi->TxXferCount = hspi->TxXferCount >> 1U;
      }
      else
      {
        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
        hspi->TxXferCount = (hspi->TxXferCount >> 1U) + 1U;
      }
    }

    if (hspi->hdmarx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
    {
      /* Set RX Fifo threshold according the reception data length: 16bit */
      CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);

      if ((hspi->RxXferCount & 0x1U) == 0x0U)
      {
        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
        hspi->RxXferCount = hspi->RxXferCount >> 1U;
      }
      else
      {
        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
        hspi->RxXferCount = (hspi->RxXferCount >> 1U) + 1U;
      }
    }
  }

  /* Check if we are in Rx only or in Rx/Tx Mode and configure the DMA transfer complete callback */
  if (hspi->State == HAL_SPI_STATE_BUSY_RX)
  {
    /* Set the SPI Rx DMA Half transfer complete callback */
    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfReceiveCplt;
    hspi->hdmarx->XferCpltCallback = SPI_DMAReceiveCplt;
  }
  else
  {
    /* Set the SPI Tx/Rx DMA Half transfer complete callback */
    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfTransmitReceiveCplt;
    hspi->hdmarx->XferCpltCallback = SPI_DMATransmitReceiveCplt;
  }

  /* Set the DMA error callback */
  hspi->hdmarx->XferErrorCallback = SPI_DMAError;

  /* Set the DMA AbortCpltCallback */
  hspi->hdmarx->XferAbortCallback = NULL;

  /* Enable the Rx DMA Stream/Channel */
  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)hspi->pRxBuffPtr,
                                 hspi->RxXferCount))
  {
    /* Update SPI error code */
    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
    errorcode = HAL_ERROR;
    hspi->State = HAL_SPI_STATE_READY;
    goto error;
  }

  /* Enable Rx DMA Request */
  SET_BIT(hspi->Instance->CR2, SPI_CR2_RXDMAEN);

  /* Set the SPI Tx DMA transfer complete callback as NULL because the communication closing
  is performed in DMA reception complete callback */
  hspi->hdmatx->XferHalfCpltCallback = NULL;
  hspi->hdmatx->XferCpltCallback = NULL;
  hspi->hdmatx->XferErrorCallback = NULL;
  hspi->hdmatx->XferAbortCallback = NULL;

  /* Enable the Tx DMA Stream/Channel */
  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)hspi->pTxBuffPtr, (uint32_t)&hspi->Instance->DR,
                                 hspi->TxXferCount))
  {
    /* Update SPI error code */
    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
    errorcode = HAL_ERROR;
    hspi->State = HAL_SPI_STATE_READY;
    goto error;
  }

  /* Check if the SPI is already enabled */
  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
  {
    /* Enable SPI peripheral */
    __HAL_SPI_ENABLE(hspi);
  }

  /* Enable the SPI Error Interrupt Bit */
  __HAL_SPI_ENABLE_IT(hspi, (SPI_IT_ERR));

  /* Enable Tx DMA Request */
  SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);
  ...
}
What this code does:
- Initialize callbacks (like transfer complete callbacks)
- Configure SPI registers
- Initialize DMA channels and enable DMA on SPI controller (RXDMAEN and TXDMAEN
  bits)
- Enable SPI interrupts
- Enable SPI controller
Many of these things could be done once and never changed. Doing them on every
transfer introduces additional overhead. Moreover, SPI is re-enabled before each
transaction and disabled after it. This worsens the overhead and causes other
problems described in AN5543:
SPI versions 1.x.x: the peripheral takes no control of the associated GPIOs
when it is disabled. The SPI signals float if they are not supported by
external resistor and if they are not reconfigured and they are kept at
alternate function configuration.
At principle, the SPI must not be disabled before the communication is fully
completed and it should be as short as possible at slave, especially between
sessions, to avoid missing any communication.
The Nucleo L476RG that we use has SPI v1.3, which does not drive MISO when
disabled. We have observed the MISO line changing unexpectedly during SPI idle
periods, presumably because of this.
HAL_SPI_TransmitReceive_DMA sets up interrupt callbacks that handle error
detection and the end-of-transaction condition (SPI_EndRxTxTransaction), which
involves waiting for the master to stop sending data and for the SPI bus to
become idle. This adds more unnecessary overhead, as we don’t have to wait for
the bus to become idle. We can process data as soon as RX DMA completes and
queue more data as soon as TX DMA completes.
A transaction in the TPM protocol consists of three steps: TPM header
transmission, flow control, and data payload transmission. After receiving the
header, we know the size of the entire transaction, removing the need for
end-of-transaction checking.
I created a stripped-down version of HAL_SPI_TransmitReceive_DMA:
The code size is reduced to almost a minimum - still, some optimizations could
be done in HAL_DMA_Start_IT. Currently, we transmit only 4 bytes of static
data to test whether MCU can handle this before going further.
I’m using a slightly different initialization sequence than HAL: HAL enables
RXDMAEN after programming the channel and TXDMAEN after enabling SPI. Our
code follows the sequence described in the STM32 reference manual (RM0351).
For testing purposes, I’m using a Raspberry Pi 3B as the SPI host. Configuration
is pretty straightforward: you can enable spidev by uncommenting
dtoverlay=spi0-1cs in /boot/config.txt and rebooting. For communicating with
spidev I’m using a custom Python script:
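The original script is not reproduced here. As a stand-in, below is a minimal sketch of what the host-side helpers could look like. It assumes a transfer function with the same shape as spidev's `xfer2()` (a list of bytes in, a list of bytes out) and the usual TPM flow control, where the slave sets bit 0 of its reply once it is ready. Unlike the `tpm_read` and `tpm_write` used in the tests later in this post, these helpers take the transfer function as an extra first argument so that the sketch runs without hardware. `FakeTpm`, `wait_ready`, and `tpm_header` are hypothetical names.

```python
from typing import Callable, List

Xfer = Callable[[List[int]], List[int]]  # same shape as spidev's xfer2()


def tpm_header(is_read: bool, addr: int, length: int) -> List[int]:
    """Build the 4-byte TPM SPI header (direction, size - 1, 24-bit address)."""
    if not 1 <= length <= 64:
        raise ValueError("payload length must be 1..64 bytes")
    first = (0x80 if is_read else 0x00) | (length - 1)
    return [first, (addr >> 16) & 0xFF, (addr >> 8) & 0xFF, addr & 0xFF]


def wait_ready(xfer: Xfer, header_reply: List[int], max_polls: int = 1000) -> None:
    """Flow control: poll one byte at a time until the slave sets bit 0."""
    if header_reply[3] & 0x01:
        return  # no wait state was inserted
    for _ in range(max_polls):
        if xfer([0x00])[0] & 0x01:
            return
    raise TimeoutError("TPM kept inserting wait states")


def tpm_read(xfer: Xfer, addr: int, length: int) -> List[int]:
    wait_ready(xfer, xfer(tpm_header(True, addr, length)))
    return xfer([0x00] * length)  # clock out dummy bytes, collect the payload


def tpm_write(xfer: Xfer, addr: int, data: bytes) -> None:
    wait_ready(xfer, xfer(tpm_header(False, addr, len(data))))
    xfer(list(data))


class FakeTpm:
    """Stand-in slave for dry runs: inserts one wait state per transaction."""

    def __init__(self) -> None:
        self.regs = {}
        self.phase = "header"
        self.is_read = False
        self.addr = 0

    def xfer(self, tx: List[int]) -> List[int]:
        if self.phase == "header":
            self.is_read = bool(tx[0] & 0x80)
            self.addr = (tx[1] << 16) | (tx[2] << 8) | tx[3]
            self.phase = "wait"
            return [0x00] * 4  # bit 0 clear: wait state inserted
        if self.phase == "wait":
            self.phase = "payload"
            return [0x01]      # bit 0 set: wait state ends
        self.phase = "header"
        if self.is_read:
            return list(self.regs.get(self.addr, [0] * len(tx)))[:len(tx)]
        self.regs[self.addr] = list(tx)
        return [0xFF] * len(tx)


fake = FakeTpm()
tpm_write(fake.xfer, 0, bytes([1, 2, 3, 4]))
print(tpm_read(fake.xfer, 0, 4))
```

On real hardware, `fake.xfer` would be replaced by the `xfer2` method of an opened `spidev.SpiDev` object. Note that every `xfer2` call toggles chip select, so the header, wait state, and payload end up in separate chip-select windows.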
After running the test code, the logic analyzer showed that the transmitted data
was correct, but the Raspberry Pi didn’t receive the right data. This turned out
to be a problem with the connection between the Raspberry Pi and the Nucleo.
Initially, I could achieve stable transmission only at frequencies up to 18 MHz.
After changing the cable connections, I got stable transmission at 22 MHz.
Before, I was using two 20 cm male-to-female jumper wires for each SPI line,
with the cables and the logic analyzer probes connected through a breadboard.
Now, I have a direct connection between the Nucleo and the Raspberry Pi using a
single 20 cm female jumper wire for each line. The Nucleo pin headers extend
through both sides of the board, so I can attach the probes directly on the
backside of the Nucleo.
The work continues - implementing TPM protocol
While 22 MHz is not the frequency we aim for, I continued the tests at the
highest frequency I could reach for now (in the meantime planning to replace the
cables with better ones). I extended the code to speak the TPM protocol and
wrote a first test:
@test
def test_read():
    x = tpm_read(0, 8)
    assert x == [0] * 8
After running the test code, I immediately got an error. The logic analyzer
showed:
There are two problems here. The first problem is that the CS pin goes high
between the header, the wait states, and the payload. This was my oversight, but
fixing it is not critical as it currently does not affect communication -
deasserting the CS pin should abort the transaction, but we don’t handle this
yet. Linux’s spidev driver can be instructed not to deassert CS, but this is not
supported by the bindings I’m using, so let’s just postpone the fix.
The other problem is with the transmission itself - the Nucleo transmits wrong
data (0xff instead of 0x01) during the wait state.
To solve the problem, I went a step back. I hardcoded a few data patterns to
replicate the transfer sequence:
The main difference is that the full code both reads and writes, whereas the
hardcoded test only writes. Currently, we wait for both TX and RX DMA to
complete before re-programming the DMA channels and updating the state machine.
TX and RX are always the same size, so they should complete at about the same
time. So, instead of using interrupts for both channels, I changed the code so
that interrupts are used for TX and polling for RX (tests showed that TX DMA
usually completes first).
static void txdma_complete(DMA_HandleTypeDef *hdma)
{
  SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);

  switch (state) {
  case STATE_WAIT_HEADER:
    // Wait state got inserted while reading header.
    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_cancel, (uint32_t)&hspi->Instance->DR, sizeof waitstate_cancel);
    // We don't care what host sends during wait state, but we start DMA anyway to avoid overrun errors.
    HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
    HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)trash, sizeof trash);
    transfer_is_read = !!(header[0] & (1 << 7));
    transfer_length = (header[0] & 0x3f) + 1;
    state = STATE_WAIT_STATE_LAST;
    break;

  case STATE_WAIT_STATE_LAST:
    if (transfer_is_read) {
      HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)scratch_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
      HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
      HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)ff_buffer, transfer_length);
    } else {
      HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)ff_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
      HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
      HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)scratch_buffer, transfer_length);
    }
    state = STATE_PAYLOAD_TRANSFER;
    break;

  case STATE_PAYLOAD_TRANSFER:
    HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_insert, (uint32_t)&hspi->Instance->DR, sizeof waitstate_insert);
    HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
    HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)header, sizeof header);
    state = STATE_WAIT_HEADER;
    break;
  }
}
I start the TX transfer first, then poll for RX DMA completion before
re-programming the DMA channel. Now, the test succeeds.
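The order in which the callback queues buffers is easier to follow as a small model. The Python sketch below mirrors the three states of txdma_complete and returns which buffer goes out on MISO and which one captures MOSI for the next DMA pair. It is only an illustration of the sequencing, not firmware code, and the buffer sizes of 1 byte for waitstate_cancel and 4 bytes for waitstate_insert are assumptions, since their definitions are not shown above.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_HEADER = auto()
    WAIT_STATE_LAST = auto()
    PAYLOAD_TRANSFER = auto()


def on_tx_dma_complete(state, header):
    """Return (next_state, tx_buffer, rx_buffer, length) for the next DMA pair."""
    is_read = bool(header[0] & 0x80)
    length = (header[0] & 0x3F) + 1
    if state is State.WAIT_HEADER:
        # Header fully received: send the byte that ends the wait state.
        # Assumed size of waitstate_cancel and trash: 1 byte.
        return State.WAIT_STATE_LAST, "waitstate_cancel", "trash", 1
    if state is State.WAIT_STATE_LAST:
        # Reads send register data, writes capture it; the idle side uses ff_buffer.
        if is_read:
            return State.PAYLOAD_TRANSFER, "scratch_buffer", "ff_buffer", length
        return State.PAYLOAD_TRANSFER, "ff_buffer", "scratch_buffer", length
    # Payload done: arm reception of the next header while inserting a wait state.
    # Assumed size of waitstate_insert: 4 bytes, matching the header.
    return State.WAIT_HEADER, "waitstate_insert", "header", 4


# Walk through one 8-byte register read.
state, header = State.WAIT_HEADER, bytes([0x87, 0xD4, 0x00, 0x00])
steps = []
for _ in range(3):
    state, tx, rx, n = on_tx_dma_complete(state, header)
    steps.append((tx, rx, n))
    print(f"{tx:>16} -> MISO, MOSI -> {rx:<14} ({n} bytes)")
```

After three TX-complete events the model is back in the header state, which matches the header, wait state, payload cycle of a TPM transaction.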
Extending tests
I have basic code that can read and write data over SPI, but so far I have only
tested a read of a zeroed register. Now, it is time to extend the tests so that
we write random data of random lengths, then read the data back and check
whether it is as expected. I already got shorter cables - 10 cm instead of
20 cm - and I have stable communication at 24 MHz:
I started with something simple:
tpm_write(0, bytes([1,2,3,4,5,6,7,8]))
x = tpm_read(0, 8)
assert x == [1,2,3,4,5,6,7,8]
It failed. The first transfer succeeded, but the second did not:
I attached the debugger and saw that the app was still polling for RX DMA
completion. Looking again at the original code, I found that I had incorrectly
cleared the SPI_RXFIFO_THRESHOLD bit - it should be clear for 16-bit frame
length and set for 8-bit frame length.
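The effect of this bit can be shown with a tiny model. As I read RM0351, the RX FIFO is 32 bits deep and the threshold bit (FRXTH) selects whether the RXNE event fires when the FIFO is a quarter full (8 bits) or half full (16 bits). The sketch below models only that threshold and deliberately ignores how DMA reads drain the FIFO. All function names are made up for illustration.

```python
RXFIFO_BYTES = 4  # the RX FIFO is 32 bits deep


def rxne_threshold_bytes(frxth: bool) -> int:
    """FRXTH=1: RXNE at 1/4 full (8 bits). FRXTH=0: RXNE at 1/2 full (16 bits)."""
    return RXFIFO_BYTES // 4 if frxth else RXFIFO_BYTES // 2


def rxne_raised(fifo_bytes: int, frxth: bool) -> bool:
    """True when the FIFO level is high enough to signal 'data available'."""
    return fifo_bytes >= rxne_threshold_bytes(frxth)


def frxth_for_frame(data_bits: int) -> bool:
    """Mirror of the HAL rule quoted earlier: set FRXTH for frames up to 8 bits."""
    return data_bits <= 8


for frxth in (True, False):
    print(f"FRXTH={int(frxth)}: one byte in FIFO raises RXNE -> {rxne_raised(1, frxth)}")
```

With the bit cleared, a single received byte sitting in the FIFO does not signal that data is available, which is consistent with a byte-oriented RX DMA transfer that never finishes.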
The wait state is properly inserted and terminated, but the payload is invalid.
I split the test into two to pause the app between the write to and the read
from the register. Peeking at the scratch_buffer reveals that DMA went wrong, as
the first three bytes were lost entirely.
Moreover, we are again stuck polling for DMA completion (DMA is still waiting
for the remaining three bytes). The issue could be caused by excessive delays
between restarts of DMA transfers, so I lowered the SPI frequency to 100 kHz,
but to my surprise, the result was exactly the same. I tested different data
sizes, and the result was always the same (3 bytes lost). So, the
SPI_RXFIFO_THRESHOLD fix only moved the problem a bit further. The outcome is
still the same.
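For reference, the randomized write and read-back test described at the start of this section could be sketched as follows. To keep the example runnable without hardware, `tpm_write` and `tpm_read` are replaced here by trivial stand-ins backed by a dictionary. On the real setup they would be the SPI helpers from the test script. The seed, the round count, and the function name are arbitrary choices for this sketch.

```python
import random

registers = {}  # fake register file standing in for the device


def tpm_write(addr, data):
    """Stand-in for the real SPI write helper."""
    registers[addr] = list(data)


def tpm_read(addr, length):
    """Stand-in for the real SPI read helper."""
    return registers.get(addr, [0] * length)[:length]


def test_random_write_read(rounds=100, seed=1234):
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(rounds):
        length = rng.randint(1, 64)  # one TPM frame carries 1..64 bytes
        data = bytes(rng.randrange(256) for _ in range(length))
        tpm_write(0, data)
        assert tpm_read(0, length) == list(data), f"mismatch for {length} bytes"
    return rounds


print(test_random_write_read(), "round trips OK")
```

Seeding the generator means a failing length and data pattern can be replayed exactly while watching the bus with the logic analyzer.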
Summary
That’s all for this blog post. I got SPI working at 24 MHz when writing, but
reading is broken. Even so, this is a significant improvement. I can’t tell yet
whether SPI could work at 24 MHz on this platform - even though it works for
writes, that doesn’t mean it would work for reads and writes simultaneously.
Further work could include fixing RX on the current platform, but we could also
try different platforms, such as newer STM32 MCUs with a more recent SPI version
(and possibly a higher clock frequency), for example the STM32L5 series or the
latest STM32U5 series.
Further work will surely include implementing missing features, such as SPI bus
aborts, SPI synchronization (using the CS pin), and error recovery. Also, some
patches may be needed for Zephyr due to its suboptimal handling of SPI
transactions. Possibly, on a faster MCU, we could achieve 24 MHz without any
problems, but we could run into similar issues when trying to work at the
optional 66 MHz frequency. Also, Zephyr’s SPI API currently doesn’t support
transmission of variable-length frames. There is an open RFC issue that covers
API changes and optimized SPI handling to enable protocols that require them, so
our future work could also include working on API improvements. Lastly, we plan
to upstream all Zephyr patches (if any).
Junior embedded developer at 3mdeb. Interested in low-level development ranging from microcontroller programming to hypervisor and kernel development. In free time working on various personal open-source projects.