Pushing hardware to its limits
In an era of fast, efficient processors, performance is one of the most
important criteria when choosing and working with hardware. We want our
computers to execute their tasks as quickly as possible. But what really
influences the performance of our platforms? One might say it all comes down to
the processor manufacturer’s design. In this post, I will show you how firmware
can push your silicon to a higher performance level. Using the PC Engines apu2c4
platform as an example, I will present the Core Performance Boost feature.

Core Performance Boost (CPB) is a feature that allows a processor core to run
above its nominal frequency. Similarly to Intel’s Turbo Boost Technology, AMD
Core Performance Boost temporarily raises the frequency of a single core when
the operating system requests the highest processor performance.
Enabling the CPB feature is relatively easy, because coreboot initializes the
apu2 processor with AGESA, AMD’s proprietary initialization code, which already
supports CPB initialization.
To enable CPB, add the following lines (marked with +) to the OEM customization
code in src/mainboard/pcengines/apu2/OemCustomize.c:
|
VOID
OemCustomizeInitEarly (
IN OUT AMD_EARLY_PARAMS *InitEarly
)
{
InitEarly->GnbConfig.PcieComplexList = &PcieComplex;
+ InitEarly->PlatformConfig.CStateMode = CStateModeC6;
+ InitEarly->PlatformConfig.CpbMode = CpbModeAuto;
}
|
These values will be passed to AGESA, which will handle initialization of the
CPB feature.
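A possible sanity check from a running Linux system is to look at the CpbDis bit of the core's Hardware Configuration Register (HWCR). The sketch below is not part of the firmware change. It assumes that CpbDis is bit 25 of MSR 0xC0010015 (the bit the Linux kernel headers call MSR_K7_HWCR_CPB_DIS_BIT), that the msr kernel module is loaded, and that the script runs as root. The helper names are made up for illustration.

```python
import os
import struct

HWCR_MSR = 0xC0010015  # AMD Hardware Configuration Register (assumed address)
CPB_DIS_BIT = 25       # CpbDis: set when Core Performance Boost is disabled


def cpb_disabled(hwcr_value: int) -> bool:
    """Return True if the CpbDis bit is set in a raw HWCR value."""
    return bool((hwcr_value >> CPB_DIS_BIT) & 1)


def read_msr(cpu: int, msr: int) -> int:
    """Read one 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    for cpu in range(4):  # the GX-412TC has four cores
        try:
            hwcr = read_msr(cpu, HWCR_MSR)
        except OSError as err:
            print(f"cpu{cpu}: cannot read MSR ({err}); try 'modprobe msr' as root")
            continue
        state = "disabled" if cpb_disabled(hwcr) else "not disabled"
        print(f"cpu{cpu}: CPB {state} in HWCR")
```

A cleared CpbDis bit only shows that boost is not switched off at this level. It does not by itself prove that the core actually reaches a boosted frequency, which is what the measurements in the rest of this post are for.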
How can the performance gain be demonstrated? First, I ran a few tests using
memtest86+ in BIOS and Linux utilities such as stress/stress-ng and dd. I also
ran one benchmark to show how performance changes when the CPB feature is
enabled.
All tests were performed on Debian Linux installed on an mSATA SSD:
|
Linux apu2 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
|
CPB disabled
First, let’s try the reference v4.9.0.1 firmware without CPB:
|
$ stress -c 1 &
$ watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
600000
600000
1000000
600000
$ stress-ng --cpu 1 --cpu-method matrixprod --timeout 30 --metrics
stress-ng: info: [493] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [493] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [493] cpu 580 30.02 29.99 0.00 19.32 19.34
|
The frequency during the stress test is limited to 1000 MHz, and the
single-core run totals 580 bogo ops.
Another test is a raw memory copy with dd:
|
dd if=/dev/zero of=/dev/null bs=64k count=1M
68719476736 bytes (69 GB, 64 GiB) copied, 30.2523 s, 2.3 GB/s
|
Memtest86+ - CPB disabled
Memtest86+:
|
Memtest86+ 5.01 coreboot 002| AMD GX-412TC SOC
CLK: 998.2MHz (X64 Mode) | Pass 1%
L1 Cache: 32K 14058 MB/s | Test 66% #########################
L2 Cache: 2048K 5015 MB/s | Test #3 [Moving inversions, 1s & 0s Parallel]
L3 Cache: None | Testing: 2048M - 3327M 1279M of 4078M
Memory : 4078M 1434 MB/s | Pattern: 00000000 | Time: 0:00:43
----------------------------------------------------------------------
Core#: 0 (SMP: Disabled) | CPU Temp | RAM: 666 MHz (DDR3-1333) - BCLK: 100
State: - Running... | 56 C | Timings: CAS 9-9-10-24 @ 64-bit Mode
Cores: 1 Active / 1 Total (Run: All) | Pass: 0 Errors: 0
------------------------------------------------------------------------------
...
PC Engines apu2
(ESC)exit (c)configuration (SP)scroll_lock (CR)scroll_unlock
|
Notice the cache and memory speeds:
|
L1 Cache: 32K 14058 MB/s
L2 Cache: 2048K 5015 MB/s
Memory : 4078M 1434 MB/s
|
UnixBench benchmark - CPB disabled
I have also selected UnixBench to test the processor performance.
How to run:
|
# it may be necessary to install a few packages
apt-get install libx11-dev libgl1-mesa-dev libxext-dev perl perl-modules make git
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench/
./Run
|
Running the benchmark takes a while. Be patient.
Results:
|
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: apu2: GNU/Linux
OS: GNU/Linux -- 4.9.0-8-amd64 -- #1 SMP Debian 4.9.130-2 (2018-10-27)
Machine: x86_64 (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
CPU 0: AMD GX-412TC SOC (1996.8 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 1: AMD GX-412TC SOC (1996.8 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 2: AMD GX-412TC SOC (1996.8 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 3: AMD GX-412TC SOC (1996.8 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
16:11:24 up 2 min, 1 user, load average: 0.05, 0.07, 0.02; runlevel 2019-01-21
------------------------------------------------------------------------
Benchmark Run: Sat Jan 19 2019 16:11:24 - 16:39:27
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 5792755.2 lps (10.0 s, 7 samples)
Double-Precision Whetstone 1007.6 MWIPS (10.1 s, 7 samples)
Execl Throughput 746.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 117729.6 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 33167.2 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 296813.6 KBps (30.0 s, 2 samples)
Pipe Throughput 335334.9 lps (10.0 s, 7 samples)
Pipe-based Context Switching 16882.6 lps (10.0 s, 7 samples)
Process Creation 1652.4 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 1823.6 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 604.5 lpm (60.0 s, 2 samples)
System Call Overhead 432478.6 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 5792755.2 496.4
Double-Precision Whetstone 55.0 1007.6 183.2
Execl Throughput 43.0 746.9 173.7
File Copy 1024 bufsize 2000 maxblocks 3960.0 117729.6 297.3
File Copy 256 bufsize 500 maxblocks 1655.0 33167.2 200.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 296813.6 511.7
Pipe Throughput 12440.0 335334.9 269.6
Pipe-based Context Switching 4000.0 16882.6 42.2
Process Creation 126.0 1652.4 131.1
Shell Scripts (1 concurrent) 42.4 1823.6 430.1
Shell Scripts (8 concurrent) 6.0 604.5 1007.6
System Call Overhead 15000.0 432478.6 288.3
========
System Benchmarks Index Score 258.7
------------------------------------------------------------------------
Benchmark Run: Sat Jan 19 2019 16:39:27 - 17:07:34
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 21225450.9 lps (10.0 s, 7 samples)
Double-Precision Whetstone 3641.0 MWIPS (10.0 s, 7 samples)
Execl Throughput 3435.4 lps (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 148725.9 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 38379.1 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 412590.3 KBps (30.0 s, 2 samples)
Pipe Throughput 1204545.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 103110.0 lps (10.0 s, 7 samples)
Process Creation 7676.4 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5091.8 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 643.2 lpm (60.2 s, 2 samples)
System Call Overhead 1469507.7 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 21225450.9 1818.8
Double-Precision Whetstone 55.0 3641.0 662.0
Execl Throughput 43.0 3435.4 798.9
File Copy 1024 bufsize 2000 maxblocks 3960.0 148725.9 375.6
File Copy 256 bufsize 500 maxblocks 1655.0 38379.1 231.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 412590.3 711.4
Pipe Throughput 12440.0 1204545.3 968.3
Pipe-based Context Switching 4000.0 103110.0 257.8
Process Creation 126.0 7676.4 609.2
Shell Scripts (1 concurrent) 42.4 5091.8 1200.9
Shell Scripts (8 concurrent) 6.0 643.2 1072.0
System Call Overhead 15000.0 1469507.7 979.7
========
System Benchmarks Index Score 688.9
|
Pay attention to the System Benchmarks Index Scores.
CPB enabled
Let’s now try the firmware with CPB enabled:
|
$ stress -c 1 &
$ watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
600000
600000
1000000
600000
|
The frequency reported by sysfs, unfortunately, did not change. Let’s try
stress-ng:
|
$ stress-ng --cpu 1 --cpu-method matrixprod --timeout 30 --metrics
stress-ng: info: [526] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [526] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [526] cpu 591 30.03 30.00 0.00 19.68 19.70
|
Stress-ng launched on 1 core reported 591 bogo ops, which is only about 2% more
than without CPB (580 bogo ops). That is a negligible difference.
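For repeated runs it can be handy to pull the numbers out of the stress-ng output rather than read them by eye. The following is a minimal sketch that assumes the metrics row keeps the column layout shown in the listings above (other stress-ng versions may print more columns). The function names are illustrative only.

```python
def parse_metrics(line: str) -> dict:
    """Parse one stress-ng --metrics result row into named fields."""
    fields = line.split()
    # Assumed layout, taken from the listings in this post:
    # stress-ng: info: [pid] <stressor> <bogo ops> <real> <usr> <sys> <ops/s real> <ops/s cpu>
    stressor, bogo_ops, real, usr, sys_, per_real, per_cpu = fields[-7:]
    return {
        "stressor": stressor,
        "bogo_ops": int(bogo_ops),
        "real_s": float(real),
        "usr_s": float(usr),
        "sys_s": float(sys_),
        "bogo_ops_per_s_real": float(per_real),
        "bogo_ops_per_s_cpu": float(per_cpu),
    }


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


baseline = parse_metrics("stress-ng: info: [493] cpu 580 30.02 29.99 0.00 19.32 19.34")
boosted = parse_metrics("stress-ng: info: [526] cpu 591 30.03 30.00 0.00 19.68 19.70")

change = percent_change(baseline["bogo_ops"], boosted["bogo_ops"])
print(f"bogo ops: {baseline['bogo_ops']} -> {boosted['bogo_ops']} ({change:.1f}%)")
```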
Raw memory dd:
|
dd if=/dev/zero of=/dev/null bs=64k count=1M
68719476736 bytes (69 GB, 64 GiB) copied, 23.5088 s, 2.9 GB/s
|
We can see that the speed increased from ~2.3 GB/s to ~2.9 GB/s (roughly a 29%
increase, based on the elapsed times). Compared to the results without CPB,
this is evidence that the feature works: when the boost is active, the core
frequency should increase, and performance along with it.
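Because dd rounds the throughput it prints, the gain is easier to judge from the byte count and the elapsed times. A small sketch of that arithmetic, using only the figures from the two dd runs above:

```python
BYTES_COPIED = 68_719_476_736  # 1M blocks of 64 KiB, as reported by dd


def throughput_gb_s(num_bytes: int, seconds: float) -> float:
    """Throughput in decimal gigabytes per second, the unit dd prints."""
    return num_bytes / seconds / 1e9


without_cpb = throughput_gb_s(BYTES_COPIED, 30.2523)
with_cpb = throughput_gb_s(BYTES_COPIED, 23.5088)
gain = (with_cpb / without_cpb - 1) * 100

print(f"without CPB: {without_cpb:.2f} GB/s")
print(f"with CPB:    {with_cpb:.2f} GB/s")
print(f"gain:        {gain:.1f}%")
```

Since both runs copy the same number of bytes, the gain depends only on the ratio of the two elapsed times.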
Memtest86+ - CPB enabled
Launching memtest86+ in BIOS:
|
Memtest86+ 5.01 coreboot 002| AMD GX-412TC SOC
CLK: 998.2MHz (X64 Mode) | Pass 0%
L1 Cache: 32K 21699 MB/s | Test 38% ##############
L2 Cache: 2048K 6980 MB/s | Test #3 [Moving inversions, 1s & 0s Parallel]
L3 Cache: None | Testing: 1024K - 2048M 2047M of 4078M
Memory : 4078M 1992 MB/s | Pattern: ffffffff | Time: 0:00:19
------------------------------------------------------------------------------
Core#: 0 (SMP: Disabled) | CPU Temp | RAM: 666 MHz (DDR3-1333) - BCLK: 100
State: - Running... | 52 C | Timings: CAS 9-9-10-24 @ 64-bit Mode
Cores: 1 Active / 1 Total (Run: All) | Pass: 0 Errors: 0
------------------------------------------------------------------------------
...
PC Engines apu2
(ESC)exit (c)configuration (SP)scroll_lock (CR)scroll_unlock
|
Notice how the memory and cache speeds changed:
|
L1 Cache: 32K 14058 MB/s ---> L1 Cache: 32K 21699 MB/s (~54% change)
L2 Cache: 2048K 5015 MB/s ---> L2 Cache: 2048K 6980 MB/s (~39% change)
Memory : 4078M 1434 MB/s ---> Memory : 4078M 1992 MB/s (~39% change)
|
The lowest gain from CPB is about 39%, which is quite significant.
UnixBench benchmark - CPB enabled
Running the benchmark with boost enabled:
|
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: apu2: GNU/Linux
OS: GNU/Linux -- 4.9.0-8-amd64 -- #1 SMP Debian 4.9.130-2 (2018-10-27)
Machine: x86_64 (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
CPU 0: AMD GX-412TC SOC (1996.1 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 1: AMD GX-412TC SOC (1996.1 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 2: AMD GX-412TC SOC (1996.1 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
CPU 3: AMD GX-412TC SOC (1996.1 bogomips)
Hyper-Threading, x86-64, MMX, AMD MMX, Physical Address Ext, SYSENTER/SYSEXIT, AMD virtualization, SYSCALL/SYSRET
15:03:32 up 1 min, 1 user, load average: 0.32, 0.10, 0.03; runlevel 2019-01-21
------------------------------------------------------------------------
Benchmark Run: Sat Jan 19 2019 15:03:32 - 15:31:32
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 7074813.7 lps (10.0 s, 7 samples)
Double-Precision Whetstone 1278.1 MWIPS (10.0 s, 7 samples)
Execl Throughput 846.3 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 151426.3 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 42870.3 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 384498.1 KBps (30.0 s, 2 samples)
Pipe Throughput 430439.7 lps (10.0 s, 7 samples)
Pipe-based Context Switching 19094.7 lps (10.0 s, 7 samples)
Process Creation 1869.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 1934.0 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 612.1 lpm (60.1 s, 2 samples)
System Call Overhead 572974.4 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 7074813.7 606.2
Double-Precision Whetstone 55.0 1278.1 232.4
Execl Throughput 43.0 846.3 196.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 151426.3 382.4
File Copy 256 bufsize 500 maxblocks 1655.0 42870.3 259.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 384498.1 662.9
Pipe Throughput 12440.0 430439.7 346.0
Pipe-based Context Switching 4000.0 19094.7 47.7
Process Creation 126.0 1869.1 148.3
Shell Scripts (1 concurrent) 42.4 1934.0 456.1
Shell Scripts (8 concurrent) 6.0 612.1 1020.2
System Call Overhead 15000.0 572974.4 382.0
========
System Benchmarks Index Score 310.2
------------------------------------------------------------------------
Benchmark Run: Sat Jan 19 2019 15:31:32 - 15:59:38
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 21308677.1 lps (10.0 s, 7 samples)
Double-Precision Whetstone 3647.7 MWIPS (10.0 s, 7 samples)
Execl Throughput 3445.1 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 144800.2 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 40507.7 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 399019.8 KBps (30.0 s, 2 samples)
Pipe Throughput 1203354.7 lps (10.0 s, 7 samples)
Pipe-based Context Switching 103772.6 lps (10.0 s, 7 samples)
Process Creation 7718.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5093.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 644.0 lpm (60.2 s, 2 samples)
System Call Overhead 1471125.8 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 21308677.1 1825.9
Double-Precision Whetstone 55.0 3647.7 663.2
Execl Throughput 43.0 3445.1 801.2
File Copy 1024 bufsize 2000 maxblocks 3960.0 144800.2 365.7
File Copy 256 bufsize 500 maxblocks 1655.0 40507.7 244.8
File Copy 4096 bufsize 8000 maxblocks 5800.0 399019.8 688.0
Pipe Throughput 12440.0 1203354.7 967.3
Pipe-based Context Switching 4000.0 103772.6 259.4
Process Creation 126.0 7718.8 612.6
Shell Scripts (1 concurrent) 42.4 5093.9 1201.4
Shell Scripts (8 concurrent) 6.0 644.0 1073.4
System Call Overhead 15000.0 1471125.8 980.8
========
System Benchmarks Index Score 689.8
|
The single-copy score has clearly increased, while the four-copy score is
practically unchanged:
- for 1 parallel copy of tests the score increased from 258.7 to 310.2 (~20% change)
- for 4 parallel copies of tests the score increased from 688.9 to 689.8 (~0% change)
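To see where the overall score comes from, it helps to recompute it. UnixBench divides each result by its baseline and multiplies by 10 to get the index, and the final score is the geometric mean of those indices. The sketch below redoes that calculation for the two single-copy runs, with the index values copied from the listings above. The helper names are illustrative.

```python
import math

# INDEX column of the single-copy runs, in the order UnixBench prints them.
INDEX_CPB_OFF = [496.4, 183.2, 173.7, 297.3, 200.4, 511.7,
                 269.6, 42.2, 131.1, 430.1, 1007.6, 288.3]
INDEX_CPB_ON = [606.2, 232.4, 196.8, 382.4, 259.0, 662.9,
                346.0, 47.7, 148.3, 456.1, 1020.2, 382.0]


def index_value(result: float, baseline: float) -> float:
    """Index of a single test: result relative to the baseline, times 10."""
    return result / baseline * 10


def geometric_mean(values: list) -> float:
    """Geometric mean, computed through logarithms to avoid huge products."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


score_off = geometric_mean(INDEX_CPB_OFF)
score_on = geometric_mean(INDEX_CPB_ON)

print(f"Dhrystone index (CPB off): {index_value(5792755.2, 116700.0):.1f}")
print(f"score without CPB: {score_off:.1f}")
print(f"score with CPB:    {score_on:.1f}")
print(f"gain:              {(score_on / score_off - 1) * 100:.1f}%")
```

Because the score is a geometric mean, tests that barely benefit from the boost (such as the shell script tests) pull the overall gain below what the purely CPU-bound tests achieve.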
Summary
My experiments show that enabling the CPB feature does increase performance.
Some tools did not report any change, but software may simply fail to report it
correctly. stress and stress-ng do not seem to be the right tools to measure
this gain.
Another reason for the misleading reports is that the boosted core performance
states (P-states) are not described in the ACPI (Advanced Configuration and
Power Interface) tables, and according to the AMD BIOS and Kernel Developer’s
Guide they should not be. As a result, the operating system is not aware that
the processor has transitioned to a higher, boosted performance state.
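One way around the missing P-state information is to estimate the effective clock from the APERF and MPERF counters (MSRs 0xE8 and 0xE7), which is the approach tools such as turbostat take. MPERF advances at the nominal rate and APERF at the actual rate while the core is running, so the ratio of their deltas scales the nominal frequency. The sketch below only shows the arithmetic. Whether this SoC exposes the counters is an assumption I have not verified here, and the sample values are synthetic, not measurements.

```python
MPERF_MSR = 0xE7  # counts at the nominal (non-boosted) frequency
APERF_MSR = 0xE8  # counts at the actual core frequency


def effective_mhz(base_mhz: float,
                  aperf_start: int, mperf_start: int,
                  aperf_end: int, mperf_end: int) -> float:
    """Average core clock over an interval from two APERF/MPERF samples."""
    delta_mperf = mperf_end - mperf_start
    if delta_mperf <= 0:
        raise ValueError("MPERF did not advance between the two samples")
    return base_mhz * (aperf_end - aperf_start) / delta_mperf


# Synthetic example: APERF advanced 1.4x as fast as MPERF during the interval.
estimate = effective_mhz(1000, 0, 0, 1_400_000_000, 1_000_000_000)
print(f"effective clock: {estimate:.0f} MHz")
```

In practice the two samples would be read from /dev/cpu/N/msr before and after a load interval on the core under test.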
CPB raises the frequency of a single core only while the remaining cores are
not under load. The overall UnixBench gain is 20%, which would correspond to a
frequency increase from 1000 MHz to 1200 MHz, whereas the processor
specification states that the boosted frequency should be 1400 MHz. The
memtest86+ result (approximately 40% memory speed gain) is in line with that
figure. The benchmark result is also biased by the background work the OS must
do alongside the tests.
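The frequency estimate above can be reproduced with a few lines. This is a rough sketch that assumes performance scales linearly with the core clock, which is a simplification. The input figures are the before and after values from the UnixBench and memtest86+ listings in this post.

```python
NOMINAL_MHZ = 1000


def implied_mhz(before: float, after: float, nominal_mhz: float = NOMINAL_MHZ) -> float:
    """Clock that would explain the speedup if performance scaled with frequency."""
    return nominal_mhz * after / before


measurements = {
    "UnixBench index, 1 copy": (258.7, 310.2),
    "memtest86+ L1 cache MB/s": (14058, 21699),
    "memtest86+ L2 cache MB/s": (5015, 6980),
    "memtest86+ memory MB/s": (1434, 1992),
}

for name, (before, after) in measurements.items():
    print(f"{name:26s} -> ~{implied_mhz(before, after):.0f} MHz")
```

The estimates do not all agree with each other, so treat them as a rough indication rather than a frequency measurement.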
The feature will be introduced in the v4.9.0.2 firmware release for PC Engines.
I hope this post was useful for you. Please try it out yourselves and feel free
to share your results.
If you think we can help in improving the performance of your platform, or you
are looking for someone who can boost your product by leveraging advanced
features of the hardware platform you use, feel free to book a call with us or
drop us an email at contact<at>3mdeb<dot>com. Are you interested in similar
content? Feel free to sign up for our newsletter.
Michał Żygowski
Firmware Engineer with networking background. Feels comfortable with low-level development using C/C++ and assembly. Interested in advanced hardware features, security and coreboot. Core developer of coreboot. Maintainer of Braswell SoC, PC Engines, Protectli and Libretrend platforms.