One of the hardest and exhaustive problems in software engineering are race
conditions. They are sneaky, cunning, and nasty, often changing form, disguised
under many different stack traces. In this article, I’ll present my naive
approach to this problem, why it wasn’t the best option, and what tool can help
to solve the issue in projects written in Go
.
The problem
One day, I was presented with the error in the
fork of the socketio library. The
socketio
is the library implementing real-time communication between a browser
and a server. In my version of the library, there was a race condition error
that sporadically led to a library-panic.
Debugging it the hard way
When the program panics, the Go
interpreter prints the stack trace on the
standard output. The stack trace was my starting point on the hunt for the bug.
A colleague also gave me the reproduction guide and the test repository with a
simple app using the library to separate the problem from a project. The bug
seemed to be connected with the
pingLoop()
and the
NextWriter()
functions. After some testing, I got three unique stack traces and had some
foggy ideas about the root of the problem, but the solution was still under my
horizon. The problem in my debugging method was that I could see only one stack
trace at the time, but race conditions are caused by two or more threads
competing about the access to the shared resource. This quickly led me to use
print statements in many places scattered in the code and lots of logs to
analyze. Gathered logs confirmed my assumptions. At some point, two threads were
fighting for access to one variable. The library was detecting the concurrency
and throwing an error. Sadly this method swiftly became similar to trying to
brute-force the solution of the problem. I had one part of the problem in the
stack trace, but couldn’t find the other half.
The better way to approach
At some point, a colleague hinted me to try -race
option of Go
.
1
2
|
export GORACE=history_size=7
go run -race main.go
|
Enabling this option was the solution to my problems. It quickly presented me
with much more information than I could hope for. When the -race
flag is
enabled, Go
carefully watches the code for any variable access, that could
result in race condition. The only issue is, it sometimes appears to not work
correctly, reporting a problem. [failed to restore the stack]
But even with
this issue, it allows us to gather very detailed information about the threads
accessing the variable at the same time.
GORACE
is the environment variable, where you can supply parameters for
-race
option. In this case case, I needed only a history_size
parameter.
Increasing history_size
allocates more memory for the goroutine stack trace,
but acceptable values are 0..7, so even with maximal history size, stack
restoration will fail from time to time. You can read more about -race
parameters in
the docs.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
|
==================
WARNING: DATA RACE
Read at 0x00c4201e0a8d by goroutine 26:
internal/race.ReadRange()
/usr/lib/go-1.10/src/internal/race/race.go:45 +0x42
syscall.Write()
/usr/lib/go-1.10/src/syscall/syscall_unix.go:183 +0xaa
internal/poll.(*FD).Write()
/usr/lib/go-1.10/src/internal/poll/fd_unix.go:256 +0x1c1
net.(*netFD).Write()
/usr/lib/go-1.10/src/net/fd_unix.go:220 +0x65
net.(*conn).Write()
/usr/lib/go-1.10/src/net/net.go:188 +0x84
net.(*TCPConn).Write()
<autogenerated>:1 +0x69
github.com/gorilla/websocket.(*Conn).write()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:391 +0x19f
github.com/gorilla/websocket.(*messageWriter).flushFrame()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:614 +0x87b
github.com/gorilla/websocket.(*messageWriter).Close()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:724 +0xba
github.com/3mdeb/socketio/engineio/parser.(*PacketEncoder).Close()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/parser/packet.go:114 +0x68
github.com/3mdeb/socketio.(*encoder).encodePacket()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/parser.go:128 +0x658
github.com/3mdeb/socketio.(*encoder).Encode()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/parser.go:82 +0xc8
github.com/3mdeb/socketio.(*socket).send()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:71 +0x22d
github.com/3mdeb/socketio.(*socketHandler).Emit()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/handler.go:140 +0x3ec
github.com/3mdeb/socketio.(*socket).Emit()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:49 +0xa5
main.main.func1.1()
/home/ibagnucki/git/go-socketio-testing/main.go:66 +0x4e3
runtime.call32()
/usr/lib/go-1.10/src/runtime/asm_amd64.s:573 +0x3a
reflect.Value.Call()
/usr/lib/go-1.10/src/reflect/value.go:308 +0xc0
github.com/3mdeb/socketio.(*caller).Call()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/caller.go:82 +0x502
github.com/3mdeb/socketio.(*socketHandler).onPacket()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/handler.go:307 +0x532
github.com/3mdeb/socketio.(*socket).loop()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:128 +0x3f7
github.com/3mdeb/socketio.(*Server).loop.func1()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:94 +0x38
Previous write at 0x00c4201e0a8d by goroutine 30:
github.com/gorilla/websocket.(*messageWriter).flushFrame()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:593 +0xaec
github.com/gorilla/websocket.(*messageWriter).Close()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:724 +0xba
github.com/gorilla/websocket.(*Conn).beginMessage()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:473 +0x452
github.com/gorilla/websocket.(*Conn).NextWriter()
/home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:513 +0xa4
github.com/3mdeb/socketio/engineio/websocket.(*Server).NextWriter()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:65 +0xd7
github.com/3mdeb/socketio/engineio/transport.(Server).NextWriter-fm()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:207 +0x6b
github.com/3mdeb/socketio/engineio.(*serverConn).OnPacket()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:219 +0x228
github.com/3mdeb/socketio/engineio/websocket.(*Server).serveHTTP()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:99 +0x11f
Goroutine 26 (running) created at:
github.com/3mdeb/socketio.(*Server).loop()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:93 +0xfc
Goroutine 30 (running) created at:
github.com/3mdeb/socketio/engineio/websocket.NewServer()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:48 +0x187
github.com/3mdeb/socketio/engineio.(*serverConn).ServeHTTP()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:185 +0x23d
github.com/3mdeb/socketio/engineio.(*Server).ServeHTTP()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server.go:148 +0x291
github.com/3mdeb/socketio.(*Server).ServeHTTP()
/home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:78 +0x6c
net/http.(*ServeMux).ServeHTTP()
/usr/lib/go-1.10/src/net/http/server.go:2340 +0x9f
net/http.serverHandler.ServeHTTP()
/usr/lib/go-1.10/src/net/http/server.go:2697 +0xb9
net/http.(*conn).serve()
/usr/lib/go-1.10/src/net/http/server.go:1830 +0x7dc
==================
|
This tool quickly pointed me to the problematical place in code. It showed me,
that the problem is located in two functions that lacked a mutex shared with the
NextWriter()
function. The solution to the problem was to add mutex to the
Write()
and
Close()
functions.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
// Write writes bytes p.
func (e *PacketEncoder) Write(p []byte) (int, error) {
WriterLocker.Lock()
defer WriterLocker.Unlock()
return e.w.Write(p)NextWriter
}
// Close closes the encoder.
func (e *PacketEncoder) Close() error {
WriterLocker.Lock()
defer WriterLocker.Unlock()
if e.closer != nil {
return e.closer.Close()
}
return nil
}
|
Summary
Race conditions are quite challenging problems. Solving them may be a lot of
work, but if you know well your toolset, it can get easier. If you are working
in an environment you are not familiar with, look for available tools.
Especially with Go
, use -race
flag.
If you think we can help in improving the security of your firmware or you
looking for someone who can boost your product by leveraging advanced features
of used hardware platform, feel free to
book a call with us or
drop us email to contact<at>3mdeb<dot>com
. If you are interested in similar
content feel free to sign up for our newsletter
Igor Bagnucki
Junior Embedded Systems Developer at 3mdeb. Passionate about wild tent trips in mountains and embedded software engineering.