Case study - debugging race condition in golang

One of the hardest and exhaustive problems in software engineering are race conditions. They are sneaky, cunning, and nasty, often changing form, disguised under many different stack traces. In this article, I’ll present my naive approach to this problem, why it wasn’t the best option, and what tool can help to solve the issue in projects written in Go.

The problem

One day, I was presented with the error in the fork of the socketio library. The socketio is the library implementing real-time communication between a browser and a server. In my version of the library, there was a race condition error that sporadically led to a library-panic.

Debugging it the hard way

When the program panics, the Go interpreter prints the stack trace on the standard output. The stack trace was my starting point on the hunt for the bug. A colleague also gave me the reproduction guide and the test repository with a simple app using the library to separate the problem from a project. The bug seemed to be connected with the pingLoop() and the NextWriter() functions. After some testing, I got three unique stack traces and had some foggy ideas about the root of the problem, but the solution was still under my horizon. The problem in my debugging method was that I could see only one stack trace at the time, but race conditions are caused by two or more threads competing about the access to the shared resource. This quickly led me to use print statements in many places scattered in the code and lots of logs to analyze. Gathered logs confirmed my assumptions. At some point, two threads were fighting for access to one variable. The library was detecting the concurrency and throwing an error. Sadly this method swiftly became similar to trying to brute-force the solution of the problem. I had one part of the problem in the stack trace, but couldn’t find the other half.

The better way to approach

At some point, a colleague hinted me to try -race option of Go.

1
2
export GORACE=history_size=7
go run -race main.go

Enabling this option was the solution to my problems. It quickly presented me with much more information than I could hope for. When the -race flag is enabled, Go carefully watches the code for any variable access, that could result in race condition. The only issue is, it sometimes appears to not work correctly, reporting a problem. [failed to restore the stack] But even with this issue, it allows us to gather very detailed information about the threads accessing the variable at the same time.

GORACE is the environment variable, where you can supply parameters for -race option. In this case case, I needed only a history_size parameter. Increasing history_size allocates more memory for the goroutine stack trace, but acceptable values are 0..7, so even with maximal history size, stack restoration will fail from time to time. You can read more about -race parameters in the docs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
==================
WARNING: DATA RACE
Read at 0x00c4201e0a8d by goroutine 26:
  internal/race.ReadRange()
      /usr/lib/go-1.10/src/internal/race/race.go:45 +0x42
  syscall.Write()
      /usr/lib/go-1.10/src/syscall/syscall_unix.go:183 +0xaa
  internal/poll.(*FD).Write()
      /usr/lib/go-1.10/src/internal/poll/fd_unix.go:256 +0x1c1
  net.(*netFD).Write()
      /usr/lib/go-1.10/src/net/fd_unix.go:220 +0x65
  net.(*conn).Write()
      /usr/lib/go-1.10/src/net/net.go:188 +0x84
  net.(*TCPConn).Write()
      <autogenerated>:1 +0x69
  github.com/gorilla/websocket.(*Conn).write()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:391 +0x19f
  github.com/gorilla/websocket.(*messageWriter).flushFrame()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:614 +0x87b
  github.com/gorilla/websocket.(*messageWriter).Close()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:724 +0xba
  github.com/3mdeb/socketio/engineio/parser.(*PacketEncoder).Close()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/parser/packet.go:114 +0x68
  github.com/3mdeb/socketio.(*encoder).encodePacket()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/parser.go:128 +0x658
  github.com/3mdeb/socketio.(*encoder).Encode()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/parser.go:82 +0xc8
  github.com/3mdeb/socketio.(*socket).send()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:71 +0x22d
  github.com/3mdeb/socketio.(*socketHandler).Emit()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/handler.go:140 +0x3ec
  github.com/3mdeb/socketio.(*socket).Emit()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:49 +0xa5
  main.main.func1.1()
      /home/ibagnucki/git/go-socketio-testing/main.go:66 +0x4e3
  runtime.call32()
      /usr/lib/go-1.10/src/runtime/asm_amd64.s:573 +0x3a
  reflect.Value.Call()
      /usr/lib/go-1.10/src/reflect/value.go:308 +0xc0
  github.com/3mdeb/socketio.(*caller).Call()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/caller.go:82 +0x502
  github.com/3mdeb/socketio.(*socketHandler).onPacket()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/handler.go:307 +0x532
  github.com/3mdeb/socketio.(*socket).loop()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/socket.go:128 +0x3f7
  github.com/3mdeb/socketio.(*Server).loop.func1()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:94 +0x38

Previous write at 0x00c4201e0a8d by goroutine 30:
  github.com/gorilla/websocket.(*messageWriter).flushFrame()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:593 +0xaec
  github.com/gorilla/websocket.(*messageWriter).Close()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:724 +0xba
  github.com/gorilla/websocket.(*Conn).beginMessage()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:473 +0x452
  github.com/gorilla/websocket.(*Conn).NextWriter()
      /home/ibagnucki/go/src/github.com/gorilla/websocket/conn.go:513 +0xa4
  github.com/3mdeb/socketio/engineio/websocket.(*Server).NextWriter()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:65 +0xd7
  github.com/3mdeb/socketio/engineio/transport.(Server).NextWriter-fm()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:207 +0x6b
  github.com/3mdeb/socketio/engineio.(*serverConn).OnPacket()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:219 +0x228
  github.com/3mdeb/socketio/engineio/websocket.(*Server).serveHTTP()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:99 +0x11f

Goroutine 26 (running) created at:
  github.com/3mdeb/socketio.(*Server).loop()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:93 +0xfc

Goroutine 30 (running) created at:
  github.com/3mdeb/socketio/engineio/websocket.NewServer()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/websocket/server.go:48 +0x187
  github.com/3mdeb/socketio/engineio.(*serverConn).ServeHTTP()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server_conn.go:185 +0x23d
  github.com/3mdeb/socketio/engineio.(*Server).ServeHTTP()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/engineio/server.go:148 +0x291
  github.com/3mdeb/socketio.(*Server).ServeHTTP()
      /home/ibagnucki/go/src/github.com/3mdeb/socketio/server.go:78 +0x6c
  net/http.(*ServeMux).ServeHTTP()
      /usr/lib/go-1.10/src/net/http/server.go:2340 +0x9f
  net/http.serverHandler.ServeHTTP()
      /usr/lib/go-1.10/src/net/http/server.go:2697 +0xb9
  net/http.(*conn).serve()
      /usr/lib/go-1.10/src/net/http/server.go:1830 +0x7dc
==================

This tool quickly pointed me to the problematical place in code. It showed me, that the problem is located in two functions that lacked a mutex shared with the NextWriter() function. The solution to the problem was to add mutex to the Write() and Close() functions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
// Write writes bytes p.
func (e *PacketEncoder) Write(p []byte) (int, error) {
    WriterLocker.Lock()
    defer WriterLocker.Unlock()
    return e.w.Write(p)NextWriter
}

// Close closes the encoder.
func (e *PacketEncoder) Close() error {
    WriterLocker.Lock()
    defer WriterLocker.Unlock()
    if e.closer != nil {
        return e.closer.Close()
    }
    return nil
}

Summary

Race conditions are quite challenging problems. Solving them may be a lot of work, but if you know well your toolset, it can get easier. If you are working in an environment you are not familiar with, look for available tools. Especially with Go, use -race flag.

If you think we can help in improving the security of your firmware or you looking for someone who can boost your product by leveraging advanced features of used hardware platform, feel free to book a call with us or drop us email to contact<at>3mdeb<dot>com. If you are interested in similar content feel free to sign up for our newsletter


Igor Bagnucki
Junior Embedded Systems Developer at 3mdeb. Passionate about wild tent trips in mountains and embedded software engineering.