When you come to dev.splunk.com, you see pictures of beer pong, full bars, stuffed ponies with fart machines taped to their ass, etc – basically engineers gone wild. Somewhere between all of this insaneness, we actually find the time to write code and solve problems like this one.This post is all about a crazy-weird performance issue that we were experiencing, how it manifested itself and ultimately how it was fixed.
I suspect others may be having this problem, as the problem lives in some very popular open source code as far as I can tell. With that, I’ll begin telling you about my journey into hell.
Splunk has a home grown embedded HTTP(S) server that serves up all external interfaces to the ‘splunkd’ daemon. We use it as the core engine for our REST and XML/RPC-like API’s. The GUI and the CLI both end up talking to the daemon via this server.
When I wrote the core of it a few months ago, I ran some rudimentary performance tests on several platforms and it seemed decent enough for our use, but a week ago, the manager of the Search and Indexing team (Stephen) said that he was seeing abysmal performance using SSL. He said that the GUI performance was being impacted. I didn’t believe him and insisted that it was something else and that he was high.
So to prove to him that it wasn’t my server, or my problem like all engineers do, I gave him a small python script that hits the server in a tight loop and we checked the performance. It sucked. Continuing with the theme of “this isn’t my problem” – I told him it was probably the handler of the request that was doing something that made the server seem slow. This is when he laughed at me and said “watch this”: He proceeds to turn off SSL, re-run the same test and the performance of the server goes up by approximately 50X. 50 times faster! I know that SSL is slower than non-encrypted streams, but there was no way this was the problem. Whoa! We can’t ship this way. This needs to be fixed!
In fact, a very small HTTP request (approx. 80 byte) with a small reply (approx. 300 bytes) was operating at only 23 requests/sec! When he turned off SSL, he was getting over 1000 req/sec! What???
So, of course I tried the same test on my OSX laptop and I got 130+ req/sec – within the realm of reasonable and certainly better than 24. I then tried running the server on my laptop and the client on my Linux Fedora machine resulting in basically the same performance. Why does this work on my hardware and not his?
Finally, I switched the server and client by putting the server on my Linux box and the client on my Mac. I re-ran the test and damned if the performance didn’t completely suck! I was getting 20 or so request-replies per second over SSL.
But, why does the OS matter? I didn’t get it.
My SSL Performance Bug Diary
- Broke out ssldump. Here is a snippet from an OSX client and a Linux server. Note the third C->S line of .0398 seconds. This is the cause of the slowdown, but why?
- Spent 2 hours looking over every possible OpenSSL build option and try turning various ones on and off. No difference. (score: Bug 1, Rob 0)
- Spend many hours trying different crypto combinations. Little difference beyond the obvious and documented performance differences. (score Bug 2, Rob 0)
- Perhaps I need to throw in server-side SSL caching. I throw it in, with the assumption that the python client implements client-side SSL caching. No performance change. (score: Bug 3, Rob 0)
- Thinking it might be the Nagle algorithm, I modify my test to send larger requests and guess what? The performance is normal again! I try to find out exactly when it turns from slow to fast (as far as the request size) by trying request sizes of 1, 2, 4, 8, 16, 32……..16K bytes. Wow, just around 1300-1400 bytes is where the performance goes from sucks to fast. Look at the graph below. See the spike? Hmmm….. (score: Bug 3, Rob 1)
- I change the MTU on the server from the default of 1500 bytes to 1000 bytes. The performance cliff now is lowered to somewhere in the 800-900 byte range. The MTU is the key! (score: Bug 3, Rob 2)
- It’s got to be the Nagle algorithm. I try turning off the Nagle algorithm on the server. No performance change. (score: Bug 5, Rob 2)
- I give the problem to our performance engineer. He can reproduce it. I suck.
- Decide to try ssldump again and this time try a different test – curl sending the same size request as in the python test. I want to compare timings. BINGO. It’s not the server, it’s a combination of the server running on Linux and Python. (score: Bug 5, Rob 3). Notice in the following curl ssldump image, the single C->S line and the fast .0007 second timing. Contrast this to the previous ssldump image and here enlies the problem :
- Now to fix it. It really really seems like Python is the problem. I try it with urllib2. Same thing.
- I try it with httplib2. Same thing.
- I look at the code for urllib2 and httplib2 and guess what? They both use httplib. The problem must be in httplib. I dig into the code and start commenting shit out and looking at the resulting ssldump output to figure out *exactly* which write is causing the damage. I find the bug. (score: Rob wins)
The Problem and the Fix
I forgot to tell you that we are using Python 2.5. It turns out that httplib.py sends requests over the wire in 2 chunks. The first chunk is comprised of the HTTP headers. The second chunk is the body. The fix I made appends the body to the headers and sends the request in 1 chunk only. This is what curl does and this fixes the performance problems.
Here is the fix for download:
Here is my final data:
Things I Still don’t Understand
Because it seems to work and this took so damn long, I am not going to do any further investigations, but there are still many unsolved mysteries. Perhaps one of you can figure them out.
- Why the extreme falloff on linux where both the client and server are on the same machine at 16K request/reply size?
- Why is OSX so much slower than linux?
- Why does the new code speed up linux only?
- Notice that only the OSX box gets the speed up at the MTU, the Linux box continues the slow performance regardless of the MTU
Windows to Linux Performance Numbers (added 2/5/08)
So I added a Windows to Linux graph based on the first comment I received below. Yes, we do test with Windows, and yes, it is not out yet (but will be soon). The problem manifests itself exactly like it does on other platforms. Notice the difference:
Specs on the Test Hardware
- Dual Core, very fast, lots of Ram (will provide detailed specs in a bit)
- 3.4Ghz P4, Hyperthreaded, 2G Ram
- Mac Pro Laptop
- 1.8Ghz Pentium Core II duo (2 cores), 3G Ram