Mike, an old buddy of mine is one of the best database application development consultants I have ever met. We worked together for the same company for a couple of years before I got into network analysis, but a couple of months ago I found out that there was going to be a conference in my home town where Mike was on the organization team. After a friendly banter on Twitter about him having to come to my city (Düsseldorf; which guys from Cologne like Mike don’t like ;-)) he told me that I should turn in a proposal for a talk. I said I could do that, but not on any database development topic – but maybe a generic network application performance talk might be interesting for those guys attending. So I did, and it got refused, despite Mike advocating for me. Darn.
Well, it’s a nice topic for a blog post nonetheless. So here we go.
Application developers vs. the network
In my experience, developers don’t care much about networks – it’s basically in almost all cases considered to be a close-to-zero delay link from the users desktop to the database engine on the server. Same as as the database would be running on localhost, right? And the bandwidth used by typical database application is not really that big, so even 100Mbit is often way more than enough.
Now, there are a couple of problems I often see when looking at how application developers work (and I usually ranted about that in the network analysis classes I taught):
- they have the fastest PCs/Workstations in the whole company, with tons of RAM, fast hard drives (or SSDs, these days), and multi core CPUs running at really high frequencies
- The development environment runs on the same machine with the database, making connections to localhost
Now, if an application is deployed on a customer network, these parameters almost never apply: the users have machines that are sometimes many years older than the development workstations, and run on 100MBit links in most cases. The database is located in a data center that may not even be located in the same building, but it often has 1GBit or faster back bone connectivity. As soon as the application is rolled out, users complain because it’s so slow. But how can that be?
A client with a database performance problem
I had one case a couple of years back where a client called me because his application just would not perform anymore after a environment change. He told me that:
- When everything was “okay”, the database had run on a server next to the application server (it was a multi-tier architecture, with desktop applications connecting to the application server, which in turn queried the database engine). “Okay” meaning: users did not complain THAT much. “next to” meaning, both servers were physically connected to the same switch.
- Someone decided to move the database server to a different data center, about 10 kilometers away, because there were experienced OS admins for that kind of server at that other location
- connectivity between original data center and new data center was a wireless microwave link of 34MBit, which was considered more than enough bandwidth for the amount of database traffic
Well, after the server had been moved, users complained that instead of waiting a couple of seconds for some activities they now had to wait minutes sometimes. When my customer tested the connection, it was as fast as expected, he said. Of course I was curious and asked him how he performed that test, and it was exactly what I had expected: he ran an FTP download from one data center to the other, confirming that the transfer would achieve almost 30 MBit of throughput. I told him that a test like that doesn’t help at all, because he compared a throughput based application to a transaction based application. We finally agreed I’d come to the customer site and capture some of the traffic to see what was going on.
Database communication scenarios
For demonstration purposes I decided to set up an database and take 5 minutes to write a small client application that queries some simple things. So I got myself a demo database, set up some MySQL servers and ran my demo application in various setups.What the application does is
- pull a complete list of employees
- query each of the first 10,000 employee’s salary records afterwards.
So we should see one query for the employee table in the capture, and a lot (well, 10,000) queries following the first query, asking for the salary of each employee. It’s not elegant, but it’s something I often see when looking at how database programmers code their queries. The problem with that kind of program structure is obviously that after the first request all salary requests are serialized, going back and forth between client and server. That adds 10.000 times the round trip time between the two machines as a minimum delay until all data is transferred.
Scenario 1: database running on localhost
In this scenario, my client and the database run on the same system. The problem with that kind of setup is that Wireshark can’t capture localhost traffic on Windows systems, so I had to use rawcap instead (the interface with index 2 being my localhost interface).
When I looked at the capture, the Initial Round Trip Time was exactly 1 millisecond. I would have expected it to be a lot less, but my guess is that the timings in localhost captures are not as precise as I thought they would be (trying to use the -f parameter to flush frames right away without buffering them first made things worse, dropping tons of frames and not helping the timings). Another symptom added to that suspicion, because I had quite a number of dropped frames in the capture (about 3.1% of all frames). Anyway, for a speed test like this the drops do not matter much.
I filtered on the actual queries by using “mysql.command==3”, which I took from the decode of the first actual query:
To see the query statement itself I also added a new custom column by right-clicking on the statement line and selecting “Apply as column” from the pop up menu. The filtered capture looks like this:You can see in packet 8982 that a unseen packet was acknowledged; this is an ACK for a dropped frame (one of the 3.1% I already mentioned). Anyway, what is interesting here is that the “Delta Time” between requests for the salary records is between zero and one millisecond. I think that this is again caused by how the rawcap recording process works – normally, I’d expect the timings to be less digital (“0 or 1”) with more and smaller numbers in between. Let’s just say the delay between queries is what we see in worst case, which is 1 millisecond.
My test client (and the trace duration) both state that the whole process took about 5.7 seconds from start to finish.
Scenario 2: local area network
As a next step, I reproduced the setup of client and server being on different machines, but connected to the same switch. Interesting fact: the TCP initial RTT was 0.389 milliseconds this time (a lot faster than the localhost iRTT), and no dropped frames at all – I guess rawcap has some issues capturing all traffic with precise timings on localhost. It’s still a good tool to have though, because there is no other way to grab localhost packets on Windows.
Anyway, here’s what the Wireshark capture looks like, filtered on the queries again:
The delay between the salary queries are about 800 microseconds in worst case, and the whole process took about 6.7 seconds, so only about a second more than localhost. Not too bad, even though waiting for almost 7 seconds for a result from a database can be painful – so it’s not exactly good, either, but that is caused by the way the program is coded.
Scenario 3: remote data center
For this scenario I set up the database on a server in a data center about 10-15 km away, with the lowest link being 32MBit between client and server. So it’s pretty close to that customer situation a couple of years ago.
I could tell right away that the story was completely different this time, since it took ages until the client program finished the queries. Looking at the trace, the initial RTT was about 12.9 milliseconds, which is not bad for a connection like that, going through a couple of routers and switches. The connection also had a couple of lost packets that had to be retransmitted, which is always to be expected to some extent on connections that leave the local network.
Here’s the capture:
This time, the delay between queries is about 15ms, which is more than 15 times more than the local area network test. 10.000 queries times 15 ms are 150 seconds. Guess what? The tool (and total trace duration) both reported 156 seconds for the whole process, which is about 2.5 minutes.
It is important to keep an eye on latency if a database (or any other transaction oriented application) is deployed, because even a delay of a couple of milliseconds can add up dramatically based on how queries are coded. There’s two solutions to a situation like that:
- move the clients and servers closer to each other to reduce latency
- replace old components with newer devices with less latency (but that only makes sense if the difference between old and new is significant)
- write more elegant code that does not blindly query large batches of information it may not even need – or design the database in a way that important things can be pulled in one request.
Back to my client: the problem was that the application could not be rewritten to pull data more efficiently over that kind of distance. In the end, the server was in fact moved all the way back to where it had been, and things got back to normal right away.