Case Study: Load Testing a Raw TCP Service
Last updated: November 2017
A customer approached us wanting to load test a network service via a proprietary binary TCP protocol. The customers use case involved thousands of simultaneous connections from mobile IoT (Internet of Things) devices.
We decided that this was a perfect opportunity to use Legion.
The protocol called for sequential request/response communication with no pipelining or interleaving of messages. There were multiple request and response types which had varying sizes and some fields of every request needed to be unique or dynamic. There seemed to be at least some server-side state being maintained for each connection and requests needed to be consistent with that state. The customer expected us to measure the time needed to receive a response, to measure bandwidth utilization, and to report statistics for every individual event.
At the time, Legion was still relatively new. We needed ad-hoc solutions for certain problems that are now handled within the standard logic of the framework. For example, tracking every individual event has never been one of Legion’s core design goals, and Legion at the time had no health self-assessments (it does now).
The primary challenge for this project, in our minds, was always the fact that it involved a proprietary binary protocol. By definition, it was not possible for us to have any experience with the customer’s protocol before the customer approached us. This necessarily introduced a certain amount of uncertainty into the project schedule, as we had to assume there would be stumbling blocks which we could not possibly anticipate or plan for. As it happened, there were certain decisions we made early in the project that gave us extra adaptability when issues did crop up.
Working with node-struct represented a modest extra effort in the opening stage of the project. It would have been possible to directly manipulate dynamic values in their raw binary form. In practice, node-struct paid considerable dividends over time. The ability to quickly and reliably convert the binary data to a human-readable format meant that we were able to quickly and thoroughly understand bugs or mistakes in the binary messages, and also gave us an enormous flexibility to iterate and adapt to rapidly meet customer needs.
With the ability to generate and edit messages, it was then necessary to build a rudimentary promise-based client API on top of Node’s TCP implementation. With the this API in hand the remaining work would be trivial, as Legion in it’s simplest mode of operation simply measures the time needed to fulfill a promise. This work was straightforward but detail-oriented, and will be covered in a separate article. ((TODO: LINK TO ARTICLE))
Finally, we implemented the test case logic. Each virtual user connected to the server, sent an initial connection request, and then sent a continuous series of requests selected from a pool of random possibilities. The test case itself was only about one page of code. We were able to modify the test case on-the-fly to answer specific questions posed by the customer.
The total time to complete the project was about 2 weeks. At that time we delivered all relevant assets to the customer for their ongoing use.
Following up on our experience with this project, we added a number of new capabilities. It’s now much easier to deploy large numbers of load generating instances, and it’s now possible to control these instances remotely at run time. We made it easier to report custom metrics. We’re also in the process of improving our ability to visualize and interpret the data that Legion produces.