Architect Blog
How fast is it, really?
On latency, measurement, and optimization in algorithmic trading systems
Brett Harrison, September 5, 2023

"The speed of light sucks." - John Carmack

Software engineers within the world of low-latency automated trading (colloquially known as "high-frequency trading" or HFT) obsess over speed. From purchasing private bandwidth between microwave towers to analyzing x86 instructions from different compiler versions, those with experience in this industry have seen colossal time and expense committed to the problem of optimizing code and network paths for minimal execution times.

But how does one actually measure how fast a program is? To the uninitiated, it sounds like a simple task. However, there are many layers of complexity in measuring the true latency of a trading system, or even in defining what to measure in the first place. Understanding latencies in algorithmic trading systems can present Heisenberg-esque dilemmas, where the more code you write to measure latency, the more overhead you add to your program and therefore the more inaccurate your measurements become.

At Architect we have been using a combination of techniques to measure the latency of the various codepaths and processes that comprise our institutional trading technology suite. Let's explore some solutions to these problems.

--

Let's say you've written an algorithmic trading strategy. Your strategy reacts to market trades in an instrument, perhaps by computing a proprietary model valuation, and sends an order in that instrument provided certain conditions are met. You would like to measure the time that this reaction takes, so that you can reduce it as much as possible. Let's use Python-style pseudocode to describe the program (although in practice it is most common to use languages like C, C++, and Rust where optimal program latencies are required):

def on_market_trade(self, instrument, market_trade):
    model_value = self.compute_model_value(instrument, market_trade) 
    order = self.compute_order_decision(instrument, model_value) 
    if order is not None:
        self.send_order(order)

A reasonable place to start in understanding the latency of your critical codepath is to wrap timers around the functions doing the heavy lifting:

def on_market_trade(self, instrument, market_trade):
    # Assumes `from datetime import datetime` at module level.
    start_time = datetime.now()
    model_value = self.compute_model_value(instrument, market_trade)
    order = self.compute_order_decision(instrument, model_value)
    end_time = datetime.now()
    self.add_time_sample(end_time - start_time)
    if order is not None:
        self.send_order(order)

The function self.add_time_sample would add the elapsed time to a histogram that you could print statistics for at the end of your program's lifecycle, or on some regular basis based on time or number of samples observed. There are many issues with the above approach:

  1. It measures the time required to compute the order decision, but does not include the time it takes to send the actual order.
  2. It observes computation time on every market trade, rather than just the trades that result in orders -- this can bias results because the most interesting times to send orders may be the ones where your program is running the slowest due to volume of market events or other factors.
  3. datetime.now() itself is a slow, expensive function that can impact the runtime speed and memory profile of the code above, which adds up if your program is already operating on a microsecond timescale. The typical way to fix this last issue is to use native performance counters, which most programming languages provide primitives to access.
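To get a rough feel for the third point on your own machine, you can time the clocks themselves. The sketch below is only an illustration: per_call_overhead_ns is a hypothetical helper name, the figure it returns also includes the cost of the Python loop, and the numbers will vary by interpreter, OS, and hardware.

```python
import time
from datetime import datetime


def per_call_overhead_ns(clock, calls=100_000):
    """Estimate the average cost of one call to `clock`, in nanoseconds.

    The result includes the overhead of the Python for-loop, so treat it
    as an upper bound that is useful for comparing clocks, not as an
    exact per-call cost.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        clock()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print("datetime.now():        ", per_call_overhead_ns(datetime.now), "ns/call")
    print("time.perf_counter_ns():", per_call_overhead_ns(time.perf_counter_ns), "ns/call")
```

Beyond raw cost, time.perf_counter_ns() returns an integer number of nanoseconds from a monotonic clock, whereas datetime.now() reads the wall clock, which can be adjusted while your program runs.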

Here's a new code sample that attempts to fix the above:

def on_market_trade(self, instrument, market_trade):
    start_time = time.perf_counter_ns()
    model_value = self.compute_model_value(instrument, market_trade) 
    order = self.compute_order_decision(instrument, model_value) 
    if order is not None:
        self.send_order(order) 
        end_time = time.perf_counter_ns() 
        self.add_time_sample(end_time - start_time)
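The self.add_time_sample helper used in these samples is never defined. As a minimal sketch of what it might look like, the hypothetical LatencyRecorder class below keeps raw nanosecond samples and reports nearest-rank percentiles; the class name, the report_every parameter, and the choice of statistics are all assumptions for illustration. A production system would more likely use fixed-size histogram buckets so that recording a sample never allocates.

```python
import math


class LatencyRecorder:
    """Hypothetical helper: stores latency samples (ns) and summarizes them."""

    def __init__(self, report_every=10_000):
        self.samples_ns = []
        self.report_every = report_every  # assumed reporting interval, in samples

    def add_time_sample(self, elapsed_ns):
        """Record one sample and print a summary every `report_every` samples."""
        self.samples_ns.append(elapsed_ns)
        if len(self.samples_ns) % self.report_every == 0:
            print(self.summary())

    def summary(self):
        """Return count, median, 99th percentile, and max of the samples."""
        ordered = sorted(self.samples_ns)
        count = len(ordered)
        if count == 0:
            return {"count": 0}

        def percentile(p):
            # Nearest-rank percentile: the smallest sample such that at
            # least p percent of all samples are less than or equal to it.
            rank = max(1, math.ceil(p * count / 100))
            return ordered[rank - 1]

        return {
            "count": count,
            "p50_ns": percentile(50),
            "p99_ns": percentile(99),
            "max_ns": ordered[-1],
        }
```

Reporting tail percentiles alongside the median matters here because, as noted earlier, the slowest samples may be exactly the ones that coincide with the most interesting orders.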

This is an improvement, but are we really getting at the full latency of the trading system? The above doesn't include significant elements of the critical path, including the time to parse the market trade update, or anything involving network I/O. Let's take a step back and trace a market data update through the complete critical path of the automated trading system (ATS):

  1. Network packet containing the market trade hits the network card of the box where the ATS is running (sent from the exchange)
  2. The packet is passed to the runtime of the ATS
  3. The ATS parses the bytes of the packet to pull out necessary fields (such as trade price or trade size)
  4. The ATS computes a model value and makes a decision to send an order
  5. The internal memory representation of the order is converted to the protocol of the exchange that the order is being sent to
  6. The ATS makes function calls to pass the order bytes to the network card of the box for sending
  7. The network card of the box sends the order bytes to the exchange

(There are many details missing from the above, such as the multiple methods of going from steps 1 to 2 and 6 to 7, but we are omitting those for simplicity for now.)

The code sample above is only measuring steps 4, 5, and 6. I have seen many real-world instances where 90% or more of the full latency profile was present in steps 1, 2, 6, and 7. A large chunk of latency could also be incurred in step 3 if the parsing is done carelessly, or if any order-book building is necessary in steps 3 and 4.

To truly capture all seven steps, you can set up this alternative method for measuring latency:

  • Write a program that simulates the exchange market data, by sending random market trade events on a timer
  • Have that same program simulate the exchange itself by receiving orders in the exchange's native protocol
  • Have the simulator timestamp the market trades with the current time right before sending
  • Configure the ATS to receive data from the simulator and send orders to the simulator. Have the ATS attach the exchange trade timestamp to the order it sends to the exchange or, if that's not possible in the protocol, have it record a mapping from order id to exchange trade timestamp
  • Have the simulator write down the timestamp when it receives orders from the ATS. From either the data on the order itself or from the ATS's mapping of order id to exchange trade timestamp, compute the difference between market trade send time and order receive time.
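The bookkeeping in the steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not a real harness: the TRADE and ORDER message layouts are invented rather than any exchange's protocol, the "ATS" is a thread in the same process that sends an order for every trade, trades are sent in lock-step rather than on a timer, and everything runs over loopback UDP. Note that both timestamps are taken by the simulator, so only one clock is ever compared against itself.

```python
import random
import socket
import struct
import threading
import time

TRADE = struct.Struct("!qdi")  # hypothetical trade: send time (ns), price, size
ORDER = struct.Struct("!qq")   # hypothetical order: order id, echoed trade time (ns)


def run_toy_ats(ats_sock, n_trades):
    """Stand-in for the ATS: parse each trade, reply with an order that
    carries the trade's timestamp."""
    for order_id in range(n_trades):
        data, simulator_addr = ats_sock.recvfrom(64)
        sent_ns, price, size = TRADE.unpack(data)
        # A real ATS would compute a model value and make a decision here.
        ats_sock.sendto(ORDER.pack(order_id, sent_ns), simulator_addr)


def measure_tick_to_order(n_trades=1000):
    """Simulator: send timestamped trades, receive orders, and return the
    trade-send to order-receive latencies in nanoseconds."""
    sim_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ats_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for sock in (sim_sock, ats_sock):
        sock.bind(("127.0.0.1", 0))  # let the OS pick a free port
        sock.settimeout(5.0)         # fail instead of hanging on a lost packet
    ats = threading.Thread(target=run_toy_ats, args=(ats_sock, n_trades))
    ats.start()
    latencies_ns = []
    try:
        for _ in range(n_trades):
            price = random.uniform(99.0, 101.0)
            sent_ns = time.perf_counter_ns()  # stamp right before sending
            sim_sock.sendto(TRADE.pack(sent_ns, price, 10), ats_sock.getsockname())
            data, _ = sim_sock.recvfrom(64)
            received_ns = time.perf_counter_ns()  # stamp on order receipt
            _order_id, echoed_ns = ORDER.unpack(data)
            latencies_ns.append(received_ns - echoed_ns)
    finally:
        ats.join()
        sim_sock.close()
        ats_sock.close()
    return latencies_ns
```

Because the simulator and the toy ATS share one Python process here, the absolute numbers say little about a real system; the point is only where the timestamps are taken and how they are correlated.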

While the above does capture the full critical path, it overestimates the latency of the ATS: it also captures a similar codepath in the simulator itself! To get closer to the right answer, you can write another simulated exchange and a simulated ATS that just ping-pong a single timestamp back and forth without any protocol translation, model building, order sending, etc. This provides a baseline for inter-program latency that can be subtracted from the results of the above experiment.
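A baseline of this kind might be sketched as follows, again over loopback UDP within a single process. The function names and the choice of the median as the baseline statistic are assumptions for illustration; you may prefer to subtract a minimum or compare full distributions instead.

```python
import socket
import statistics
import struct
import threading
import time

STAMP = struct.Struct("!q")  # a single nanosecond timestamp, nothing else


def echo_pings(sock, n_pings):
    """Baseline stand-in for the ATS: bounce each datagram back unparsed."""
    for _ in range(n_pings):
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)


def baseline_round_trip_ns(n_pings=1000):
    """Median round trip (ns) of a bare timestamp ping/pong over loopback UDP."""
    ping = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pong = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for sock in (ping, pong):
        sock.bind(("127.0.0.1", 0))  # let the OS pick a free port
        sock.settimeout(5.0)         # fail instead of hanging on a lost packet
    echo = threading.Thread(target=echo_pings, args=(pong, n_pings))
    echo.start()
    samples_ns = []
    try:
        for _ in range(n_pings):
            ping.sendto(STAMP.pack(time.perf_counter_ns()), pong.getsockname())
            data, _ = ping.recvfrom(64)
            received_ns = time.perf_counter_ns()
            samples_ns.append(received_ns - STAMP.unpack(data)[0])
    finally:
        echo.join()
        ping.close()
        pong.close()
    return statistics.median(samples_ns)


def subtract_baseline(measured_ns, baseline_ns):
    """Remove the inter-program baseline from each simulator-measured latency."""
    return [m - baseline_ns for m in measured_ns]
```

For example, subtract_baseline(latencies, baseline_round_trip_ns()) would adjust a list of simulator-measured latencies, provided the baseline was taken over the same transport and on the same machine as the original experiment.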

The as-close-as-possible-to-perfect solution involves a much more advanced setup, where modern switching hardware is used to replicate packet traffic in and out of the box and the raw network packets are parsed and correlated for timestamping. But I'll leave those details for a future post.

Writing fast algorithmic trading system code is hard. Measuring it properly is even harder. At Architect we have created institutional-grade low-latency trading software for both regulated derivatives and digital assets, so that you can let us do the work for you.

Trading futures and options involves substantial risk of loss and is not suitable for all investors. Performance is not necessarily indicative of future results.
© 2023 Architect Financial Technologies, Inc.