- Ocado Technology uses Java to successfully develop applications that require high performance.
- The use of Discrete Event Simulations allows development teams to analyse performance over long periods of time, without waiting for results.
- Deterministic software is essential for efficient debugging. Real-time systems are inherently non-deterministic, so Ocado Technology has worked hard to reconcile this inconsistency.
- Onion architecture emphasizes the separation of concerns within your application. Using this architecture, Ocado Technology is continually adapting and changing its applications with relative ease.
At Ocado Technology, we are using state-of-the-art robotics to power our highly automated fulfillment centres, and at our site in Erith, the largest automated warehouse for online grocery, we will eventually enlist over 3,500 bots to process 220,000 orders per week. If you haven’t seen our bots in action, check them out on our YouTube channel.
Our robots move at 4m/s and within 5mm of each other! To orchestrate our robot swarms and extract every bit of efficiency we can from our warehouses, we’ve developed a control system analogous to an air traffic control system.
We’ll walk through three of the typical decisions you need to make when starting to develop any application, and we’ll explain the language, development principles, and architecture choices we made for our control system.
Not everyone has the luxury to choose the programming language they use based purely on its technical merits and suitability to a particular problem. One oft-cited benefit of microservices and containerisation is the ability to adopt a polyglot development environment, but at many organisations other considerations have to be taken into account, such as:
- existing experience and expertise
- hiring considerations
- toolchain support
- corporate strategy
At Ocado Technology, we are heavily invested in Java – our control system is developed in Java. A common question we hear (and frequently ask ourselves!) is why are we using Java and not a language like C++ or, more recently, Rust. The answer – we are not only optimising our control system, but also the productivity of our developers, and this trade-off continually leads us to the use of Java. We choose to use Java because of its performance, development speed, evolving platform and recruitment. Let’s look at each of those factors in turn.
Some people believe that a Java program is “slower” than a comparable program written in C or C++, but this is a fallacy. Well-known examples of high-performance software written in Java, such as the LMAX Disruptor, prove what is possible. There are also many facets of application performance to take into consideration when comparing languages, for example executable size, start-up time, memory footprint and raw runtime speed. Finally, comparing the performance of a particular application across two languages is inherently difficult unless you are able to write the application comparably in both languages.
Whilst there are many recommended software practices to follow when developing high-performance applications in Java, the JVM’s Just-In-Time (JIT) compiler is likely the single most important factor in Java’s performance relative to other languages. By profiling the running bytecode and compiling suitable bytecode down to native code at run-time, the JIT compiler lets Java application performance get very close to that of a native application. Further, because a JIT compiler runs at the last possible moment, it has information available to it that an ahead-of-time (AOT) compiler cannot have, mainly the exact chipset on which the application is running and statistics about the actual application. With this information, a JIT compiler can perform optimisations that an AOT compiler couldn’t guarantee are safe, so a JIT compiler can actually outperform an AOT compiler in some cases.
Many factors make developing in Java faster than other languages:
- Because Java is a typed, high-level language, developers can focus on business problems and catch errors as early as possible.
- Modern IDEs provide developers a wealth of tools to write correct code the first time.
- Java has a mature ecosystem and there are libraries and frameworks for almost everything. Support for Java is almost ubiquitous across middleware technologies.
Java architect Mark Reinhold has stated that for twenty years, two of the biggest drivers for JVM development have been improvements in developer productivity and application performance. So over time, we’ve been able to benefit from gains in our first two concerns – performance and development speed – just by being on a constantly evolving and improving language and platform. For example, one of the observed performance improvements between Java 8 and Java 11 is the performance of the G1 garbage collector, which allows our control system more application time to perform computationally intensive calculations.
Last, but definitely not least for a growing company, being able to easily recruit developers is essential. In every index of popular languages, including Tiobe, GitHub, StackOverflow and ITJobsWatch, Java is always near or at the top. This position means we have a very large, global pool of developers from which to recruit the best talent.
After language choice, the second key decision we made in our system was the development principles or practices we adopted as a team to develop our application. The scale of decisions discussed here is akin to Jeff Bezos’s famous decision to make Amazon internally service oriented. These decisions are not easily changed, unlike a decision such as whether to use pair programming.
At Ocado Technology, we use three main principles to develop our control systems:
- Extensively simulating for testing and research
- Ensuring all our code can be run deterministically during R&D, and the same code can also run in a real-time context
- Avoiding premature optimisation
Wikipedia’s article on simulation describes it as:
A simulation is an approximate imitation of the operation of a process or system; the act of simulating first requires a model is developed.
Within the context of a robotic warehouse, we have many processes and systems we can simulate, such as our automation hardware, warehouse operatives performing business processes, or even other software systems.
Simulating these aspects of our warehouses provides two main benefits:
- We have increased confidence a new warehouse design will provide the throughput for which we’ve designed it.
- We are able to test and validate algorithmic changes within our software, without testing on physical hardware.
To get any meaningful results in the two simulation scenarios above, we often need to run simulations of many days or weeks of warehouse operation. We could choose to run our systems in real-time and wait many days or weeks for our simulations to complete, but this is highly inefficient and we can do better using a form of Discrete Event Simulation (DES).
A DES works under the assumption that a system’s state only changes upon the processing of an event. Given this assumption, a DES can maintain a list of events to process and, between the processing of events, is able to jump forward in time, to the time of the next event. It is this “time-travel” which allows DESs, in most cases, to run much faster than the equivalent real-time code. This fast feedback for our developers and warehouse design teams improves our productivity.
It is worth explicitly stating that to be able to use Discrete Event Simulation, we’ve had to architect our control systems to be event-based and ensure that no state changes as time passes. This architecture requirement leads into the next development principle we use – determinism.
Real-time systems, by nature, are non-deterministic. Unless your system is using a real-time OS, which provides strict scheduling guarantees, a large part of the non-deterministic behaviour can stem from the OS: its uncontrollable scheduling of events and the unpredictable processing time of each event.
Determinism is very important during the R&D of our control system, namely when we are running our simulations. Without determinism, if a non-deterministic error occurs, developers often have to resort to a mix of log trawling and ad-hoc testing in an attempt to reproduce the error, without any guarantee of actually being able to reproduce it. This can drain developers’ time and motivation.
Since real-time systems will never be deterministic, our challenge is to produce software that can run deterministically during a DES and also non-deterministically in real-time. We do this by using our own abstractions for time and scheduling.
The following code snippet shows the time abstraction, introduced to have control over the passage of time:
    @FunctionalInterface
    public interface TimeProvider {
        long getTime();
    }
Using this abstraction, we can provide an implementation that allows us to “time-travel” in our discrete event simulations:
    public class AdjustableTimeProvider implements TimeProvider {
        private long currentTime;

        @Override
        public long getTime() {
            return this.currentTime;
        }

        public void setTime(long time) {
            this.currentTime = time;
        }
    }
In our real-time, production environment we can replace this implementation with one that relies on the standard system call for getting the time:
    public class SystemTimeProvider implements TimeProvider {
        @Override
        public long getTime() {
            return System.currentTimeMillis();
        }
    }
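To illustrate why this abstraction is useful, here is a minimal sketch of a component that asks a TimeProvider for the time instead of calling System.currentTimeMillis() directly. The HeartbeatMonitor class, its timeout and the demo class are hypothetical names invented for this example, not code from our control system; the TimeProvider types are repeated only so that the sketch compiles on its own.

```java
// Repeated from the snippets above so that the sketch is self-contained.
interface TimeProvider {
    long getTime();
}

class AdjustableTimeProvider implements TimeProvider {
    private long currentTime;

    @Override
    public long getTime() {
        return currentTime;
    }

    public void setTime(long time) {
        this.currentTime = time;
    }
}

// Hypothetical component: reports "silent" once more than timeoutMillis
// have passed since the last heartbeat, according to the injected clock.
class HeartbeatMonitor {
    private final TimeProvider timeProvider;
    private final long timeoutMillis;
    private long lastHeartbeat;

    HeartbeatMonitor(TimeProvider timeProvider, long timeoutMillis) {
        this.timeProvider = timeProvider;
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeat = timeProvider.getTime();
    }

    void onHeartbeat() {
        lastHeartbeat = timeProvider.getTime();
    }

    boolean isSilent() {
        return timeProvider.getTime() - lastHeartbeat > timeoutMillis;
    }
}

class HeartbeatMonitorDemo {
    public static void main(String[] args) {
        AdjustableTimeProvider clock = new AdjustableTimeProvider();
        HeartbeatMonitor monitor = new HeartbeatMonitor(clock, 1_000);

        clock.setTime(500);
        System.out.println(monitor.isSilent()); // false: only 500 ms have "passed"

        clock.setTime(1_501);
        System.out.println(monitor.isSilent()); // true: the timeout elapsed without any sleeping
    }
}
```

Because the monitor never reads the system clock itself, the same class could be handed a SystemTimeProvider in production, while a test or simulation can move time forward instantly and get the same result on every run.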
For scheduling, we’ve also introduced our own abstraction and implementations, rather than rely on the Executor or ExecutorService interfaces within Java. We’ve done this because the Java executor interfaces don’t provide the deterministic guarantees we require. We’ll explore the reasons why later in the article:
    public interface Event {
        void run();
        void cancel();
        long getTime();
    }

    public interface EventQueue {
        Event getNextEvent();
    }

    public interface EventScheduler {
        Event doNow(Runnable r);
        Event doAt(long time, Runnable r);
    }

    public abstract class DiscreteEventScheduler implements EventScheduler {
        private final AdjustableTimeProvider timeProvider;
        private final EventQueue queue;

        public DiscreteEventScheduler(AdjustableTimeProvider timeProvider, EventQueue queue) {
            this.timeProvider = timeProvider;
            this.queue = queue;
        }

        private void executeEvents() {
            Event nextEvent = queue.getNextEvent();
            while (nextEvent != null) {
                // Jump the clock straight to the time of the next event.
                timeProvider.setTime(nextEvent.getTime());
                nextEvent.run();
                nextEvent = queue.getNextEvent();
            }
        }
    }

    public abstract class RealTimeEventScheduler implements EventScheduler {
        // Real time comes from the system clock, not from an adjustable one.
        private final TimeProvider timeProvider = new SystemTimeProvider();
        private final EventQueue queue;

        public RealTimeEventScheduler(EventQueue queue) {
            this.queue = queue;
        }

        private void executeEvents() {
            Event nextEvent = queue.getNextEvent();
            while (true) {
                if (nextEvent == null) {
                    // Nothing queued yet (getNextEvent() returns null when empty): keep polling.
                    nextEvent = queue.getNextEvent();
                } else if (nextEvent.getTime() <= timeProvider.getTime()) {
                    nextEvent.run();
                    nextEvent = queue.getNextEvent();
                }
            }
        }
    }
In our DiscreteEventScheduler you can observe the line timeProvider.setTime(nextEvent.getTime()) which represents the time travel described above.
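To make the effect of that time travel concrete, here is a small, self-contained sketch of a discrete event loop. It is a simplified stand-in, not our production scheduler: the class names, the PriorityQueue-backed queue, the insertion-order tie-break and the hourly task are all assumptions made for this illustration.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Simplified, hypothetical stand-in for the discrete event scheduler above.
class SimpleDiscreteEventScheduler {
    private static final class TimedEvent {
        final long time;
        final long sequence; // insertion order, used to break ties deterministically
        final Runnable action;

        TimedEvent(long time, long sequence, Runnable action) {
            this.time = time;
            this.sequence = sequence;
            this.action = action;
        }
    }

    private final PriorityQueue<TimedEvent> queue = new PriorityQueue<>(
            Comparator.comparingLong((TimedEvent e) -> e.time)
                    .thenComparingLong(e -> e.sequence));
    private long currentTime;
    private long nextSequence;

    long getTime() {
        return currentTime;
    }

    void doAt(long time, Runnable action) {
        queue.add(new TimedEvent(time, nextSequence++, action));
    }

    // Runs events in time order until none are left,
    // jumping the simulated clock to each event's time.
    void run() {
        TimedEvent next;
        while ((next = queue.poll()) != null) {
            currentTime = next.time;
            next.action.run();
        }
    }
}

class DiscreteEventDemo {
    static final long HOUR_MILLIS = 60L * 60 * 1000;
    static final long WEEK_MILLIS = 7 * 24 * HOUR_MILLIS;

    // Simulates one week in which a task re-schedules itself every hour,
    // and returns how many times the task ran.
    static int simulateWeek() {
        SimpleDiscreteEventScheduler scheduler = new SimpleDiscreteEventScheduler();
        int[] runs = {0};
        Runnable[] hourly = new Runnable[1];
        hourly[0] = () -> {
            runs[0]++;
            long nextTime = scheduler.getTime() + HOUR_MILLIS;
            if (nextTime < WEEK_MILLIS) {
                scheduler.doAt(nextTime, hourly[0]);
            }
        };
        scheduler.doAt(0, hourly[0]);
        scheduler.run();
        return runs[0];
    }

    public static void main(String[] args) {
        System.out.println(simulateWeek()); // prints 168: one run per simulated hour, with no real waiting
    }
}
```

The week of simulated time completes as fast as the events themselves can be processed, and because ties are broken by insertion order, two runs with the same inputs execute the events in the same order.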
Our RealTimeEventScheduler is an example of a busy-loop. This technique is usually discouraged because it wastes CPU time on useless activity. So why do we use a busy-loop scheduler within our control system? We’ll explore that next.
Every software developer is surely familiar with the quote from Donald Knuth:
“Premature optimization is the root of all evil.”
But how many people know the full quote from which this is taken?
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Within our warehouse control system, we are after those 3% of opportunities that allow our system to perform as optimally as possible! The previous busy-loop scheduler is one of those opportunities.
Because of the soft real-time nature of our system, we have the following requirements for our event scheduler:
- Events need to be scheduled for specific times.
- Individual events can’t be arbitrarily delayed.
- The system can’t allow the events to arbitrarily back up.
Initially, we chose to implement the simplest, most idiomatic Java solution, based on the ScheduledThreadPoolExecutor. This solution, by nature, meets the first requirement. To determine whether it satisfied our second and third requirements, we used our simulation capability to thoroughly performance test the solution. Our simulations allow us to run our control system at full warehouse volume over many days, to test the application behavior – usually well before any warehouse is actually running at its full volume. This testing revealed that the ScheduledThreadPoolExecutor-based solution was unable to support the necessary warehouse volume. To understand why this solution was insufficient, we turned to profiling our control system, which highlighted two areas for focus:
- The moment an event is scheduled
- The moment an event is ready to be executed
Starting with the time when our event is scheduled, the ThreadPoolExecutor JavaDoc lists three queuing strategies:
- Direct handoffs
- Unbounded queues
- Bounded queues
A look at the internals of ScheduledThreadPoolExecutor shows that it uses a custom, unbounded queue, and the ThreadPoolExecutor JavaDoc says of unbounded queues:
While this style of queuing can be useful in smoothing out transient bursts of requests, it admits the possibility of unbounded work queue growth when commands continue to arrive on average faster than they can be processed.
This tells us that our third requirement can be violated, as events can back up in the unbounded work queue.
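The backlog is easy to reproduce in isolation. The following sketch is not taken from our control system; the class and method names are hypothetical. It blocks the only worker thread of a ScheduledThreadPoolExecutor, standing in for a slow event, and then counts how many newly scheduled tasks are left waiting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BacklogDemo {
    // Schedules `events` tasks while the single worker is busy and
    // returns how many of them are sitting in the work queue.
    static int queuedWhileBlocked(int events) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Runnable noOp = () -> { };
        try {
            // Occupy the only worker thread until we release it.
            executor.execute(() -> {
                workerBusy.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workerBusy.await();

            for (int i = 0; i < events; i++) {
                executor.schedule(noOp, 0, TimeUnit.MILLISECONDS);
            }
            // Nothing rejects or throttles the submissions: they all queue up.
            return executor.getQueue().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(queuedWhileBlocked(10_000)); // prints 10000
    }
}
```

Every submission is accepted, so the only limit on the backlog is available memory, and each queued event is delayed by however long the events ahead of it take.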
We turn again to the JavaDocs to understand the behaviour of the thread pool when a new event is ready to be executed. Depending on your thread pool configuration, it is possible that a new thread could be created for the event to be executed in. Again, from the ThreadPoolExecutor JavaDoc:
If fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. Else if fewer than maximumPoolSize threads are running, a new thread will be created to handle the request only if the queue is full.
Thread creation takes time, which means our second requirement may also be violated.
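The quoted behaviour can be observed with a few lines of code. In this sketch (hypothetical class and method names, using a plain ThreadPoolExecutor because that is the class the quote describes), tasks are run strictly one after another, so an idle worker is always available, yet the pool still creates a new thread for each task until it reaches corePoolSize.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ThreadCreationDemo {
    // Runs `tasks` no-op tasks one after another and records
    // the pool size observed after each task has completed.
    static int[] poolSizesAfterSequentialTasks(int corePoolSize, int tasks) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                corePoolSize, corePoolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        Runnable noOp = () -> { };
        int[] sizes = new int[tasks];
        try {
            for (int i = 0; i < tasks; i++) {
                // Wait for completion, so all existing workers are idle
                // before the next task is submitted.
                executor.submit(noOp).get();
                sizes[i] = executor.getPoolSize();
            }
            return sizes;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4, 4, 4]
        System.out.println(java.util.Arrays.toString(poolSizesAfterSequentialTasks(4, 6)));
    }
}
```

One common mitigation is to call prestartAllCoreThreads() at start-up so that this cost is paid before any events arrive, although that does nothing about the unbounded queue.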
It is all well and good theorizing about what might go wrong in your application, but until you thoroughly test it, you won’t know whether your chosen solution performs adequately. By re-running the same set of simulation tests with a busy-loop scheduler, we observed lower latency for individual events (from <5ms down to effectively 0) and up to 3x higher event throughput, and the busy loop met all three of our event-scheduling requirements.
The subject of our final decision, architecture, means different things to different people.
To some, architecture refers to the implementation choices, such as:
- Monolith or microservices
- ACID transactions or eventual consistency (or more naively, SQL vs NoSQL)
- Event sourcing or CQRS
- REST or GraphQL
The implementation decisions made at the beginning of an application’s life are usually valid at that point in time. But as an application lives on, with features added and complexity inevitably increased, these decisions have to be revisited again and again.
To others, architecture is concerned with how you structure your code and application. If you acknowledge that these implementation decisions will change, then a good architecture ensures these changes can be made as easily as possible. One way we’ve achieved this is to follow Onion Architecture, which emphasizes the separation of concerns within your application.
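As a rough illustration of that separation, the sketch below keeps the domain logic in an inner layer that depends only on an interface it owns, while a storage implementation lives in an outer layer. All names here (OrderStore, OrderService, InMemoryOrderStore) are hypothetical and chosen for the example; they are not taken from our codebase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// --- Inner layer: domain logic and the interfaces it needs. ---
// Nothing here refers to a database, message broker or framework.
interface OrderStore {
    void save(String orderId, int itemCount);

    Optional<Integer> findItemCount(String orderId);
}

class OrderService {
    private final OrderStore store;

    OrderService(OrderStore store) {
        this.store = store;
    }

    void placeOrder(String orderId, int itemCount) {
        if (itemCount <= 0) {
            throw new IllegalArgumentException("An order needs at least one item");
        }
        store.save(orderId, itemCount);
    }

    boolean isKnown(String orderId) {
        return store.findItemCount(orderId).isPresent();
    }
}

// --- Outer layer: an implementation detail that depends inwards. ---
// It could be replaced by a database-backed store without touching OrderService.
class InMemoryOrderStore implements OrderStore {
    private final Map<String, Integer> orders = new HashMap<>();

    @Override
    public void save(String orderId, int itemCount) {
        orders.put(orderId, itemCount);
    }

    @Override
    public Optional<Integer> findItemCount(String orderId) {
        return Optional.ofNullable(orders.get(orderId));
    }
}

class OnionDemo {
    public static void main(String[] args) {
        OrderService service = new OrderService(new InMemoryOrderStore());
        service.placeOrder("order-1", 3);
        System.out.println(service.isKnown("order-1")); // true
        System.out.println(service.isKnown("order-2")); // false
    }
}
```

Because dependencies only point inwards, revisiting an implementation decision means writing a new outer-layer class rather than changing the domain logic.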
Development principles often influence the architecture you choose. Our development principles have directed our architecture in a number of ways:
- Discrete event simulation required us to implement an event-based system.
- Enforcing determinism caused us to implement our own abstractions, rather than rely on standard Java abstractions.
- By avoiding premature optimisation and starting simply, our application started life as a single, deployable artefact. Over many years, the application has grown into a monolith, which is still serving us well. We continually assess whether “now” is the time to optimise and re-factor to a different structure.
Consider Change in your System Design
If you are a system designer or software architect responsible for deciding which programming language to implement a high-performance system in, this article serves as evidence that Java is a key contender against the more “obvious” languages like C, C++ or Rust. If you are a Java programmer, this article has shown you an example of what is possible with the Java language.
The next time you design a system, think about the principles and decisions you are making at the beginning of the project that will be extremely difficult or impossible to change. For us, these are our use of simulation and our focus on determinism. For aspects of the system that might change, choose an architecture, such as Onion Architecture, that keeps the possibility of change open and easy.