Whether they're written down or part of a company's oral traditions, every software engineering team has guidelines its developers use to direct their efforts. These are the ones we use at One More Game.
Why do we need engineering tenets?
It’s useful to build prototypes to discover whether our designs are feasible. We do that by rapidly developing code that ignores error handling and other essential elements of good software design, such as readability, reliability, performance, security, maintainability, and testability. If only we could ship our proof-of-concept work, development would go so much faster!
Unfortunately for players, writing code rapidly without a focus on quality can become habitual, and we can lose sight of the costs associated with the technical debt created:
- High rates of player attrition
- Serious security defects
- Unplanned service outages
- Excessive operational costs
- Uneconomical customer support
- Labor-intensive product improvement
Allowing these problems to exist leads to the worst possible result: right after game launch the dev team is busy fixing bugs instead of adding content & features to reward player interest.
Public beta testing is not the answer. In the early years of online games players were so happy to play with their friends they would look past defects. Today’s players will not accept the frustrations in a game when so many others beckon. And once players’ attention is gone it may be impossible to reinvigorate their desire to play.
If you're not scared yet, remember that the first players who join -- the golden cohort -- play more, convert better, pay more, and retain longer than other players ... and they’re going to experience the least polished version of our game. We must avoid losing their trust by making sure our game is great early -- not a year after launch -- because they’re our best customers.
What are our development goals?
- Rapid iteration is our highest priority because it enables us to solve other problems more quickly, whereas slowness reduces our opportunities to test with players before launch, increases the expense of fixing bugs, and decreases our ability to develop new content after launch. Aim to reduce the time it takes to build and run on our local computers, as well as the time to build, test, deploy, and release to players.
- Frequent releases are essential to evaluate game changes. What is frequent? At a previous company we averaged 17 builds per day to our players -- roughly 17,000 builds over four years -- prior to public launch. The time from start-of-build to in-the-game was 3-5 minutes for code and design changes, while changes to art, sound, and content took 20-25 minutes (not bad; we could have done better).
- Maintaining a high level of reliability enables players to have fun; they don't want to play "hunt the bug" when they could be enjoying a game.
With these goals in mind, here are guidelines we've created to help us achieve success.
Engineering Tenets
Development Methodology
- We use prototyping to determine whether something is feasible, but when we're writing code with the intent to release to players we aim for a higher standard of quality.
- We prefer frequent, incremental releases of reliable code over feature-completeness. Since there is a trade-off between scope (the size and complexity of a system) and reliability (how well code handles edge-cases and error conditions), we look to reduce scope.
- We fix our bugs before writing new code. The time to fix bugs is unknowable: until we know the cause, we cannot accurately estimate time-to-fix, and that means any schedule estimates would be meaningless. See also: https://www.ministryoftesting.com/dojo/lessons/ten-reasons-why-you-fix-bugs-as-soon-as-you-find-them.
- We aim to document our technical debt so it’s not a mystery later by adding `MY-NAME-HERE todo: EXPLANATION`, and by creating tasks for significant issues. Technical debt includes lack of error-handling, input validation, security checks, and other issues that prevent public launch. Adding notes to personal todo-lists and commit comments is insufficient because they’re too easy to miss.
- We merge frequently to `main` because it makes integration easier and reduces the time until we can get feedback from players. We use feature-flags to defer public release until we’re ready, and to facilitate disabling broken features in production.
- When errors occur we don't blame the developers; we aim to fix and educate. We seek the source of the problem to correct the issue, then take steps to avoid the same mistake in the future.
- We endeavor to experience the pain of teammates and players. By understanding their troubles we can prioritize solutions that speed our development and reduce their suffering.
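To make the feature-flag tenet above concrete, here is a minimal sketch, assuming a simple in-memory flag table. The `FeatureFlags` class, the flag names, and the environment names are all hypothetical and chosen only for illustration. The idea is that unknown flags are off everywhere, so code merged to main stays hidden until someone deliberately enables it, and a broken feature can be switched off without a new build.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical flag table: flag name -> environments where the flag is on."""
    enabled_in: dict[str, set[str]] = field(default_factory=dict)

    def is_enabled(self, flag: str, environment: str) -> bool:
        # Unknown flags are off everywhere, so unfinished code stays dark.
        return environment in self.enabled_in.get(flag, set())

    def disable(self, flag: str, environment: str) -> None:
        # Kill switch: turn a broken feature off in one environment.
        self.enabled_in.get(flag, set()).discard(environment)


# Example flag names are assumptions, not real features.
flags = FeatureFlags({
    "guild_chat": {"dev", "staging"},                  # merged, not yet released
    "daily_quests": {"dev", "staging", "production"},  # released to players
})
```

A caller would guard the new code path with something like `if flags.is_enabled("guild_chat", environment):`. A real system would likely load the table from a service or config store so that it can change at runtime.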
Systems Architecture
- When we create systems we estimate direct costs (development and operational expenses) and indirect costs (user attrition, customer support, future maintenance) to understand the burdens we're creating, and perform research, design, prototyping, and peer review to alleviate these costs.
- We seek to use omakase ("chef decides") instead of building for every possible use-case, because we know simple systems can be extended, whereas systems driven by extensive configuration are expensive due to their larger scope.
- We build for multiplayer first. Single-player code is an easy special case of multiplayer code: just reduce the number of players to one. The reverse is not true -- we’ve learned that single-player code must frequently be rewritten entirely to work in multiplayer.
- We follow the mantra of “practice like you play” so as to discover problems before players see them. For example, we run our code on remote servers instead of local ones, and add simulated latency to our network code, so we’re running under conditions similar to those our players experience.
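One way to add simulated latency, as mentioned above, is to hold each message in a queue until a simulated delivery time has passed. The following is a rough sketch under stated assumptions, not a description of any particular networking stack: the `SimulatedLink` class, its parameters, and the millisecond clock passed in by the caller are all hypothetical.

```python
import heapq
import random


class SimulatedLink:
    """Hypothetical wrapper that delays messages to mimic real network latency."""

    def __init__(self, base_ms: float, jitter_ms: float, seed: int = 0):
        self.base_ms = base_ms
        self.jitter_ms = jitter_ms
        self.rng = random.Random(seed)  # seeded so test runs are repeatable
        self.in_flight = []             # min-heap of (deliver_at_ms, sequence, payload)
        self.sequence = 0               # tie-breaker for equal delivery times

    def send(self, now_ms: float, payload: bytes) -> None:
        # Each message gets the base delay plus a random amount of jitter.
        delay = self.base_ms + self.rng.uniform(0.0, self.jitter_ms)
        heapq.heappush(self.in_flight, (now_ms + delay, self.sequence, payload))
        self.sequence += 1

    def receive(self, now_ms: float) -> list[bytes]:
        """Return every payload whose simulated delivery time has passed."""
        ready = []
        while self.in_flight and self.in_flight[0][0] <= now_ms:
            ready.append(heapq.heappop(self.in_flight)[2])
        return ready
```

Because of the jitter, messages can arrive in a different order than they were sent, which is one of the conditions that code developed only against a local server tends never to see.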
Coding
- If our code breaks it's our bug even if the caller gave us bogus data, so we validate at API boundaries to prevent breakage inside our code.
- We aim to minimize our use of threading and locks because they can lead to service reliability issues.
- When we fix a bug, we aim to fix similar occurrences, as there may be other code that has the same type of problem. Gotta catch them all!
- When operationalizing services we endeavor to automate runbooks, as manual control increases the likelihood of outages. We aim to automate where the cost of developing automation substantially reduces ongoing costs or developer pain.
- We assume that hackers are at least as smart as we are, and do not rely upon secrecy for security. We make systems that fail closed, even in development, to avoid accidental back-doors. We do not trust the client, so we validate input, rate-limit, scan, audit and fuzz-test.
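Validating at the API boundary and failing closed can be illustrated with a small sketch. Everything here is hypothetical: the rename request, the field names, and the 16-character limit are assumptions made for the example. Untrusted input is converted into a typed request or rejected, and the authorization check denies access unless it can positively confirm the caller.

```python
from dataclasses import dataclass

MAX_NAME_LENGTH = 16  # hypothetical limit chosen for this example


class InvalidRequest(ValueError):
    """Raised at the API boundary so bad data never reaches game logic."""


@dataclass(frozen=True)
class RenameRequest:
    player_id: int
    new_name: str


def parse_rename_request(raw: object) -> RenameRequest:
    """Validate untrusted client input and return a typed request, or raise."""
    if not isinstance(raw, dict):
        raise InvalidRequest("request must be an object")
    player_id = raw.get("player_id")
    new_name = raw.get("new_name")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(player_id, bool) or not isinstance(player_id, int) or player_id <= 0:
        raise InvalidRequest("player_id must be a positive integer")
    if not isinstance(new_name, str) or not 1 <= len(new_name) <= MAX_NAME_LENGTH:
        raise InvalidRequest("new_name must be 1 to 16 characters")
    if not new_name.isalnum():
        raise InvalidRequest("new_name must contain only letters and digits")
    return RenameRequest(player_id, new_name)


def is_authorized(session_player_id: object, request: RenameRequest) -> bool:
    # Fail closed: a missing or mismatched session is always denied.
    return session_player_id is not None and session_player_id == request.player_id
```

Code behind the boundary can then accept a `RenameRequest` and assume its fields are well-formed, instead of re-checking them at every call site.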
Testing
- We are responsible for the testing and security of our systems. While our QA folks help, they are not responsible for finding our bugs and security defects; they're responsible for assessing product quality and game balance.
- We know bugs happen, so we engineer for ease of testing by developing test automation and simulation determinism. Automation eliminates time-consuming aspects of testing and increases test-frequency. Determinism simplifies testing because the same inputs lead to the same outputs. We decouple simulation (game logic) from inputs (keyboard, mouse, network packets) to increase reproducibility. We use exception reporting, assertions, static asserts, logging, error recovery, recording & playback, and fault isolation to increase reliability.
- We prefer API-testing to GUI-testing because it is less susceptible to breakage.
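The combination of determinism, decoupled inputs, and recording & playback described above might look roughly like the sketch below. The `Simulation` class and its commands are invented for illustration. The simulation depends only on a seed and a list of abstract commands, never on the keyboard, the network, or the wall clock, so a recorded session can be replayed exactly.

```python
import random
from dataclasses import dataclass


@dataclass
class Simulation:
    """Hypothetical game logic driven only by a seed and the commands it receives."""
    seed: int
    position: int = 0
    loot: int = 0

    def __post_init__(self) -> None:
        # A private, seeded generator: no global or time-based randomness.
        self.rng = random.Random(self.seed)

    def step(self, command: str) -> None:
        if command == "left":
            self.position -= 1
        elif command == "right":
            self.position += 1
        elif command == "open_chest":
            self.loot += self.rng.randint(1, 6)
        # Unknown commands are ignored rather than crashing the simulation.


def run(seed: int, commands: list[str]) -> tuple[int, int]:
    """Play back a recorded command list and return (position, loot)."""
    sim = Simulation(seed)
    for command in commands:
        sim.step(command)
    return sim.position, sim.loot


# A recorded session: in a real game these would be captured from input devices.
recorded = ["right", "right", "open_chest", "left", "open_chest"]
```

Calling `run(42, recorded)` twice yields the same result, which is what lets a bug report that includes the seed and the recording be reproduced on a developer's machine.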
Final words
We use these tenets as guidelines, and if they’re not helping, we can change them together.
Intrigued by what you’ve read? Check our openings at One More Game