What I wish I had known before scaling Uber to 1000 services
Why microservices
- Move and release independently
- Own your uptime (on-call)
- Use the “best” tool for the job
What are the costs
- Now you have a distributed system
- Everything is an RPC
- What if it breaks?
Less obvious costs
- Everything is a trade-off; you are going to give something up
- You can build around problems
- Might trade complexity for politics
- You get to keep your biases
Tech stack
- Pre-history: PHP (outsourced)
- Dispatch: Node.js, moving to Go
- Core Services: Python, moving to Go
- Maps: Python and Java
- Data: Python and Java
- Metrics: Go
Costs of different languages
- Hard to share code
- Hard to move between teams
- What I wish I knew: Fragments the culture
RPC
- HTTP/REST gets complicated
- JSON needs a schema (see the sketch after this list)
- RPCs are slower than local procedure calls
- What I wish I knew: Servers are not browsers
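As a concrete illustration of the "JSON needs a schema" point, here is a minimal Go sketch (the service and field names are hypothetical, and this is not Uber's actual RPC stack): without typed definitions, every consumer decodes into untyped maps and re-discovers field names and types the hard way.

```go
// Minimal sketch: treating the wire format as a typed contract vs. guessing.
package main

import (
	"encoding/json"
	"fmt"
)

// The struct is the schema: field names, types, and shape live in one place
// instead of in every caller's head. (DispatchRequest is a made-up example.)
type DispatchRequest struct {
	RiderID   string  `json:"rider_id"`
	Latitude  float64 `json:"latitude"`
	Longitude float64 `json:"longitude"`
}

func main() {
	payload := []byte(`{"rider_id":"r-42","latitude":37.77,"longitude":-122.42}`)

	// Schemaless decoding: everything is interface{}, and a typo like
	// "lattitude" is only discovered at runtime, deep in some handler.
	var loose map[string]interface{}
	_ = json.Unmarshal(payload, &loose)
	fmt.Println(loose["lattitude"]) // <nil> -- silently wrong

	// Typed decoding: mismatches surface here, at the boundary.
	var req DispatchRequest
	if err := json.Unmarshal(payload, &req); err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Printf("%+v\n", req)
}
```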
How many repos
- One repo: cross-project changes are easier, but a single giant repo was not really usable at this scale
- Multiple repos
- 7000 repositories
Operational
- What happens when things break? Downstream dependencies fail, and ultimately it is people, not systems, who fix your service
- Can other teams release your service?
- Understand a service in the larger context
Performance
- Profiling tools depend on the language, and they are all different; flamegraphs help as a common view
- If dashboards are not generated automatically, teams create their own; better to have every service expose metrics in a standard way so dashboards can be generated automatically and homogeneously across all teams (sketch after this list)
- Performance doesn't matter until it does, but you can set SLAs
- Probably want at least simple perf requirements
- What I wish I knew: “good” not required, but “known” is
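A minimal sketch of the "generate dashboards automatically" idea, assuming the open-source Prometheus Go client rather than whatever Uber ran internally: if every service exposes latency under the same metric name on a standard /metrics endpoint, dashboards can be stamped out identically for every team.

```go
// Minimal sketch using the Prometheus Go client (an assumption, not Uber's
// actual metrics stack): one shared latency metric plus a standard /metrics
// endpoint is what makes auto-generated, homogeneous dashboards possible.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Same metric name and labels in every service.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "rpc_request_duration_seconds",
		Help: "RPC handler latency.",
	},
	[]string{"handler", "status"},
)

// instrument wraps a handler and records its latency.
func instrument(name string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		requestDuration.WithLabelValues(name, "ok").Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/ping", instrument("ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	// The standard /metrics endpoint lets dashboards be generated per service
	// instead of hand-built per team.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```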
Fanout
- A request is only as fast as the slowest thing in its chain
- Overall latency >= latency of the slowest dependency (worked example after this list)
- Look at p95, p99, p99.9
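A back-of-the-envelope sketch of why the tail dominates under fanout (the 1% slow-request rate is just the definition of p99): if each backend exceeds its p99 1% of the time, a request that fans out to N backends in parallel hits at least one slow backend with probability 1 - 0.99^N.

```go
// Tail latency under fanout: probability that at least one of N parallel
// backends is at or beyond its p99 for a given request.
package main

import (
	"fmt"
	"math"
)

func main() {
	for _, n := range []int{1, 10, 100} {
		pSlow := 1 - math.Pow(0.99, float64(n))
		fmt.Printf("fanout=%3d -> %.0f%% of requests see p99-or-worse latency\n", n, 100*pSlow)
	}
	// fanout=  1 ->  1%
	// fanout= 10 -> 10%
	// fanout=100 -> 63%
}
```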
Tracing
- Lots of ways to get this: Zipkin, OpenTracing, etc.
- Best way to understand fanout
- Probably want sampling
- What I wish I knew: cross-lang context propagation
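A minimal sketch of cross-language context propagation using the OpenTracing Go API (one of the options named above; the operation names are made up): the client injects the span context into plain HTTP headers, and a server written in any language extracts it from those same headers, which is what keeps a trace intact across the Node.js/Python/Go/Java boundary.

```go
// Minimal sketch: propagate trace context as HTTP headers so any language
// on the call path can continue the same trace.
package main

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
)

// Client side: attach the current span's context to the outgoing request.
func callDownstream(span opentracing.Span, url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// The trace context travels as ordinary headers, readable from any language.
	opentracing.GlobalTracer().Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	)
	return http.DefaultClient.Do(req)
}

// Server side: continue the incoming trace instead of starting a new one.
func handler(w http.ResponseWriter, r *http.Request) {
	wireCtx, _ := opentracing.GlobalTracer().Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)
	span := opentracing.GlobalTracer().StartSpan("handle-request", opentracing.ChildOf(wireCtx))
	defer span.Finish()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```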
Logging
- Need consistent, structured logging (sketch after this list)
- Multiple languages makes this hard
- Logging floods can amplify problems
- What I wish I knew: accounting for log volume
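A minimal sketch of structured logging with go.uber.org/zap (used here only as an example of the idea, not as a claim about what was running at the time): every line is a set of typed, consistently named fields, which is what makes logs from services in different languages aggregatable and log volume attributable to a service.

```go
// Minimal sketch: structured, typed log fields with consistent names.
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger, err := zap.NewProduction() // JSON output with consistent keys
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Using the same field names everywhere ("service", "handler", ...) is
	// what enables cross-language aggregation and per-service accounting.
	logger.Info("request finished",
		zap.String("service", "dispatch"),
		zap.String("handler", "request_ride"),
		zap.Duration("latency", 42*time.Millisecond),
		zap.Int("status", 200),
	)
}
```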
Load testing
- Need to test against production
- Without breaking metrics
- Preferably all the time
- What I wish I knew: All systems need to handle “test” traffic
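A minimal sketch of "all systems need to handle test traffic" (the header name and metric helper are hypothetical): test load is served through the same code path as real traffic but accounted separately, so load testing against production all the time does not pollute production metrics.

```go
// Minimal sketch: recognize test traffic, serve it normally, account for it
// separately so it never skews production dashboards or alerts.
package main

import "net/http"

const testTrafficHeader = "X-Test-Traffic" // hypothetical header name

func withTestTrafficAccounting(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		isTest := r.Header.Get(testTrafficHeader) != ""

		// Serve test traffic through the exact same code path as real traffic.
		next(w, r)

		// ...but record it in a separate bucket.
		if isTest {
			recordMetric("requests_test_total")
		} else {
			recordMetric("requests_total")
		}
	}
}

// recordMetric is a stand-in for whatever metrics client the service uses.
func recordMetric(name string) {}

func main() {
	http.HandleFunc("/ping", withTestTrafficAccounting(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	http.ListenAndServe(":8080", nil)
}
```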
Failure testing
- What I wish I knew: People won’t like it
Migrations
- Old stuff still has to work; you cannot stop the world, and people are making changes all the time
- What happened to immutable?
- What I wish I knew: mandates are bad. People will be more willing to change if you provide a better system rather than forcing them off the current one
Open source
- Build/buy trade-off is hard
- Commoditisation
- What I wish I knew: This will make people sad
Politics
Services allow people to play politics. Politics happen when “Company > Team > Self” is out of order.
Trade-offs
Everything is a trade-off. Try to make them intentionally.