
Mastering Chaos: A Netflix Guide to Microservices

Microservices basics

What microservices are not:

  • Monolithic code base. Everyone contributed to a single code base that was released once a week; when a change introduced a problem, it was difficult and slow to debug because so many changes went out in a single deploy.
  • Monolithic database. When this went down, everything went down. Scaling it vertically was very expensive.
  • Tightly coupled architecture. One of the most painful points was the lack of agility; everything was deeply interconnected.

What is a microservice: An evolutionary response

  • Separation of concerns. Modularity, encapsulation
  • Scalability. Horizontally scaling, workload partitioning
  • Virtualisation and elasticity. Automated operations, on demand provisioning

Microservices as organs:

  • Each organ has a purpose
  • Organs form systems
  • Systems form an organism

Microservices are an abstraction:

  • You have a service that provides some functionality
  • The service may need to access some persistence mechanism like a database
  • You may provide a service client for accessing data operations
  • You may provide a cache for your client (e.g. EVCache)
  • You may need some orchestration between the service client and the cache, so you may actually need to provide a client library that goes to the cache first and, if that misses or fails, falls back to the service client that hits the microservice and persistence layer, then backfills the cache for the next call
  • All of this has to be embedded within the client application that wants to use the service

From the consuming application's point of view, the client library (which includes the service client cache, the service client, the service and the database) is the microservice. It is not a simple, static thing.
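A minimal sketch of that cache-first client-library orchestration, assuming hypothetical CacheClient and ServiceClient interfaces (names and types here are illustrative, not Netflix's actual APIs):

```java
// Illustrative cache-first client library: check the cache, fall back to the
// service on a miss, then backfill the cache for the next call.
public class VideoMetadataClient {
    private final CacheClient cache;     // e.g. an EVCache-style client (hypothetical interface)
    private final ServiceClient service; // REST/RPC client for the microservice (hypothetical)

    public VideoMetadataClient(CacheClient cache, ServiceClient service) {
        this.cache = cache;
        this.service = service;
    }

    public Metadata getMetadata(String videoId) {
        Metadata cached = cache.get(videoId);
        if (cached != null) {
            return cached;                       // cache hit: the service is never called
        }
        Metadata fresh = service.fetch(videoId); // cache miss: hit the service and its persistence layer
        cache.put(videoId, fresh);               // backfill the cache for the next call
        return fresh;
    }
}

interface CacheClient { Metadata get(String key); void put(String key, Metadata value); }
interface ServiceClient { Metadata fetch(String videoId); }
class Metadata { /* fields omitted for brevity */ }
```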

Challenges and solutions

Dependency

Intra-service requests

Everything is great until something breaks.

  • Network latency, congestion, failure.
  • Logical or scaling failure.

Cascading failure: if one service fails and there are no defences in place, the failure can cascade and take down your entire network.

Solution: Hystrix

  • Structured way of handling timeouts and retries
  • Fallbacks: if I cannot call a service, can I return some static response instead (degraded service) to allow the customer to continue using the product?
  • Isolated thread pools and the concept of circuits: if you keep hammering a service and it keeps failing, stop calling it and fail fast, returning the fallback instead
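A minimal sketch of this pattern using Hystrix's command API; the downstream call and the fallback value here are hypothetical placeholders:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a (hypothetical) recommendations service with a timeout,
// an isolated thread pool, a circuit breaker, and a static fallback.
public class RecommendationsCommand extends HystrixCommand<String> {
    private final String customerId;

    public RecommendationsCommand(String customerId) {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
        this.customerId = customerId;
    }

    @Override
    protected String run() {
        // The real network call to the downstream service would go here.
        return callRecommendationsService(customerId);
    }

    @Override
    protected String getFallback() {
        // Degraded but usable response when the call times out, fails,
        // or the circuit is open (fail fast).
        return "generic-popular-titles";
    }

    private String callRecommendationsService(String customerId) {
        throw new RuntimeException("placeholder: service call not implemented");
    }
}
```

Usage: `new RecommendationsCommand("customer-123").execute()` runs the command on its isolated thread pool and returns the fallback if the call fails.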

How do you know if it works at scale? Netflix created FIT (Fault Injection Testing) to test this. It injects failure metadata into Zuul, and that failure context is carried through the call path.

  • Synthetic transactions
  • Override by device or account
  • % of live traffic up to 100%
  • Enforced throughout the call path

How do we constrain testing scope (so you are not testing millions of permutations, or every downstream dependency of the services under test)? To address this, Netflix defined the set of critical microservices required for basic functionality to work, which is not all of them, and tests only those (by blacklisting all of the other, non-critical services). This has worked well to make sure the service still functions when all those non-critical dependencies go away, and it is a much simpler approach than testing point-to-point interactions.

Client libraries

  • Many clients
  • Common business logic
  • Common access patterns

Trade-offs for client libraries

  • Heap consumption
  • Logical defects
  • Transitive dependencies

We can limit client libraries and try to simplify them as much as possible.

Persistence

  • CAP theorem

In the presence of a network partition, you must choose between consistency and availability.

Netflix chose availability via Cassandra; systems are eventually consistent.
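As an illustration of trading strict consistency for availability (not Netflix's actual configuration; the keyspace, table, and values are hypothetical), a write with the DataStax Java driver at a relaxed consistency level is acknowledged by a single local replica and propagates to the others eventually:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ViewingHistoryWriter {
    public static void main(String[] args) {
        // Connect to a local Cassandra node; the "viewing" keyspace is hypothetical.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("viewing")) {

            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO history (customer_id, title_id) VALUES (?, ?)",
                    "customer-123", "title-456");

            // LOCAL_ONE: acknowledge once one replica in the local data centre has the write.
            // Remaining replicas converge in the background (eventual consistency), so the
            // write path stays available even if some replicas are unreachable.
            write.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            session.execute(write);
        }
    }
}
```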

Infrastructure

Do not put all your eggs into one basket. You can go multi-region.

Scale

Stateless services

  • Not a cache or a database
  • Frequently accessed metadata
  • No instance affinity
  • Loss of a node is a non-event; you can replace a node at no cost

Auto-scaling groups are fundamental for microservices. Advantages:

  • Compute efficiency, on-demand capacity
  • Node failure, nodes gets replaced easily
  • Traffic spikes, DDoS attack, etc.
  • Performance bugs, auto-scaling allows you to absorb damage while figuring out what happened

Surviving instance failure, thanks to Chaos Monkey (losing individual nodes).

Stateful services

  • Databases and caches
  • Custom apps which hold large amounts of data
  • Loss of a node is a notable event, it could take hours to recover

Redundancy is fundamental. EVCache is similar to memcached, but it writes to several availability zones for redundancy.
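A minimal sketch of the idea (not the real EVCache API): every write goes to a replica in each availability zone, and reads prefer the local zone, so losing a zone does not lose the cached data:

```java
import java.util.List;

// Illustrative zone-replicated cache; ZoneClient is a hypothetical memcached-style client.
public class ZoneReplicatedCache {
    private final List<ZoneClient> zones; // one client per availability zone

    public ZoneReplicatedCache(List<ZoneClient> zones) {
        this.zones = zones;
    }

    public void set(String key, String value) {
        // Write to every zone so each zone holds a full copy of the data.
        for (ZoneClient zone : zones) {
            zone.set(key, value);
        }
    }

    public String get(String key, String localZone) {
        // Read from the local zone first for low latency...
        for (ZoneClient zone : zones) {
            if (zone.name().equals(localZone)) {
                String value = zone.get(key);
                if (value != null) {
                    return value;
                }
            }
        }
        // ...and fall back to the other zones if the local copy is missing.
        for (ZoneClient zone : zones) {
            String value = zone.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}

interface ZoneClient {
    String name();
    void set(String key, String value);
    String get(String key);
}
```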

Hybrid services

It’s easy to take EVCache for granted

  • 30 million requests/sec
  • 2 trillion requests per day globally
  • Hundreds of billions of objects
  • Tens of thousands of memcached instances
  • It consistently scales linearly, no matter the load, with milliseconds of latency per request

The problem is that you may rely too much on EVCache. Solutions:

  • Workload partitioning, split cache by workload (real-time vs batch processes)
  • Request-level caching, so you are not hitting the service over and over: make the first hit expensive and the rest of them free through the lifecycle of the application (see the sketch after this list)
  • Secure token fallback, embed a secure token in the requests: if the subscriber service is unavailable, fall back to a datastore with that encrypted token so you have enough information to identify the customer and provide basic operations
  • Chaos under load, use tools to test your architecture
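A minimal sketch of request-level caching, assuming a hypothetical SubscriberService client; the first lookup is expensive, and later lookups for the same customer are free:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Caches subscriber lookups for the lifetime of a request/application scope,
// so the downstream subscriber service is hit at most once per customer.
public class RequestScopedSubscriberCache {
    private final SubscriberService service;                        // hypothetical downstream client
    private final Map<String, Subscriber> cache = new ConcurrentHashMap<>();

    public RequestScopedSubscriberCache(SubscriberService service) {
        this.service = service;
    }

    public Subscriber lookup(String customerId) {
        // Only the first call for a given customer hits the subscriber service;
        // subsequent calls are served from the in-memory map.
        return cache.computeIfAbsent(customerId, service::fetch);
    }
}

interface SubscriberService { Subscriber fetch(String customerId); }
class Subscriber { /* fields omitted for brevity */ }
```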

Variance within your architecture

The more variety you have in your system, the more complex and difficult to manage it becomes.

Operational drift

Unintentional variance that creeps in over time.

Over time

  • Alert thresholds
  • Timeouts, retries, fallbacks
  • Throughput (RPS)

Across microservices

  • Reliability best practices

The first time you talk with teams about this they will be very enthusiastic; however, because this work is tedious, repetitive, and not related to product work, people tend to avoid it.

They addressed this with a cycle of continuous learning and automation: an incident gets resolved, a review follows, then a remediation plan and analysis; learnings are extracted into best practices, automated wherever possible, and adoption is driven across teams.

Production ready checklist
  • Alerts
  • Apache and Tomcat
  • Automated canary analysis
  • Autoscaling
  • Chaos
  • Consistent naming
  • ELB config
  • Healthcheck
  • Immutable machine images
  • Squeeze testing
  • Staged, red/black deployments
  • Timeouts, retries, fallbacks

Polyglot (new languages) and containers

Intentional variance.

The paved road, focused on Java and EC2

  • Stash
  • Nebula/Gradle
  • BaseAMI/Ubuntu
  • Jenkins
  • Spinnaker
  • Runtime platform

Some engineers went off road, building their own roads in Python, Ruby, and NodeJS; each provided value in some sense. When Docker was introduced, things went a bit wild.

Cost of variance

  • Productivity tooling
  • Insight and triage capabilities
  • Base image fragmentation
  • Node management
  • Library/platform duplication
  • Learning curve, production expertise

Instead of a single paved road, there are now multiple paved roads, which makes life difficult for the teams that support engineering infrastructure.

The strategic stance on this cost was to:

  • Raise awareness of costs
  • Constrain centralised support, focusing especially on the JVM and, of course, Docker
  • Prioritise by impact
  • Seek reusable solutions

How do we achieve velocity without worrying about breaking things all the time?

Global cloud management and delivery: Spinnaker, an automated delivery system.

  • Conformity checks
  • Red/black pipelines
  • Automated canaries
  • Staged deployments (one region at a time)
  • Squeeze tests

Organisation and architecture

Electronic Delivery, NRDP 1.x. It wasn't called Streaming yet.

  • Simple UI, “Queue Reader”
  • Collaborative design
  • XML payloads
  • Custom responses
  • Versioned firmware releases
  • Long cycles

In parallel, the Netflix API: let a thousand flowers bloom. It wasn't very successful, but after that it started to be used privately.

  • Content Metadata
  • General REST API
  • JSON schema
  • HTTP response codes
  • OAuth security model (it was important for 3rd party apps)

Hybrid architecture: now there are these two edge services functioning in very different ways. Distinct in:

  • Services
  • Protocols
  • Schemas
  • Security

There was a lot of friction between teams; as a client developer you would have to change context all the time.

Josh: What is the right long term architecture? Peter: Do you care about the organisational implications?

Conway’s law

Any piece of software reflects the organisational structure that produced it.

If you have four teams working on a compiler you will end up with a four pass compiler.

This is not solutions first; this is organisation first.

Outcomes and lessons

They addressed this by unifying these edge services around the client.

Outcomes

  • Productivity and new capabilities
  • Refactored organisation

Lessons

  • Solutions first, team second
  • Reconfigure teams to best support your architecture

Microservice architecture is complex and organic.

Health depends on discipline and injecting chaos.

Dependency

  • Circuit breakers, fallbacks, chaos
  • Simple clients
  • Eventual consistency

Scale

  • Auto-scaling
  • Redundancy, avoid SPoF
  • Partitioned workloads
  • Failure-driven design
  • Chaos under load

Variance

  • Engineered operations
  • Understood cost of variance
  • Prioritised support by impact

Change

  • Automated delivery
  • Integrated practices

Organisation and architecture

  • Solutions first, team second