Hi,
I'm Tom.

I'm a Python web developer living in Shanghai. This is my personal website and blog.


Microservice health check in Kubernetes

TL;DR

Each service should provide a standard endpoint for health checking and monitoring. The specification for this endpoint should conform to the requirements elaborated in the Requirements section.

Background

What is a health check

A health check detects the health status of a service: it reports whether the service is able to handle requests, or whether it is in a bad state and should be restarted.

Why health checks are needed

High availability

There are many cases in which a service is started or restarted: a new deployment, scaling out, recovery from a crash, and so on.

Under these circumstances, if a request is forwarded to a service instance that is still in the middle of starting or restarting, it will probably fail. So we need to make sure an instance is healthy and able to accept requests before adding it to the load balancer (the Kubernetes Service); this reduces service downtime and helps achieve high availability.

Service stability

A service that has been running for a long time may fall into a bad state in which it can no longer handle requests properly. In that case the service should be prevented from receiving requests until it recovers, either via a restart or manual intervention. This keeps the service as a whole stable.

Monitoring

A big part of the DevOps responsibility is monitoring and maintaining the health of running services. If a service goes down, appropriate action should be taken to bring it back to life. Health checks tell DevOps whether a service is malfunctioning.

Clients of health checks

Typical clients of the health check endpoint are the container orchestrator (the Kubernetes liveness and readiness probes), the load balancer, and monitoring systems such as Prometheus; see the Client Integration section below.

Downsides of health check

As health checks are done periodically rather than in real time, there can be a time gap before an unhealthy state becomes known to the clients. For example, with a 10-second checking period and a failure threshold of 3, an unhealthy instance may keep receiving traffic for 30 seconds or more before it is taken out of rotation. To mitigate this, set a reasonable checking period.

Requirements

What should be checked

The definition of healthy may vary from service to service, depending on the application logic, so there can be many levels of health checking:

Each service may define its own criteria; however, the result of these checks should be unambiguous, i.e. the service is either healthy or unhealthy, with no middle state.
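
As a rough sketch of this idea, whatever individual checks a service defines can be collapsed into a single unambiguous result plus a detail map. The check functions below are hypothetical placeholders:

# A minimal sketch: collapse individual checks into one unambiguous result.
# check_mysql / check_foo_service are hypothetical placeholders for whatever
# a particular service considers part of being "healthy".

def check_mysql():
    return True  # e.g. run "SELECT 1" against the database

def check_foo_service():
    return True  # e.g. ping a downstream dependency

def run_checks():
    results = {
        "mysql": check_mysql(),
        "fooService": check_foo_service(),
    }
    healthy = all(results.values())  # no middle state: healthy or not
    return healthy, results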

How to expose the health check to clients

The check should be exposed as an HTTP endpoint, e.g. GET /health, as used in the examples below.

How the health check responds to clients

Status code

The endpoint should return 200 OK when the service is healthy and 503 Service Unavailable when it is not, as shown in the examples below.

Response body

The response body can be empty, but attaching additional information about what was checked and the result of each check is preferred.

Security/Access control

The health check endpoint should be private and limited to internal access. However, if it has to be open to public access, it should at least require authentication, as in the authenticated-access example below.

Implementation

Examples

Service OK

$ curl -i -XGET http://127.0.0.1:9000/health
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
 
{
    "status": "UP"
}

Service Unavailable

$ curl -i -XGET http://127.0.0.1:9000/health
HTTP/1.1 503 Service Unavailable
Content-Type: application/json; charset=utf-8
 
{
    "status": "DOWN"
}

Authenticated access

$ curl -i -XGET http://127.0.0.1:9000/health -H 'Authorization: Basic ZnNfbm9ybWFsOkBDZ0JkSjZOKz9TbmQhRytIJEI3'
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
 
{ 
  "status":"UP",
  "fooService":{ 
    "status":"UP",
    "description":"Foo service"
  },
  "mysql":{ 
    "status":"UP",
    "description":"MySQL Database",
    "hello":1
  }
}

Libraries

Java

Spring Boot Actuator

Go

N/A
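
For a Python service with no off-the-shelf library, the endpoint is easy to hand-roll. Below is a minimal sketch using Flask; the library choice and the stubbed run_checks() are illustrative only, not a prescribed implementation:

# A minimal /health endpoint sketch in Flask (illustrative only).
# It follows the convention above: 200 when healthy, 503 when not,
# with a JSON body describing the individual checks.
from flask import Flask, jsonify

app = Flask(__name__)

def run_checks():
    # placeholder: aggregate the real dependency checks here
    return True, {"mysql": True, "fooService": True}

@app.route("/health")
def health():
    healthy, results = run_checks()
    body = {"status": "UP" if healthy else "DOWN"}
    for name, ok in results.items():
        body[name] = {"status": "UP" if ok else "DOWN"}
    return jsonify(body), 200 if healthy else 503

if __name__ == "__main__":
    app.run(port=9000)

Running this and curling /health produces responses in the shape of the examples above.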

Client Integration

Kubernetes integration

Please refer to https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

Readiness and liveness probes can be used in parallel for the same container. Using both can ensure that traffic does not reach a container that is not ready for it, and that containers are restarted when they fail.

Readiness Probe

readinessProbe: # checks whether the pod is ready to serve traffic; if the probe fails, the pod is removed from the service/load balancer endpoints
    httpGet:
        path: /health
        port: 9000
    initialDelaySeconds: 10 # start checking 10 seconds after the pod starts; keep this small so the pod can receive requests as soon as it is ready
    periodSeconds: 10 # check the health API every 10 seconds
    timeoutSeconds: 3 # if the response takes longer than 3 seconds, the check counts as failed
    failureThreshold: 3 # after 3 consecutive failures the pod is considered not ready and is removed from the service endpoints
    successThreshold: 1 # a single success marks the pod as ready again

Liveness Probe

livenessProbe: # checks whether the pod is in a bad state; if the probe fails, the pod is restarted
    httpGet:
        path: /health
        port: 9000
    initialDelaySeconds: 180 # start checking 180 seconds after the pod starts; should be longer than the service start-up time (some services take minutes to start, hence the large value)
    periodSeconds: 10 # check the health API every 10 seconds
    timeoutSeconds: 3 # if the response takes longer than 3 seconds, the check counts as failed
    failureThreshold: 3 # after 3 consecutive failures the pod is considered to be in a bad state and is restarted
    successThreshold: 1 # a single success marks the pod as healthy again

Prometheus integration

Prometheus polls the health API periodically and stores the results in its time-series database. If the health check metrics match a predefined alert rule, an alert is triggered.

Scrape config

- job_name: 'health-check'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  kubernetes_sd_configs:
  - role: service
 
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_healthcheck]
      regex: true
      action: keep
    - source_labels: [__meta_kubernetes_service_name]
      target_label: service
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}/health
    - source_labels: [__param_target]
      regex: (.*)
      target_label: instance
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: blackbox-exporter-service:9115  # Blackbox exporter.

Service annotation

Add the prometheus.io/healthcheck annotation to Kubernetes Services so that they can be discovered by the health-check job.

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/healthcheck: "true"
  name: foo-service
  namespace: foo
  labels:
    app: foo-service
spec:
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
  selector:
    app: foo

Blackbox exporter config

Configure an http_2xx module to probe the health API:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers: {}
      no_follow_redirects: false
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_matches_regexp: []
      fail_if_not_matches_regexp: []
 

Build a simple protocol over TCP

Disclaimer: I am not an expert on TCP or protocol design; this post is just about my learning experience of building a protocol over TCP :)

A rookie mistake

When I was playing with sockets, a rookie mistake I made was assuming that each send on one side corresponds to exactly one receive on the other, as in the following example:

server.py

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind((socket.gethostname(), 2333))
sock.listen(1)
connection, address = sock.accept()
while True:
    data = connection.recv(1024)
    if not data:
        break
    print "Received: %s" % data

client.py

import socket

socket_address = socket.gethostname(), 2333
connection = socket.create_connection(socket_address)
connection.send("Hello there!")
connection.send("Bye bye!")
connection.close()

Running the client, I got the following output on the server:

Received: Hello there!Bye bye!


So two sends resulted in one receive, not two receives as expected. Hah. This comes from a misunderstanding of how TCP works.

TCP is a stream-oriented protocol, not a packet/message-oriented protocol like UDP. I like this analogy: TCP is like making a phone call, where a connection must be established before either end can talk, and while you talk the data flows as a stream over the connection. UDP, on the other hand, is like sending a text message.

The boundary

However, this rookie mistake got me thinking: when we build an application on top of a TCP socket, for example a chat application, how do we know where each message ends, given that the data arrives as a stream? Where is the boundary between two messages? Something has to handle this at the application level.

1. Delimiter

Back to the phone call analogy: let's say foo is reading a poem to bar over the phone. How does bar know when foo finishes a line? How does bar know when foo finishes the whole poem? Does the wire do that for you? No. What we know from common sense is that there is a pause when you finish a line, and maybe a longer pause when you finish the poem. Similarly, maybe we can put a "pause" at the end of each message, just like the \r\n in HTTP headers.

Here is an improved version of the previous code using \r\n as the delimiter:

server.py

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind((socket.gethostname(), 2333))
sock.listen(1)
connection, address = sock.accept()

while True:
    data = connection.recv(1024)
    if not data:
        break
    else:
        for line in data.split("\r\n"):
            if line:
                print "Received: %s" % line

client.py

import socket

socket_address = socket.gethostname(), 2333
connection = socket.create_connection(socket_address)
connection.send("Hello there!\r\n")
connection.send("Bye bye!\r\n")
connection.close()

Now the output shows the two messages separately:

Received: Hello there!
Received: Bye bye!

The downside of this approach is that when a message is longer than 1024 bytes, or when it arrives split across recv calls, a single recv gives you only part of the message. We would need a buffer that accumulates data until a delimiter shows up.
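
A rough sketch of such a buffered reader, keeping the style of the examples above, might look like this:

def recv_delimited(conn, delimiter="\r\n"):
    # yield complete messages, buffering data until a delimiter arrives
    buf = ""
    while True:
        data = conn.recv(1024)
        if not data:
            break
        buf += data
        # a single recv may contain zero, one, or several delimiters
        while delimiter in buf:
            message, buf = buf.split(delimiter, 1)
            yield message

# usage on the server side:
# for message in recv_delimited(connection):
#     print "Received: %s" % message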

2. Fixed length or length prefix

What if all messages have a fixed length? Shorter messages can be padded with spaces, something like:

connection.send("Hello there!".ljust(140))

The server then just needs to keep reading a fixed number of bytes from the socket. This works, but there is still a hard limit on the length of a message.
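
For completeness, a sketch of the server side of this fixed-length scheme (using the same 140-byte size as above) could look like:

FIXED_LENGTH = 140  # must match the padding length used by the sender

def recv_fixed(conn, size):
    # keep reading until exactly `size` bytes have arrived (or the peer closes)
    buf = ""
    while len(buf) < size:
        chunk = conn.recv(size - len(buf))
        if not chunk:
            return ""
        buf += chunk
    return buf

while True:
    message = recv_fixed(connection, FIXED_LENGTH)
    if not message:
        break
    print "Received: %s" % message.rstrip()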

What if we tell the server the length of each message beforehand? We can do that by prefixing each message with its length. Yes, just like the Content-Length header in HTTP.

make_message = lambda x: str(len(x)).ljust(4) + x
connection.send(make_message("Hello there!"))
connection.send(make_message("Bye bye!"))

Here we prefix each message with a 4-byte string indicating its length. The server first reads those 4 bytes to get the length, then reads exactly that many bytes. The recvall function keeps reading until the requested number of bytes has arrived; with a plain recv there is a chance we get only part of the transmitted data, although on a local machine that chance is low.

connection, address = sock.accept()

def recvall(conn, remains):
    buf = ""
    while remains:
        data = conn.recv(remains)
        if not data:
            break
        buf += data
        remains -= len(data)
    return buf

while True:
    data = recvall(connection, 4)
    if not data:
        break
    length = int(data)
    message = recvall(connection, length)

    print "Received: %s" % message

At this point we have something like a little protocol on top of the TCP layer, and it achieves the original goal.
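
As a side note, real protocols usually encode the length prefix as a fixed-size binary integer rather than a padded ASCII string. A sketch of that variant, reusing the recvall helper above, could be:

import struct

def send_message(conn, payload):
    # 4-byte big-endian unsigned length, followed by the payload
    conn.sendall(struct.pack("!I", len(payload)) + payload)

def recv_message(conn):
    header = recvall(conn, 4)
    if len(header) < 4:
        return None  # connection closed
    (length,) = struct.unpack("!I", header)
    return recvall(conn, length)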

Native protocol of Cassandra

Now that we have a protocol of our own, simple and naive as it is, I'd like to take a look at a serious protocol built on TCP. Since I've been working with Cassandra a lot lately, I might as well look at its protocol.

The CQL native protocol is Cassandra's client protocol, and it is built on top of TCP:

The CQL binary protocol is a frame based protocol. Frames are defined as:

  0         8        16        24        32
  +---------+---------+---------+---------+
  | version |  flags  | stream  | opcode  |
  +---------+---------+---------+---------+
  |                length                 |
  +---------+---------+---------+---------+
  |                                       |
  .            ...  body ...              .
  .                                       .
  .                                       .
  +----------------------------------------

Frames can be regarded as what we called messages in the previous examples. Apart from the first 32 bits of header fields, the length and body parts are just what we used, so our approach looks practical.
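
Just to connect it back to the earlier code, reading one such frame with the recvall helper would look roughly like this (a sketch based only on the layout above, not a complete client):

import struct

def read_frame(conn):
    # version, flags, stream and opcode are one byte each; length is 4 bytes
    header = recvall(conn, 8)
    if len(header) < 8:
        return None  # connection closed
    version, flags, stream, opcode, length = struct.unpack("!BBBBI", header)
    body = recvall(conn, length)
    return version, flags, stream, opcode, body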

So that's it. There are surely more technical details involved in building a full-fledged protocol, but the fundamentals should work the same way.

 

Cassandra: A Journey of Upgrade

For the past couple of months, a huge burden on my shoulders has been upgrading our Cassandra cluster from 1.2.6 to 2.1. I invested a lot of working hours in figuring out the solution. Now that it is done, I feel it is worthwhile to write down the whole experience.

Why Upgrade?

The immediate reason is that we needed transaction support in one of our services, and Cassandra 2.0 introduced a feature called lightweight transactions. Although it is "lightweight", it is enough to solve our problem.

Besides that, there are also a couple of new features we can benefit from after the upgrade:

Infrastructure

Upgrade Path

Driver upgrade

We had been using a fairly old Cassandra driver called Pycassa, which is no longer maintained. It is based on the Thrift protocol, which is deprecated and dropped in Cassandra 3, so all the new and good stuff in the native protocol is unavailable to Pycassa. Naturally we switched to the recommended official driver maintained by DataStax.

Internally we don't have an abstraction layer over Cassandra, so the refactoring was a lot of pain. We had to replace every usage of Pycassa across all services and carefully update all the unit tests.

We also bumped into some issues when deploying with the new driver:

The driver upgrade was not as smooth as I had thought. There was a lot of back and forth, and it took us almost two months to ship it.

No rolling upgrade?

A rolling upgrade should be the default option for a cluster upgrade, but unfortunately it is not supported between major versions of Cassandra, as documented here. We thought about workarounds, such as building a new cluster and syncing data between the two clusters, but building a new cluster was not an option for us due to some "policy". So we decided we could tolerate some downtime, which also means we would upgrade each Cassandra instance in place.

Data backup and restore

It's important to have a backup of the data so that, if something goes wrong, we can go back to a save point. While taking the backup, we require that all services accessing Cassandra be stopped and that the data stay untouched for the duration of the process.

Below is a typical structure of one of our Cassandra nodes:

/mnt/cassandra/
├── commitlog_directory
└── data_file_directories

data_file_directories is where the Cassandra data files live, and our goal is to back up this directory. We first run a 'nodetool drain' on the node, which flushes all memtables to data files. After that we pack data_file_directories into one tarball and upload it to the cloud (to guard against a disk failure on the node), so we end up with two copies of the data.

Procedure:

If something goes wrong and we want to abort the upgrade and go back to the old version, we simply retrieve the old data and unpack it into the Cassandra data file directory.

The backup and restore procedures are automated with Ansible scripts.

Upgrade

Upgrading directly from our current version 1.2.6 to 2.1 is not possible, since pre-2.0 SSTables are not supported by 2.1. On a direct upgrade to 2.1, Cassandra fails to start and the following error is raised:

java.lang.RuntimeException: Incompatible SSTable found. Current version ka is unable to read file: /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ic-1. Please run upgradesstables.
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:443) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:420) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:327) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:280) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.Keyspace.open(Keyspace.java:122) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.Keyspace.open(Keyspace.java:99) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:558) ~[apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:214) [apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:443) [apache-cassandra-2.1.1.jar:2.1.1]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:532) [apache-cassandra-2.1.1.jar:2.1.1]

So we first upgraded to 2.0.0 and ran the upgradesstables command to upgrade the SSTables. After that, we upgraded from 2.0.0 to 2.1.

Cassandra keeps an internal version for SSTables. During the upgrade, the SSTable version is bumped as follows:

ic (1.2.6) --> ja (2.0.0) --> ka (2.1.3)

Procedure:

The procedure looks simple and clear, but we ran into a couple of issues during the test upgrade:

Data consistency

How do we ensure data is not corrupted during the upgrade? This should be guaranteed by Cassandra itself. Nevertheless, during upgrade testing we used a script to dump all Cassandra data before and after the upgrade and verify that the data was untouched. This step was dropped for the actual upgrade.
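
The core idea of that script can be sketched as follows, using the DataStax Python driver; the keyspace and table names here are hypothetical, and the real script covered the whole schema:

# Sketch of the before/after consistency check: dump every row of a table,
# sort the rows and hash the result, so two dumps can be compared cheaply.
import hashlib
from cassandra.cluster import Cluster

def table_digest(host, keyspace, table):
    cluster = Cluster([host])
    session = cluster.connect(keyspace)
    rows = session.execute("SELECT * FROM %s" % table)
    lines = sorted(str(row) for row in rows)
    cluster.shutdown()
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

# before = table_digest("127.0.0.1", "foo", "bar")  # run before the upgrade
# ... upgrade ...
# after = table_digest("127.0.0.1", "foo", "bar")   # run after the upgrade
# assert before == after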

 