1 of 9

Concepts

Deployment - Key Concepts

This section contains a list of documents elaborating on the key concepts aiding the deployment of the DIGIT platform

Security Practices

DIGIT being a container-based platform and orchestrated on Kubernetes, this page discusses some key security practices to protect the infrastructure

On this page:

Introduction

Security is always a difficult subject to approach either by lack of experience; or by the fact you should know when the level of security is right for what you have to secure.

Security is a major concern when it comes to government systems and infra. As an architect, we can consider that working with technically educated people (engineers, experts) and tools (systems, frameworks, IDE) should prevent key VAPT issues.

However, it’s quite difficult to avoid, a certain infatuation from different categories of people to try to hack the systems.

Infra Security

1. Update to the latest version

There aren’t only bug fixes in each release but also new security measures to require advantage of them, we recommend working with the newest stable version.

Updates and support could also be harder than the new features offered in releases, so plan your updates a minimum of once a quarter. Significantly simplified updates can utilize the providers of managed Kubernetes solutions.

2. Enable role-based access control (RBAC)

Use RBAC (Role-Based Access Control) to regulate who can access and what rights they need. Usually, RBAC is enabled by default in version 1.6 and later (or later for a few providers), but if you’ve got been updated since then and didn’t change the configuration, you ought to double-check your settings.

However, enabling RBAC isn’t enough — it still must be used effectively. within the general case, the rights to the whole cluster (cluster-wide) should be avoided, giving preference to rights in certain namespaces. Avoid giving someone cluster administrator privileges even for debugging — it’s much safer to grant rights only necessary and from time to time.

If the appliance requires access to the Kubernetes API, create separate service accounts. and provides them with the minimum set of rights required for every use case. This approach is far better than giving an excessive amount of privilege to the default account within the namespace.

3. Use namespaces to set security boundaries

Creating separate namespaces is vital because of the first level of component isolation. it’s much easier to regulate security settings — for instance, network policies — when different types of workloads are deployed in separate namespaces.

4. Separate sensitive workloads

A good practice to limit the potential consequences of compromise is to run workloads with sensitive data on a fanatical set of machines. This approach reduces the risk of a less secure application accessing the application with sensitive data running in the same container executable environment or on the same host.

For example, a Kubelet of a compromised node usually has access to the contents of secrets only if they are mounted on pods that are scheduled to be executed on the same node. If important secrets are often found on multiple cluster nodes, the attacker will have more opportunities to urge them.

Separation can be done using node pools (in the cloud or on-premises), as well as Kubernetes controlling mechanisms, such as namespaces, taints, tolerations, and others.

5. Protect access to cloud service metadata

Sensitive metadata — for instance, Kubelet administrative credentials, are often stolen or used with malicious intent to escalate privileges during a cluster. For example, a recent find within Shopify’s bug bounty showed in detail how a user could exceed authority by receiving metadata from a cloud provider using specially generated data for one of the microservices.

The GKE metadata concealment function changes the mechanism for deploying the cluster in such how that avoids such a drag. And we recommend using it until a permanent solution is implemented.

6. Create and define cluster network policies

Network Policies — allow you to control access to the network in and out of containerized applications. To use them, you must have a network provider with support for such a resource. For managed Kubernetes solution providers such as Google Kubernetes Engine (GKE), support will need to be enabled.

Once everything is ready, start with simple default network policies — for example, blocking (by default) traffic from other namespaces.

7. Set the Pod Security Policy for the cluster

Pod Security Policy sets the default values used to start workloads in the cluster. Consider defining a policy and enabling the Pod Security Policy admission controller: the instructions for these steps vary depending on the cloud provider or deployment model used.

In the beginning, you might want to disable the NET_RAW capability in containers to protect yourself from certain types of spoofing attacks.

8. Work on node security

To improve host security, you can follow these steps:

Ensure that the host is securely and correctly configured. One way is CIS Benchmarks; Many products have an auto checker that automatically checks the system for compliance with these standards.
Monitor the network availability of important ports. Ensure that the network is blocking access to the ports used by Kubelet, including 10250 and 10255. Consider restricting access to the Kubernetes API server — with the exception of trusted networks. In clusters that did not require authentication and authorization in the Kubelet API, attackers used to access to such ports to launch cryptocurrency miners.
Minimize administrative access to Kubernetes hosts Access to cluster nodes should in principle be limited: for debugging and solving other problems, as a rule, you can do without direct access to the node.

9. Enable Audit Logging

Make sure that audit logs are enabled and that you are monitoring for the occurrence of unusual or unwanted API calls in them, especially in the context of any authorization failures — such entries will have a message with the “Forbidden” status. Authorization failures can mean that an attacker is trying to take advantage of the credentials obtained.

Managed solution providers (including GKE) provide access to this data in their interfaces and can help you set up notifications in case of authorization failures.

Conclusion

Readiness & Liveness

Overview of various probes that we can setup to ensure the service deployment and the availability of the service is ensured automatically.

On this page:

Probes overview
Kubernetes probes
Readiness probes
Liveness probes
Startup probes
Configuring probes actions
Best practices
Tools

Probes Overview

Determining the state of a service based on readiness, liveness, and startup to detect and deal with unhealthy situations. It may happen that the application needs to initialize some state, make database connections, or load data before handling application logic. This gap in time between when the application is actually ready versus when Kubernetes thinks is ready becomes an issue when the deployment begins to scale and unready applications receive traffic and send back 500 errors.

Many developers assume that when basic pod setup is adequate, especially when the application inside the pod is configured with daemon process managers (e.g. PM2 for Node.js). However, since Kubernetes deems a pod as healthy and ready for requests as soon as all the containers start, the application may receive traffic before it is actually ready.

Kubernetes Probes

Kubernetes supports readiness and liveness probes for versions ≤ 1.15. Startup probes were added in 1.16 as an alpha feature and graduated to beta in 1.18 (WARNING: 1.16 deprecated several Kubernetes APIs. Use this migration guide to check for compatibility).

All the probes have the following parameters:

initialDelaySeconds : number of seconds to wait before initiating liveness or readiness probes
periodSeconds: how often to check the probe
timeoutSeconds: number of seconds before marking the probe as timing out (failing the health check)
successThreshold : minimum number of consecutive successful checks for the probe to pass
failureThreshold : number of retries before marking the probe as failed. For liveness probes, this will lead to the pod restarting. For readiness probes, this will mark the pod as unready.

Readiness Probes

Readiness probes are used to let Kubelet know when the application is ready to accept new traffic. If the application needs some time to initialize the state after the process has started, configure the readiness probe to tell Kubernetes to wait before sending new traffic. A primary use case for readiness probes is directing traffic to deployments behind a service.

One important thing to note with readiness probes is that it runs during the pod’s entire lifecycle. This means that readiness probes will run not only at startup but repeatedly throughout as long as the pod is running. This is to deal with situations where the application is temporarily unavailable (i.e. loading large data, waiting on external connections). In this case, we don’t want to necessarily kill the application but wait for it to recover. Readiness probes are used to detect this scenario and not send traffic to these pods until it passes the readiness check again.

Liveness Probes

Liveness probes are used to restart unhealthy containers. The Kubelet periodically pings the liveness probe, determines the health, and kills the pod if it fails the liveness check.

Liveness checks can help the application recover from a deadlock situation. Without liveness checks, Kubernetes deems a deadlocked pod healthy since the underlying process continues to run from Kubernetes’s perspective. By configuring the liveness probe, the Kubelet can detect that the application is in a bad state and restarts the pod to restore availability.

Startup Probes

Startup probes are similar to readiness probes but only executed at startup. They are optimized for slow-starting containers or applications with unpredictable initialization processes. With readiness probes, we can configure the initialDelaySeconds to determine how long to wait before probing for readiness. Now consider an application where it occasionally needs to download large amounts of data or do an expensive operation at the start of the process. Since initialDelaySeconds is a static number, we are forced always to take the worst-case scenario (or extend the failureThreshold one that may affect long-running behaviour) and wait for a long time even when that application does not need to carry out long-running initialization steps. With startup probes, we can instead configure failureThreshold and periodSeconds to model this uncertainty better. For example, setting failureThreshold to 15 and periodSeconds to 5 means the application will get 10 x 5 = 75s to startup before it fails.

Configuring Probe Actions

Now that we understand the different types of probes, we can examine the three distinct ways to configure each probe.

HTTP

The Kubelet sends an HTTP GET request to an endpoint and checks for a 2xx or 3xx response. You can reuse an existing HTTP endpoint or set up a lightweight HTTP server for probing purposes (e.g. an Express server with /healthz endpoint).

HTTP probes take in additional parameters:

host : hostname to connect to (default: pod’s IP)
scheme : HTTP (default) or HTTPS
path : path on the HTTP/S server
httpHeaders : custom headers if you need header values for authentication, CORS settings, etc
port : name or number of the port to access the server

livenessProbe:
   httpGet:
     path: /healthz
     port: 8080

TCP

To check whether or not a TCP connection can be made, you can specify a TCP probe. The pod is marked healthy if it can establish a TCP connection. Using a TCP probe may be useful for a gRPC or FTP server where HTTP calls may not be suitable.

readinessProbe:
   tcpSocket:
     port: 21

Command

Finally, a probe can be configured to run a shell command. The check passes if the command returns with exit code 0; otherwise, the pod is marked as unhealthy. This type of probe may be useful if it is not desirable to expose an HTTP server/port or if it is easier to check initialization steps via command (e.g. check if a configuration file has been created, run a CLI command).

readinessProbe:
   exec:
     command: ["/bin/sh", "-ec", "vault status -tls-skip-verify"]

Best Practices

The exact parameters for the probes depend on your application, but here are some general best practices to get started:

For older (≤ 1.15) Kubernetes clusters, use a readiness probe with an initial delay to deal with the container startup phase (use p99 times for this). But make this check lightweight since the readiness probe will execute throughout the entire lifecycle of the pod. We don’t want the probe to time out because the readiness check takes a long time to compute.
For newer (≥ 1.16) Kubernetes clusters, use a startup probe for applications with unpredictable or variable startup times. The startup probe may share the same endpoint (e.g. /healthz ) as the readiness and liveness probes, but set the failureThreshold higher than the other probes to account for longer start times, but more reasonable time to failure for liveness and readiness checks.
Readiness and liveness probes may share the same endpoint if the readiness probes aren’t used for other signalling purposes. If there’s only one pod (i.e. using a Vertical Pod Autoscaler), set the readiness probe to address the startup behaviour and use the liveness probe to determine health. In this case, marking the pod unhealthy means downtime.
Readiness checks can be used in various ways to signal system degradation. For example, if the application loses connection to the database, readiness probes may be used to temporarily block new requests and allow the system to reconnect. It can also be used to load balance work to other pods by marking busy pods as not ready.

In short, well-defined probes generally lead to better resilience and availability. Be sure to observe the startup times and system behaviour to tweak the probe settings as the applications change.

Tools

Considering the significance of Kubernetes probes, you can utilize a Kubernetes resource analysis tool to identify any missing probes. These tools can be executed against existing clusters or integrated into the CI/CD pipeline to automatically reject workloads that don't have properly configured resources.

Polaris: a resource analysis tool with a nice dashboard that can also be used as a validating webhook or CLI tool.
Kube-score: a static code analysis tool that works with Helm, Kustomize, and standard YAML files.
Popeye: read-only utility tool that scans Kubernetes clusters and reports potential issues with configurations.

Resource Requests & Limits

“Resource Request” and a “Resource Limit” when defining how many resources a container within a pod should receive

On this page:

Resource requests
Resource limits
Common practices
Resource requests & limits in action

Containerising applications and running them on Kubernetes does not mean we can forget all about resource utilization. Our thought process may have changed because we can easily scale out our application as demand increases. We need to consider frequently how our containers might fight with each other for resources. Resource requests and limits can be used to help stop the “noisy neighbour” problem in a Kubernetes Cluster.

Resource Requests

To put things simply, a resource request specifies the minimum amount of resources a container needs to successfully run. Thought of in another way, this is a guarantee from Kubernetes that you’ll always have this amount of either CPU or Memory allocated to the container.

Why would you worry about the minimum amount of resources guaranteed to a pod? Well, it's to help prevent one container from using up all the node’s resources and starving the other containers from CPU or memory. For instance, if I had two containers on a node, one container could request 100% of that node's processor. Meanwhile, the other container would likely not be working very well because the processor is being monopolized by its “noisy neighbour”.

What a resource request can do, is to ensure that at least a small part of that processor’s time is reserved for both containers. This way if there is resource contention, each pod will have a guaranteed, minimum amount of resources in which to still function.

Resource Limits

As you might guess, a resource limit is the maximum amount of CPU or memory that can be used by a container. The limit represents the upper bounds of how much CPU or memory that a container within a pod can consume in a Kubernetes cluster, regardless of whether or not the cluster is under resource contention.

Limits prevent containers from taking up more resources on the cluster than you’re willing to let them.

Common Practices

As a general rule, all containers should have a request for memory and CPU before deploying to a cluster. This will ensure that if resources are running low, your container can still do the minimum amount of work to stay in a healthy state until the resources free up again (hopefully).

Limits are often used in conjunction with requests to create a “guaranteed pod”. This is where the request and limit are set to the same value. In that situation, the container will always have the same amount of CPU available to it, no more or less.

At this point, you may be thinking about adding a high “request” value to make sure you have plenty of resources available for your container. This might sound like a good idea but have dramatic consequences for scheduling on the Kubernetes cluster. If you set a high CPU request, for example, 2 CPUs, then your pod will ONLY be able to be scheduled on Kubernetes nodes that have 2 full CPUs available that aren’t reserved by other pods’ requests. In the example below, the 2 vCPU pods couldn’t be scheduled on the cluster. However, if you were to lower the “request” amount to say 1 vCPU, it could.

Resource Requests and Limits – In Action

CPU Limit Example

Let us try out using a CPU limit on a pod and see what happens when we try to request more CPU than we’re allowed to have. Before we set the limit though, let us look at a pod with a single container under normal conditions. I’ve deployed a resource consumer container in my cluster and by default, you can see that I am using 1m CPU(cores) and 6 Mi(bytes) of memory.

NOTE: CPU is measured in millicores so 1000m = 1 CPU core. Memory is measured in Megabytes.

Ok, now that we have seen the “no-load” state, let us add some CPU load by making a request to the pod. Here, we increased the CPU usage on the container to 400 millicores.

After the metrics start coming in, you can see that we got roughly 400m used on the container as you’d expect to see.

Now we have deleted the container and will edit the deployment manifest so that it has a limit on CPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: resource-consumer
  name: resource-consumer
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      run: resource-consumer
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        run: resource-consumer
    spec:
      containers:
      - image: theithollow/resource-consumer:v1
        imagePullPolicy: IfNotPresent
        name: resource-consumer
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "300Mi"
            cpu: "300m"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

After redeploying the container and again increasing the CPU load to 400m, we can see that the container is throttled to 300m instead. We have effectively “limited” the resources the container could consume from the cluster.

CPU Requests Example

Next, we deployed two pods into the Kubernetes cluster and those pods are on the same worker node for a simple example of contention. We have got a guaranteed pod that has 1000m CPU set as a limit but also as a request. The other pod is unbounded, meaning there is no limit on how much CPU it can utilize.

After the deployment, each pod is really not using any resources as you can see here.

We make a request to increase the load on the non-guaranteed pod.

And if we look at the container's resources you can see that even though the container wants to use a 2000m CPU, it is actually using a 1000m CPU only. The reason for this is that the guaranteed pod is guaranteed a 1000m CPU, whether it is actively using that CPU or not.

Summary

Kubernetes uses resource requests to set a minimum amount of resources for a given container so that it can be used if it needs it. You can also set a resource limit to set the maximum amount of resources a pod can utilize.

Taking these two concepts and using them together can ensure that your critical pods always have the resources that they need to stay healthy. They can also be configured to take advantage of shared resources within the cluster.

Be careful setting resource requests too high so your Kubernetes scheduler can still schedule these pods. Good luck!

Deploying DIGIT Services

Watch the video below to access information on how to deploy DIGIT services on Kubernetes, and prepare deployment manifests for various services along with their configurations and secrets. etc. It also discusses the maintenance of environment-specific changes.

Deployment Architecture

On this page:

Sample Kubernetes architecture
DIGIT deployment architecture
CI/CD flow
Deployment scripts

Overview

This section contains architectural details about DIGIT deployment. It discusses the various activities in a sequence of steps to provision required infra and deploy DIGIT.

Sample Kubernetes Architecture

DIGIT Deployment Architecture

CI/CD Flow

Every code commit is well-reviewed and squash merge to branches through Pull Requests.
Trigger the CI Pipeline that ensures code quality, vulnerability assessments, and CI tests before building the artefacts.
Artefact is version controlled based on Semantic versioning based on the nature of the change.
After successful CI, Jenkins bakes the Docker Images with the versioned artefacts and pushes the baked Docker image to Docker Registry.
Deployment Pipeline pulls the built Image and pushes it to the corresponding environment.

Deployment Scripts

As all the DIGIT services are containerized and deployed on Kubernetes, we need to prepare deployment manifests. The same can be found here.
DIGIT has built helm charts using the standard helm approach to ease managing the service-specific configs, customisations, switch/toggle, secrets, etc.
Golang base Deployment script that reads the values from the helm charts template and deploys into the cluster.
Each env will have one master yaml template that will have the definition of all the services to be deployed, and their dependencies like Config, Env, Secrets, DB Credentials, Persistent Volumes, Manifest, Routing Rules, etc.

Routing Traffic

On this page:

Overview
Terminologies used
Ingress
Pre-requisites
Ingress resource
Ingress rules

Overview

In Kubernetes, an Ingress is an object that allows access to your Kubernetes services from outside the Kubernetes cluster. You configure access by creating a collection of rules that define which inbound connections reach which services.

This lets you consolidate your routing rules into a single resource. For example, you might want to send requests to example.com/api/v1/ to an api-v1 service, and requests to example.com/api/v2/ to the api-v2 service. With an Ingress, you can easily set this up without creating a bunch of LoadBalancers or exposing each service on the Node.

An API object that manages external access to the services in a cluster, typically HTTP. Ingress may provide load balancing, SSL termination and name-based virtual hosting.

Terminology

For clarity, this guide defines the following terms:

Node: A worker machine in Kubernetes, part of a cluster.
Cluster: A set of Nodes that run containerized applications managed by Kubernetes. For this example, and in most common Kubernetes deployments, nodes in the cluster are not part of the public internet.
Edge router: A router that enforces the firewall policy for your cluster. This could be a gateway managed by a cloud provider or a physical piece of hardware.
Cluster network: A set of links, logical or physical, that facilitate communication within a cluster according to the Kubernetes networking model.
Service: A Kubernetes Service that identifies a set of Pods using label selectors. Unless mentioned otherwise, Services are assumed to have virtual IPs only routable within the cluster network.

Ingress Overview

Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.

  internet        | [ Ingress ]   --|-----|--   [ Services ]

An Ingress may be configured to give Services externally-reachable URLs, load balance traffic, terminate SSL / TLS, and offer name based virtual hosting. An Ingress controller is responsible for fulfilling the Ingress, usually with a load balancer, though it may also configure your edge router or additional frontends to help handle the traffic.

An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type Service.Type=NodePort or Service.Type=LoadBalancer.

Pre-requisites

You must have an ingress controller to satisfy an Ingress. Only creating an Ingress resource has no effect.

You may need to deploy an Ingress controller such as ingress-nginx. You can choose from a number of Ingress controllers.

Ideally, all Ingress controllers should fit the reference specification. In reality, the various Ingress controllers operate slightly differently.

Ingress Resource

An Ingress resource example:

apiVersion: extensions/v1beta1kind: Ingressmetadata:  annotations:    kubernetes.io/ingress.class: nginx  name:  service1  namespace: egovspec:  rules:  - host: foo.bar.com    http:      paths:      - backend:          serviceName:  service1          servicePort: 8080        path: /foo  tls:  - hosts:    - foo.bar.com    secretName: foo.bar.com-tls-certs

As with all other Kubernetes resources, an Ingress needs apiVersion, kind, and metadata fields. The name of an Ingress object must be a valid DNS subdomain name. For general information about working with config files, see deploying applications, configuring containers, managing resources. Ingress frequently uses annotations to configure some options depending on the Ingress controller, an example of which is the rewrite-target annotation. Different Ingress controller support different annotations. Review the documentation for your choice of Ingress controller to learn which annotations are supported.

The Ingress spec has all the information needed to configure a load balancer or proxy server. Most importantly, it contains a list of rules matched against all incoming requests. Ingress resource only supports rules for directing HTTP(S) traffic.

Ingress Rules

Each HTTP rule contains the following information:

An optional host. In this example, no host is specified, so the rule applies to all inbound HTTP traffic through the IP address specified. If a host is provided (for example, foo.bar.com), the rules apply to that host.
A list of paths (for example, /testpath), each of which has an associated backend defined with a service.name and a service.port.name or service.port.number. Both the host and path must match the content of an incoming request before the load balancer directs traffic to the referenced Service.
A backend is a combination of Service and port names as described in the Service doc or a custom resource backend by way of a CRD. HTTP (and HTTPS) requests to the Ingress that matches the host and path of the rule are sent to the listed backend.

A default backend is often configured in an Ingress controller to service any requests that do not match a path in the spec.

Learn about the Ingress API
Learn about Cert-manager

Backbone Deployment

On this page:

Overview

Once the cluster is ready and healthy you can start deploying backbones services.

Deploy configuration and deployment in the following Services Lists

Backbone (Redis, ZooKeeper-v2, Kafka-v2,elasticsearch-data-v1, elasticsearch-client-v1, elasticsearch-master-v1)
Gateway (Zuul, nginx-ingress-controller)

Pre-requisites

Understanding of VM Instances, LoadBalancers, SecurityGroups/Firewalls, nginx, DB Instances, and data volumes.
Experience with Kubernetes, Docker, Jenkins, helm, golang, Infra-as-code.

Deployment Steps

Deploy configuration and deployment backbone services:

Modify the global domain and set namespaces create to true

Modify the below-mentioned changes for each backbone service:

Eg. For Kafka-v2

If you are using AWS as a cloud provider, change the respective volume ids and zones. (You will get the volume ids and zone details from either a remote state bucket or from the AWS portal).

Eg. Kafka-v2

If you are using Azure cloud provider, change the diskName and diskUri. (You will get the volume ids and zone details from either the remote state bucket or from the Azure portal)

Eg. Kafka-v2

If you are using ISCSI , change the targetPortal and iqn.

Deploy the backbone services using the go command

cd /eGov-infraOps/egov-deployergo run main.go deploy -e dev -p -c 'kafka-v2,redis,zookeeper-v2,elasticsearch-data-v1,elasticsearch-master-v1,playground,cert-manager,kafka-connect,kafka-connect-restart-tasks,kibana-v1,nginx-ingress'

Modify the “dev” environment name with your respective environment name.

Flags:

e --- Environment name
p --- Print the manifest
c --- Enable Cluster Configs

Check the status of pods

kubectl get pods --all-namespaces

Security Practices

DIGIT being a container-based platform and orchestrated on Kubernetes, this page discusses some key security practices to protect the infrastructure

On this page:

Introduction

Security is always a difficult subject to approach either by lack of experience; or by the fact you should know when the level of security is right for what you have to secure.

However, it’s quite difficult to avoid, a certain infatuation from different categories of people to try to hack the systems.

Infra Security

1. Update to the latest version

There aren’t only bug fixes in each release but also new security measures to require advantage of them, we recommend working with the newest stable version.

2. Enable role-based access control (RBAC)

3. Use namespaces to set security boundaries

To get in-depth knowledge of Kubernetes, enrol for a live demo on the .

4. Separate sensitive workloads

Separation can be done using node pools (in the cloud or on-premises), as well as Kubernetes controlling mechanisms, such as namespaces, taints, tolerations, and others.

5. Protect access to cloud service metadata

The GKE metadata concealment function changes the mechanism for deploying the cluster in such how that avoids such a drag. And we recommend using it until a permanent solution is implemented.

6. Create and define cluster network policies

Once everything is ready, start with simple default network policies — for example, blocking (by default) traffic from other namespaces.

7. Set the Pod Security Policy for the cluster

In the beginning, you might want to disable the NET_RAW capability in containers to protect yourself from certain types of spoofing attacks.

8. Work on node security

To improve host security, you can follow these steps:

Ensure that the host is securely and correctly configured. One way is CIS Benchmarks; Many products have an auto checker that automatically checks the system for compliance with these standards.
Monitor the network availability of important ports. Ensure that the network is blocking access to the ports used by Kubelet, including 10250 and 10255. Consider restricting access to the Kubernetes API server — with the exception of trusted networks. In clusters that did not require authentication and authorization in the Kubelet API, attackers used to access to such ports to launch cryptocurrency miners.
Minimize administrative access to Kubernetes hosts Access to cluster nodes should in principle be limited: for debugging and solving other problems, as a rule, you can do without direct access to the node.

9. Enable Audit Logging

Managed solution providers (including GKE) provide access to this data in their interfaces and can help you set up notifications in case of authorization failures.

Conclusion

Follow these guidelines for a more secure . Remember that even after the cluster is configured securely, you need to ensure security in other aspects of the configuration and operation of containers. To improve the security of the technology stack, study the tools that provide a central system for managing deployed containers, constantly monitoring and protecting containers and cloud-native applications.

Readiness & Liveness

Overview of various probes that we can setup to ensure the service deployment and the availability of the service is ensured automatically.

On this page:

Probes overview
Kubernetes probes
Readiness probes
Liveness probes
Startup probes
Configuring probes actions
Best practices
Tools

Probes Overview

Kubernetes Probes

All the probes have the following parameters:

initialDelaySeconds : number of seconds to wait before initiating liveness or readiness probes
periodSeconds: how often to check the probe
timeoutSeconds: number of seconds before marking the probe as timing out (failing the health check)
successThreshold : minimum number of consecutive successful checks for the probe to pass
failureThreshold : number of retries before marking the probe as failed. For liveness probes, this will lead to the pod restarting. For readiness probes, this will mark the pod as unready.

Readiness Probes

Liveness Probes

Liveness probes are used to restart unhealthy containers. The Kubelet periodically pings the liveness probe, determines the health, and kills the pod if it fails the liveness check.

Startup Probes

Configuring Probe Actions

Now that we understand the different types of probes, we can examine the three distinct ways to configure each probe.

HTTP

HTTP probes take in additional parameters:

host : hostname to connect to (default: pod’s IP)
scheme : HTTP (default) or HTTPS
path : path on the HTTP/S server
httpHeaders : custom headers if you need header values for authentication, CORS settings, etc
port : name or number of the port to access the server

livenessProbe:
   httpGet:
     path: /healthz
     port: 8080

TCP

readinessProbe:
   tcpSocket:
     port: 21

Command

readinessProbe:
   exec:
     command: ["/bin/sh", "-ec", "vault status -tls-skip-verify"]

Best Practices

The exact parameters for the probes depend on your application, but here are some general best practices to get started:

For older (≤ 1.15) Kubernetes clusters, use a readiness probe with an initial delay to deal with the container startup phase (use p99 times for this). But make this check lightweight since the readiness probe will execute throughout the entire lifecycle of the pod. We don’t want the probe to time out because the readiness check takes a long time to compute.
For newer (≥ 1.16) Kubernetes clusters, use a startup probe for applications with unpredictable or variable startup times. The startup probe may share the same endpoint (e.g. /healthz ) as the readiness and liveness probes, but set the failureThreshold higher than the other probes to account for longer start times, but more reasonable time to failure for liveness and readiness checks.
Readiness and liveness probes may share the same endpoint if the readiness probes aren’t used for other signalling purposes. If there’s only one pod (i.e. using a Vertical Pod Autoscaler), set the readiness probe to address the startup behaviour and use the liveness probe to determine health. In this case, marking the pod unhealthy means downtime.
Readiness checks can be used in various ways to signal system degradation. For example, if the application loses connection to the database, readiness probes may be used to temporarily block new requests and allow the system to reconnect. It can also be used to load balance work to other pods by marking busy pods as not ready.

In short, well-defined probes generally lead to better resilience and availability. Be sure to observe the startup times and system behaviour to tweak the probe settings as the applications change.

Tools

Polaris: a resource analysis tool with a nice dashboard that can also be used as a validating webhook or CLI tool.
Kube-score: a static code analysis tool that works with Helm, Kustomize, and standard YAML files.
Popeye: read-only utility tool that scans Kubernetes clusters and reports potential issues with configurations.