Troubleshooting

This section addresses the key areas of concern and their potential remedial steps.

Distributed Tracing

This section discusses the distributed tracing system Jaeger and how it helps in troubleshooting DIGIT.

Introduction

Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

OpenTracing has been a key capability for microservices-based distributed systems like DIGIT. We’ll start with an introduction to OpenTracing, explaining what it is and why it is important. We shall also set up Jaeger and learn to use it for monitoring and troubleshooting.

Drift to Microservice Architecture

Microservice architecture has now become a common choice for application developers. In a microservice architecture, a monolithic application is broken down into a group of independently deployed services. In simple words, an application is more like a collection of microservices. When many such intertwined microservices work together, it’s almost impossible to map their inter-dependencies and understand the execution of a request.

In case of a failure in a monolithic application, it is much easier to understand the path of a transaction and do the root cause analysis with the help of logging frameworks. But in a microservice architecture, logging alone fails to deliver the complete picture.

Is this service the first one in the call chain? How do I span all these services to get insight into the application? With questions like these, it becomes a significantly larger problem to debug a set of interdependent distributed services in comparison to a single monolithic application, making OpenTracing more and more popular.

OpenTracing

The OpenTracing API provides a standard, vendor-neutral framework for instrumentation. This means that if a developer wants to try out a different distributed tracing system, then instead of repeating the whole instrumentation process for the new distributed tracing system, the developer can simply change the configuration of the Tracer.

Here are some basic OpenTracing terms:

Span — It represents a logical unit of work that has an operation name, the start time of the operation, and the duration.

Trace — A Trace tells the story of a transaction or workflow as it propagates through a distributed system. It is simply a set of spans sharing a TraceID. Each component in a distributed system contributes its own span.
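
The two terms above can be sketched as a small data model. This is only an illustration under assumptions: the Span class, its field names, and the sample operation names are hypothetical and are not Jaeger or OpenTracing classes. It shows that a trace is simply the set of spans that share a trace ID.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """A logical unit of work (illustrative model, not a Jaeger class)."""
    trace_id: str
    span_id: str
    operation_name: str
    start_time: float                 # seconds since some reference point
    duration: float                   # seconds
    parent_id: Optional[str] = None   # None marks the root span


def group_into_traces(spans):
    """Group spans by trace_id: a trace is the set of spans sharing that ID."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def trace_duration(spans):
    """Time from the earliest span start to the latest span finish."""
    start = min(s.start_time for s in spans)
    end = max(s.start_time + s.duration for s in spans)
    return end - start


# Hypothetical spans reported by different services
spans = [
    Span("t1", "a", "create-property", 0.0, 0.9),
    Span("t1", "b", "validate-location", 0.1, 0.2, parent_id="a"),
    Span("t1", "c", "collect-payment", 0.4, 0.4, parent_id="a"),
    Span("t2", "d", "search-property", 5.0, 0.3),
]

traces = group_into_traces(spans)
for trace_id, members in traces.items():
    names = [s.operation_name for s in members]
    print(trace_id, names, round(trace_duration(members), 3))
```

The parent_id field is what lets a tracing UI draw the hierarchy of services within one trace.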

OpenTracing is a way for services to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.”

Let us take the example of a service like the egov-property service (or any other DIGIT service). A service like this requires many other microservices to check that the location is available, proper payment credentials are received, and enough details exist for the ULB to process the property tax. If any one of those microservices fails, then the entire transaction fails. In such a case, having logs just for the main property service wouldn’t be very useful for debugging. However, if you were able to analyze each service, you wouldn’t have to scratch your head to work out which microservice failed and what made it fail.

In real life, applications are even more complex, and as their complexity grows, monitoring them becomes a tedious task. OpenTracing helps us to easily monitor:

Spans of services
Time taken by each service
Latency between the services
Hierarchy of services
Errors or exceptions during execution of each service.

Jaeger: A Distributed Tracing System by Uber

Jaeger is used for monitoring and troubleshooting microservices-based distributed systems, including:

Distributed transaction monitoring
Performance and latency optimization
Root cause analysis
Service dependency analysis
Distributed context propagation

Major Components of Jaeger

Jaeger Client Libraries — Jaeger clients are language-specific implementations of the OpenTracing API.

Agent — The Jaeger agent is a network daemon that listens for spans sent over UDP, which it batches and sends to the collector. It is designed to be deployed to all hosts as an infrastructure component. The agent abstracts the routing and discovery of the collectors away from the client.

Collector — The Jaeger collector receives traces from Jaeger agents and runs them through a processing pipeline. Currently, the pipeline validates traces, indexes them, performs transformations, and finally, stores them. Jaeger’s storage is a pluggable component which currently supports Cassandra, Elasticsearch, and Kafka.

Query — Query is a service that retrieves traces from storage and hosts a UI to display them.

Ingester — Ingester is a service that reads from Kafka topic and writes to another storage backend (Cassandra, Elasticsearch).

Running Jaeger in a Docker Container

First, install the Jaeger client library for your language on your machine.

Now, let’s run the Jaeger backend as an all-in-one Docker image. The image launches the Jaeger UI, collector, query, and agent.

TIP: To check whether the Docker container is running, use: docker ps.

Once the container starts, open http://localhost:16686/ to access the Jaeger UI. The container runs the Jaeger backend with an in-memory store, which is initially empty, so there is not much we can do with the UI right now since the store has no traces.

Creating Traces on Jaeger UI

1. Create a Python program to create Traces

Let’s generate some traces using a simple Python program. You can clone the Jaeger-Opentracing repository given below for the sample program used in this blog.

The Python program takes a movie name as an argument and calls three functions that get the cinema details, movie showtime details, and finally, book a movie ticket.

It adds random delays to all the functions to make it more interesting, since in reality the functions would take a certain time to get the details. The functions also throw random errors to give us a feel for what the traces of a real-life application may look like in case of failures.

Here is a brief description of how OpenTracing has been used in the program:

Initializing a tracer:
Using the tracer instance:
Starting new child spans using start_span:
Using Tags:
Using Logs:
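
The five steps above can be sketched without any external dependency. The snippet below is not the sample program from the repository and does not use the Jaeger client: MiniTracer and MiniSpan are hypothetical in-memory stand-ins, and the function names and tag keys are made up for illustration. The method names start_span, set_tag and log_kv follow the OpenTracing Python API so that the flow reads the same way.

```python
import time
import uuid
from contextlib import contextmanager


class MiniSpan:
    """In-memory stand-in for a tracing span (hypothetical, for illustration)."""

    def __init__(self, trace_id, operation_name, parent=None):
        self.trace_id = trace_id
        self.span_id = uuid.uuid4().hex[:8]
        self.operation_name = operation_name
        self.parent_id = parent.span_id if parent else None
        self.tags = {}
        self.logs = []
        self.start_time = time.time()
        self.duration = None

    def set_tag(self, key, value):
        self.tags[key] = value

    def log_kv(self, fields):
        self.logs.append({"timestamp": time.time(), **fields})

    def finish(self):
        self.duration = time.time() - self.start_time


class MiniTracer:
    """Collects finished spans in a list instead of sending them to an agent."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.finished_spans = []

    @contextmanager
    def start_span(self, operation_name, child_of=None):
        # A child span joins its parent's trace; a root span starts a new one.
        trace_id = child_of.trace_id if child_of else uuid.uuid4().hex
        span = MiniSpan(trace_id, operation_name, parent=child_of)
        try:
            yield span
        except Exception as exc:
            span.set_tag("error", True)
            span.log_kv({"event": "error", "message": str(exc)})
            raise
        finally:
            span.finish()
            self.finished_spans.append(span)


# Initializing a tracer for the "booking" service
tracer = MiniTracer("booking")


def check_cinema(movie, parent):
    # Starting a new child span
    with tracer.start_span("CheckCinema", child_of=parent) as span:
        cinemas = ["Cinema 1", "Cinema 2"]
        span.set_tag("cinema.count", len(cinemas))                # using tags
        span.log_kv({"event": "cinemas-found", "movie": movie})   # using logs
        return cinemas


def book_ticket(movie, cinema, parent):
    with tracer.start_span("BookShow", child_of=parent) as span:
        span.set_tag("cinema", cinema)
        if not movie:
            raise ValueError("movie name is required")
        return f"Booked {movie} at {cinema}"


# Using the tracer instance: the root span covers the whole booking request
with tracer.start_span("booking") as root:
    root.set_tag("movie", "Godfather")
    cinemas = check_cinema("Godfather", root)
    result = book_ticket("Godfather", cinemas[0], root)

for span in tracer.finished_spans:
    print(span.operation_name, span.parent_id, span.tags)
```

With a real tracer, the finished spans would be sent to the Jaeger agent rather than kept in a list, but the calling code would look much the same.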

2. Run the Python program

Now check your Jaeger UI: a new service, “booking”, has been added. Select the service and click “Find Traces” to see its traces. Every time you run the program, a new trace is created.

You can now compare the duration of traces through the graph shown above. You can also filter traces using the “Tags” field under “Find Traces”. For example, setting the tag “error=true” shows only the traces that contain errors.

To view the detailed trace, you can select a specific trace instance and check details like the time taken by each service, errors during execution and logs.

Conclusion

In this blog, we’ve described the importance and benefits of OpenTracing, one of the core pillars of modern applications. We also explored how the distributed tracer Jaeger collects and stores traces while revealing inefficient portions of our applications. Jaeger is fully compatible with the OpenTracing API and has clients for a number of programming languages, including Java, Go, Node.js, Python, and PHP.

Logging

A good, meaningful logging system is one that everyone can use and understand. This section describes how DIGIT logging is configured.

Introduction

Logging is one of the most complicated concerns in our microservices. Microservices should stay as pure as possible, so we should avoid library dependencies (for logging, monitoring, or resilience) wherever we can. Any dependency can change at any time, and the change usually has to be repeated in every other microservice, which is a lot of work. Instead, we need to handle these concerns in a more generic way. For logging, that way is logging to stdout. In most programming languages, logging to stdout is the default and probably requires no additional change at the beginning.
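
As a minimal sketch of stdout logging, the snippet below uses only Python’s standard logging module. The service name and the message format are assumptions chosen for illustration, not DIGIT’s actual configuration.

```python
import logging
import sys

# Write log records to stdout so the platform can collect them,
# instead of pulling a logging backend into the service itself.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("property-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("service started")
```

Because the service only writes to stdout, the collection and storage of logs can change without touching the service code.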

What is needed to build a meaningful logging system in MSA?

1. Use a Unique Id to correlate Requests

In MSA, services interact with each other through HTTP endpoints. End users only know the API contract (request/response); they don’t know exactly how the services work.

Suppose “A service” calls “B service” and “C service”, and once the request chain is complete, “X service” responds to the end-user who initiated the request. Let’s say you already have a logging system that captures error logs for each service. If you find an error in “X service”, you will want to know whether it was caused by “A service” or “C service”. If the error message is informative enough, that may be all you need. If it isn’t, the reliable way to reproduce the error is to know every request and service that was involved. Once you implement a correlation ID, you only need to search for that ID in the logging system to get all the logs from the services that took part in the main request.
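
One way to sketch a correlation ID in Python is with a context variable and a logging filter. This is an illustration under assumptions: the header name X-Correlation-Id, the handle_request function, and the service name are hypothetical, and the log output goes to an in-memory stream only so that it can be inspected.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def handle_request(headers, logger):
    """Reuse the caller's ID if present, otherwise start a new one."""
    # "X-Correlation-Id" is an assumed header name.
    cid = headers.get("X-Correlation-Id") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        # Headers to pass on to any downstream service call.
        return {"X-Correlation-Id": cid}
    finally:
        correlation_id.reset(token)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationIdFilter())

logger = logging.getLogger("service-a")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

outgoing = handle_request({"X-Correlation-Id": "req-123"}, logger)
print(stream.getvalue().strip())
```

Every service that receives the header logs the same ID, so a single search for that ID returns the whole request chain.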

2. Centralise Logging data in one place

Applications usually gain features over time, and with them come new services (my project started with 12 services, and now we have 20). These services could be hosted on different servers. Imagine what happens if you store logs on different servers: you have to access each server individually to read its logs and then try to correlate the problems. If you centralize logging data in one place instead, everything you need is in one dashboard, which saves a lot of time.

3. Define the format for logging

Applying MSA allows you to use a different technology stack for each service. For example, you can use .NET Core for the Buy service, Java for the Shipping service and Python for the Inventory service. However, this also affects the log format of each service. It gets even more complicated because some logs need more fields than others.

Based on my experience, I’d like to suggest JSON as a standard format for logging data. JSON allows you to have multiple levels for your data so that, when necessary, you can get more semantic info in a single log event.
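
A JSON log format can be sketched with a custom formatter for Python’s standard logging module. The field names (time, level, service, message, context) and the sample order data are assumptions for illustration; the nested context object shows how a single log event can carry multiple levels of data.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Nested object: extra detail that only some events carry.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(event)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("shipping-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning("shipment delayed", extra={"context": {"order": {"id": 42, "items": 3}}})
line = stream.getvalue().strip()
print(line)
```

Services written in other languages can emit the same field names, so the central log store sees one format regardless of the technology stack.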

4. Log useful/meaningful data

When we read a log, we want to know everything: What? When? Where? Even who? Not because we need to know exactly which person caused the problem in order to blame them :), but because contacting the right person helps resolve issues quicker. You can log all the data that you get. However, the specific fields below might help you figure out what really needs to be logged.

When? Time (with full date format). UTC is not required, but the timezone has to be the same for everyone who needs to look at the logs.
What? Stack errors. All exception objects should be passed to the logging system.
Where? Besides the service name (since we are using MSA), we also need the function name and the class or file name where the error occurred. Don’t guess anything; it might waste your time.
Who? The IP address of the client and the user name, if any. Make sure you don’t use this information to blame your teammates :)

Bear in mind that the logging system is not only for developers. It’s also used by others (system admins, testers…). So, you should consider logging data that everyone can use and understand.
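
The when, what, where and who fields above could be collected from an exception roughly as follows. This is a sketch under assumptions: build_error_event, calculate_tax, the service name, the user name and the IP address are all hypothetical, and UTC is used only as one example of a timezone that is the same for everyone.

```python
import traceback
from datetime import datetime, timezone


def build_error_event(exc, service, client_ip, user=None):
    """Collect the when / what / where / who of an exception into one dict."""
    # The last traceback frame is where the exception was actually raised.
    frame = traceback.extract_tb(exc.__traceback__)[-1]
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "where": {
            "service": service,
            "file": frame.filename,
            "function": frame.name,
            "line": frame.lineno,
        },
        "who": {"client_ip": client_ip, "user": user},
    }


def calculate_tax(area):
    """Hypothetical business function that fails on bad input."""
    if area <= 0:
        raise ValueError("area must be positive")
    return area * 2


try:
    calculate_tax(-5)
except ValueError as exc:
    event = build_error_event(exc, "property-service", "203.0.113.7", user="clerk01")

print(event["where"]["function"], event["who"])
```

Taking the location from the traceback rather than typing it into the message keeps the “where” accurate even after the code is refactored.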

5. Be careful about storing personally identifiable information (PII) of your end-users

Sometimes, you log requests from end-users that contain PII. We need to be careful, because this might violate GDPR.
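
One possible safeguard is to mask PII fields before a request is logged. The sketch below assumes a fixed set of field names; that set and the sample request are hypothetical and would need to match your own data.

```python
# Field names treated as PII here are an assumption; adjust to your own data.
PII_FIELDS = {"name", "mobileNumber", "emailId"}


def mask_pii(payload):
    """Return a copy of payload with PII values replaced, at any nesting depth."""
    if isinstance(payload, dict):
        return {
            key: "***" if key in PII_FIELDS else mask_pii(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [mask_pii(item) for item in payload]
    return payload


request = {
    "propertyId": "PT-001",
    "owners": [{"name": "Asha", "mobileNumber": "9999999999"}],
}
safe = mask_pii(request)
print(safe)
```

The original request is left untouched, so the service can keep processing it while only the masked copy reaches the logs.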

Logging approaches in MSA

There are two techniques for logging in MSA: each service implements the logging mechanism by itself, or all services use one logging service. Both have good and not-so-good points. I’m using both approaches in my project.

1. Implement Logging in each service

With this approach, we can easily define the logging strategy/library for each service. For example, for a service written in Java we can use Log4j.

The problem with this approach is that it requires each service to implement its own logging methods. Not only is this redundant, but it also adds complexity and increases the difficulty of changing logging behaviour across multiple services.

2. Implement central Logging service

If you don’t want to implement logging in each service separately, you can consider implementing a central service for logging. This service will help you with processing, formatting and storing log data.

This approach might help to reduce the complexity of your application. However, you might lose your log data if that service is down.
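
The risk of losing log data can be reduced with a local fallback. The sketch below is one possible design, not a description of DIGIT’s setup: CentralLogHandler is hypothetical, and send stands for any callable that delivers one JSON line to the central logging service.

```python
import io
import json
import logging
import sys


class CentralLogHandler(logging.Handler):
    """Forward records to a central log service, with a local fallback.

    `send` is any callable that delivers one JSON string to the central
    service (hypothetical; for example a wrapper around an HTTP POST).
    """

    def __init__(self, send, fallback_stream=None):
        super().__init__()
        self.send = send
        self.fallback_stream = sys.stderr if fallback_stream is None else fallback_stream

    def emit(self, record):
        line = json.dumps({
            "service": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })
        try:
            self.send(line)
        except Exception:
            # Central service unreachable: keep the record locally
            # instead of losing it.
            self.fallback_stream.write(line + "\n")


def unavailable_service(line):
    """Simulates a central logging service that is down."""
    raise ConnectionError("log service is down")


local_copy = io.StringIO()
logger = logging.getLogger("inventory-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(CentralLogHandler(unavailable_service, fallback_stream=local_copy))

logger.error("stock update failed")
print(local_copy.getvalue().strip())
```

In practice the fallback could be stdout or a local file that is shipped later, so that an outage of the central service delays the logs rather than dropping them.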

Monitoring & Alerts

Prometheus is an open-source system monitoring and alerting toolkit originally built at SoundCloud.

The prometheus-operator chart includes multiple components and is suitable for a variety of use-cases.

The default installation is intended for monitoring the Kubernetes cluster that the chart is deployed onto. It closely matches the kube-prometheus project.

The installation includes service monitors to scrape internal Kubernetes components:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- etcd
- kube-dns/coredns
- kube-proxy

With the installation, the chart also includes dashboards and alerts.

Deployment steps

Add the environment variable to the respective env config file.

Update the configs branch (for example, qa.yaml uses the qa branch).

Add the monitoring-dashboards folder to the respective configs branch.
Enable nginx-ingress monitoring and redeploy nginx-ingress.

Add the alertmanager secret in respective.secrets.yaml.
If you want, you can change the Slack channel and other details such as group_wait, group_interval and repeat_interval to suit your needs.

Deploy the prometheus-operator using the go command below or using Jenkins.

go run main.go deploy -e   -c 'prometheus-operator,grafana,prometheues-kafka-exporter'

To create a new panel in the existing dashboard

Set all the required queries and apply the changes. Export the JSON file by clicking on the save dashboard option.

Update the existing *-dashboard.json file in the configs monitoring-dashboards folder with the newly exported JSON file.
