This section addresses the key areas of concern and its potential remedial steps
Loading...
Loading...
Loading...
We will discuss distributed tracing system Jaeger and how it helps in troubleshooting DIGIT.
Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
OpenTracing has been a key capability when it comes to microservices-based distributed systems like DIGIT. We’ll start with the introduction of OpenTracing, explaining what it is and why it is important We shall also set up Jaeger and learn to use it for monitoring and troubleshooting.
Microservice Architecture has now become the obvious choice for application developers. In the Microservice Architecture, a monolithic application is broken down into a group of independently deployed services. In simple words, an application is more like a collection of microservices. When we have millions of such intertwined microservices working together, it’s almost impossible to map the inter-dependencies of these services and understand the execution of a request.
In case of a failure in a monolithic application, it is much easier to understand the path of a transaction and do the root cause analysis with the help of logging frameworks. But in a microservice architecture, logging alone fails to deliver the complete picture.
Is this service the first one in the call chain? How do I span all these services to get insight into the application? With questions like these, it becomes a significantly larger problem to debug a set of interdependent distributed services in comparison to a single monolithic application, making OpenTracing more and more popular.
The OpenTracing API provides a standard, vendor-neutral framework for instrumentation. This means that if a developer wants to try out a different distributed tracing system, then instead of repeating the whole instrumentation process for the new distributed tracing system, the developer can simply change the configuration of the Tracer.
Here are some basic terminologies of Opentracing:
Span — It represents a logical unit of work that has an operation name, the start time of the operation, and the duration.
Trace — A Trace tells the story of a transaction or workflow as it propagates through a distributed system. It is simply a set of spans sharing a TraceID. Each component in a distributed system contributes its own span.
OpenTracing is a way for services to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.”
Let us take the example of a service like egov-property service (or any other DIGIT service). A service like this requires many other microservices to check that the location is available, proper payment credentials are received, and enough details exist for the ULB to process the property tac. If either one of those microservice fails, then the entire transaction fails. In such a case, having logs just for the main property service wouldn’t be very useful for debugging. However, if you were able to analyze each service you wouldn’t have to scratch your head to troubleshoot which microservice failed and what made it fail.
In real life, applications are even more complex and with the increasing complexity of applications, monitoring the applications has been a tedious task. Opentracing helps us to easily monitor:
Spans of services
Time taken by each service
Latency between the services
Hierarchy of services
Errors or exceptions during execution of each service.
Jaeger is used for monitoring and troubleshooting microservices-based distributed systems, including:
Distributed transaction monitoring
Performance and latency optimization
Root cause analysis
Service dependency analysis
Distributed context propagation
Jaeger Client Libraries — Jaeger clients are language-specific implementations of the OpenTracing API.
Agent — The Jaeger agent is a network daemon that listens for spans sent over UDP, which it batches and sends to the collector. It is designed to be deployed to all hosts as an infrastructure component. The agent abstracts the routing and discovery of the collectors away from the client.
Collector — The Jaeger collector receives traces from Jaeger agents and runs them through a processing pipeline. Currently, the pipeline validates traces, indexes them, performs transformations, and finally, stores them. Jaeger’s storage is a pluggable component which currently supports Cassandra, Elasticsearch, and Kafka.
Query — Query is a service that retrieves traces from storage and hosts a UI to display them.
Ingester — Ingester is a service that reads from Kafka topic and writes to another storage backend (Cassandra, Elasticsearch).
First, install Jaeger Client on your machine:
Now, let’s run Jaeger backend as an all-in-one Docker image. The image launches the Jaeger UI, collector, query, and agent:
TIP: To check if the docker container is running, use: Docker ps.
Once the container starts, open http://localhost:16686/ to access the Jaeger UI. The container runs the Jaeger backend with an in-memory store, which is initially empty, so there is not much we can do with the UI right now since the store has no traces.
1. Create a Python program to create Traces
Let’s generate some traces using a simple python program. You can clone the Jaeger-Opentracing repository given below for a sample program that is used in this blog_._
The Python program takes a movie name as an argument and calls three functions that get the cinema details, movie showtime details, and finally, book a movie ticket.
It creates some random delays in all the functions to make it more interesting, as, in reality, the functions would take a certain time to get the details. Also, the function throws random errors to give us a feel of how the traces of a real-life application may look like in case of failures.
Here is a brief description of how OpenTracing has been used in the program:
Initializing a tracer:
Using the tracer instance:
Starting new child spans using start_span:
Using Tags:
Using Logs:
2. Run the python program
Now, check your Jaeger UI, you can see a new service “booking” added. Select the service and click on “Find Traces” to see the traces of your service. Every time you run the program a new trace will be created.
You can now compare the duration of traces through the graph shown above. You can also filter traces using “Tags” section under “Find Traces”. For example, Setting “error=true” tag will filter out all the jobs that have errors.
To view the detailed trace, you can select a specific trace instance and check details like the time taken by each service, errors during execution and logs.
In this blog, we’ve described the importance and benefits of OpenTracing, one of the core pillars of modern applications. We also explored how distributed tracer Jaeger collect and store traces while revealing inefficient portions of our applications. It is fully compatible with OpenTracing API and has a number of clients for different programming languages including Java, Go, Node.js, Python, PHP, and more.
A good/meaningful logging system is a system that everyone can use and understand. How Digit Logging is configured.
The logging concern is one of the most complicated parts of our microservices. Microservices should stay as pure as possible. So, we shouldn’t use any library if we can (like logging, monitoring, resilience library dependencies). It means, every dependency can change any time and then usually, we must do that change for the other microservices. There is a lot of work here. Instead of that, we need to handle these dependencies with a more generic way. For logging, the way is the stdout logging. For most of the programming languages, logging to stdout is the default way and probably no additional change required at the beginning.
In MSA, services interact with each other through an HTTP endpoint. End users only know about API Contract (Request/Response), and don’t know how exactly do services work.
“A service” will call “B service” and “C service”. Once the request chain is complete, “X service” might be able to respond to the end-user who initiated the request. Let’s say you already have a logging system that captures error logs for each service. If you find an error in “X service”, it would be better if you know exactly whether the error was caused by “A service” or “C service”. If the error is informative enough for you. But if that isn’t the case, the correct way to reproduce that error is to know all requests and services that involved. Once you implement Correlation Id, you only need to look for that ID in the logging system. And you will get all logs from services that were part of the main request to the system.
The application usually adds more features as time goes by. Go along with this, there are so many services will be created new (my project started with 12 services, and now we have 20). These services could be hosted on different servers. Let’s imagine, what will happen if you store logging on different servers? — you will have to access to each individual server to read logs, then trying to correlate problems. Instead, you have everything that you need in one dashboard by centralized logging data in one place. If would save your time so much.
Applying MSA allows you to use different technology stacks for each service. For example, you can use .Net Core for Buy service, Java for Shipping service and Python for Inventory service. However, it also impacts to log format of each service. It’s even more complicated as some logs need more fields than others.
Based on my experience, I’d like to suggest JSON as a standard format for logging data. JSON allows you to have multiple levels for your data so that, when necessary, you can get more semantic info in a single log event.
When we see the log one would want to know everything! What? When? Where?… even Who? — don’t think that we need to know exactly which person causes the problem to blame them :) Because, contacting the right person also helps you to resolve issues quicker. You can log all the data that you get. However, let us give some specific fields. This might help to figure out what really need to log.
When? — Time (with full date format): It doesn’t require using UTC format. But the timezone has to be the same for everyone that needs to look at the logs.
What? — Stack errors: All exception objects should be passed to the logging system.
Where? — Besides service name as we using MSA. We also need function name, class or file name where the error occurred. — Don’t guess anything, it might waste your time.
Who? — The IP address of the client and user name if any. Make sure don’t use this information to blame your teammates :)
Bear in mind that, logging system is not only for developers. It’s also used by others (system admin, tester…) So, you should consider logging data that everyone can use and understand.
Sometimes, you log requests from end-users that contain PII. We need to be careful, it might violate GDPR.
There are two techniques for logging in MSA. Each service will implement the logging mechanism by itself and using one logging service for all services. Both of them have Good and Not Good points. — I’m using both these approaches in my project.
Implement Logging in each service
With this approach, we can easily define the logging strategy/library for each service. For example, with service written by java we can use Log4j.
The problem with this approach is that it requires each service to implement its own logging methods. Not only is this redundant, but it also adds complexity and increases the difficulty of changing logging behaviour across multiple services.
2. Implement central Logging service
If you don’t want to implement logging in each service separately. You can consider implementing a central service for logging. This service will help you with processing, formatting and storing log data.
This approach might help to reduce the complexity of your application. However, you might get lost your log data if that service is down.
Prometheus is an open-source system monitoring and alerting toolkit originally built at SoundCloud.
prometheus-operator chart includes multiple components and is suitable for a variety of use-cases.
The default installation is intended to suit monitoring a kubernetes cluster the chart is deployed onto. It closely matches the kube-prometheus project.
service monitors to scrape internal kubernetes components
kube-apiserver
kube-scheduler
kube-controller-manager
etcd
kube-dns/coredns
kube-proxy
With the installation, the chart also includes dashboards and alerts.
Deployment steps
Add environment variable to the respective env config file
Update the configs branch (like for qa.yaml added qa branch)
Add monitoring-dashboards folder to respective configs branch.
Enable the nginx-ingress monitoring and redeploy the nginx-ingress.
Add alertmanager secret in respective.secrets.yaml
If you want you can change the slack channel and other details like group_wait , group_interval and repeat_interval according to your values.
Deploy the prometheus-operator using go cmd or deploy using Jenkins.
To create a new panel in the existing dashboard
Login to dashboard and click on add panel
Set all required queries and apply the changes. Export the JSON file by clicking on t the save dashboard
Update the existing *-dashboard.json file from configs monitoring-dashboards folder with a newly exported JSON file.