Operational Best Practices
Introduction
This article provides a DIGIT Infra overview, Guidelines for Operational Excellence while DIGIT is deployed on SDC, NIC or any commercial clouds along with the recommendations and segregation of duties (SoD). It helps to plan the procurement and build the necessary capabilities to deploy and implement DIGIT.
In shared control, the state program team/partners can consider these guidelines and must provide their own control implementation to the state’s cloud infrastructure and partners for standard and smooth operational excellence.
Operational Recommendations
DIGIT strongly recommends Site reliability engineering (SRE) principles as a key means to bridge development and operations by applying a software engineering mindset to the system and IT administration topics. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Monitoring Tools Recommendations: Commercial clouds like AWS, Azure and GCP offer sophisticated monitoring solutions across various infra levels like CloudWatch and StackDriver, in the absence of such managed services to monitor we can look at various best practices and tools listed below which helps debugging and troubleshooting efficiently.
Key Standard Operating Procedures (SOPs)
Segregation of duties and responsibilities.
SME and SPOCs for L1.5 support along with the SLAs defined.
Ticketing system to manage incidents, converge and collaborate on various operational issues.
Monitoring dashboards at various levels like Infrastructure, Networks and applications.
Transparency of monitoring data and collaboration between teams.
Periodic remote sync-up meetings, acceptance and attendance to the meeting.
Ability to see stakeholders' availability of calendar time to schedule meetings.
Periodic (weekly, monthly) summary reports of the various infra, operations incident categories.
Communication channels and synchronization on a regular basis and also upon critical issues, changes, upgrades, releases etc.
Segregation of Duties
While DIGIT is deployed at state cloud infrastructure, it is essential to identify and distinguish the responsibilities between Infrastructure, Operations and Implementation partners. Identify these teams and assign SPOC, define responsibilities and Incident management followed to visualize, track issues and manage dependencies between teams. Essentially these are monitored through dashboards and alerts are sent to the stakeholders proactively. eGov team can provide consultation and training on a need basis depending on any of the below categories.
and State program team - Refers to the owner for the whole DIGIT implementation, application rollouts, and capacity building. Responsible for identifying and synchronizing the operating mechanism between the below teams.
Implementation partner - Refers to the DIGIT Implementation, application performance monitoring for errors, logs scrutiny, TPS on peak load, distributed tracing, DB queries analysis, etc.
Operations team - this team could be an extension of the implementation team and is responsible for DIGIT deployments, configurations, CI/CD, change management, traffic monitoring and alerting, log monitoring and dashboard, application security, DB Backups, application uptime, etc.
**State IT/Cloud team -**Refers to state infra team for the Infra, network architecture, LAN network speed, internet speed, OS Licensing and upgrade, patch, compute, memory, disk, firewall, IOPS, security, access, SSL, DNS, data backups/recovery, snapshots, capacity monitoring dashboard.
Skills Required to Set up, Operate and Maintain DIGIT on SDC
Tools/Skills | Specification | Weightage (1-5) | Yes/No |
---|---|---|---|
System Administration | Linux Administration, troubleshooting, OS Installation, Package Management, Security Updates, Firewall configuration, Performance tuning, Recovery, Networking, Routing tables, etc | 4 | |
Containers/Dockers | Build/Push docker containers, tune and maintain containers, Startup scripts, Troubleshooting dockers containers. | 2 | |
Kubernetes | Setup kubernetes cluster on bare-metal and VMs using kubeadm/kubespary, terraform, etc. Strong understanding of various kubernetes components, configurations, kubectl commands, RBAC. Creating and attaching persistent volumes, log aggregation, deployments, networking, service discovery, Rolling updates. Scaling pods, deployments, worker nodes, node affinity, secrets, configMaps, etc.. | 3 | |
Database Administration | Setup PostGres DB, Set up read replicas, Backup, Log, DB RBAC setup, SQL Queries | 3 | |
Docker Registry | Setup docker registry and manage | 2 | |
SCM/Git | Source Code management, branches, forking, tagging, Pull Requests, etc. | 4 | |
CI Setup | Jenkins Setup, Master-slave configuration, plugins, jenkinsfile, groovy scripting, Jenkins CI Jobs for Maven, Node application, deployment jobs, etc. | 4 | |
Artifact management | Code artifact management, versioning | 1 | |
Apache Tomcat | Web server setup, configuration, load balancing, sticky sessions, etc | 2 | |
WildFly JBoss | Application server setup, configuration, etc. | 3 | |
Spring Boot | Build and deploy spring boot applications | 2 | |
NodeJS | NPM Setup and build node applications | 2 | |
Scripting | Shell scripting, python scripting. | 4 | |
Log Management | Aggregating system, container logs, troubleshooting. Monitoring Dashboard for logs using prometheus, fluentd, Kibana, Grafana, etc. | 3 | |
WordPress | Multi-tenant portal setup and maintain | 2 |
Last updated