The Indexer Service operates independently and is responsible for all indexing tasks on the DIGIT platform. It processes records from specific Kafka topics and utilizes the corresponding index configuration defined in YAML files by each module.
Objectives:
Efficiently read and process records from Kafka topics.
Retrieve and apply appropriate index configurations from YAML files.
Provide a one-stop framework for indexing data into Elasticsearch.
Support indexing live data, reindexing from one index to another, and indexing legacy data from the data store.
Before you proceed with the configuration, make sure the following prerequisites are met:
Prior knowledge of Java/J2EE
Prior knowledge of SpringBoot
Prior knowledge of Elasticsearch
Prior knowledge of REST APIs and related concepts like path parameters, headers, JSON etc.
Prior knowledge of Kafka and related concepts like Producer, Consumer, Topic etc.
Performs three major tasks: LiveIndex, Reindex and LegacyIndex.
LiveIndex: Indexes the live transaction data on the platform. This keeps the Elasticsearch data in sync with the DB.
Reindex: Indexes data from one index into another. Elasticsearch already provides this feature; the indexer does the same but also supports data transformation.
LegacyIndex: Indexes legacy data from the database tables into Elasticsearch.
Provides flexibility to index the entire object, a part of the object or an entirely different custom object all using one input JSON from modules.
Provides features for customizing index JSON by field mapping, field masking, data enrichment through external APIs and data denormalization using MDMS.
One-stop shop for all Elasticsearch index requirements, with easy-to-write and easy-to-maintain configuration files.
Designed as a consumer to save API overhead. The consumer configs are written from scratch for complete control over consumer behaviour.
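The field mapping and field masking features listed above can be pictured with a minimal sketch. It is not the indexer's implementation: the helper names (`get_path`, `set_path`, `mask_fields`, `build_index_doc`), the sample record and the `XXXX` placeholder are all assumptions made for illustration, and only plain dotted paths such as `$.a.b` are handled rather than full JSONPath.

```python
import copy


def _keys(path):
    # Simplification: only plain '$.a.b.c' paths; no wildcards, filters or array indexes.
    return [k for k in path.lstrip("$").split(".") if k]


def get_path(doc, path):
    """Read the value at a simplified dotted path."""
    node = doc
    for key in _keys(path):
        node = node[key]
    return node


def set_path(doc, path, value):
    """Write a value at a simplified dotted path, creating parents as needed."""
    keys = _keys(path)
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def mask_fields(record, fields_to_mask, placeholder="XXXX"):
    """Return a copy of the record with the listed fields replaced (placeholder is hypothetical)."""
    masked = copy.deepcopy(record)
    for path in fields_to_mask:
        set_path(masked, path, placeholder)
    return masked


def build_index_doc(record, index_mapping, field_mapping):
    """Fill a copy of the index skeleton using inJsonPath -> outJsonPath pairs."""
    out = copy.deepcopy(index_mapping)
    for mapping in field_mapping:
        set_path(out, mapping["outJsonPath"], get_path(record, mapping["inJsonPath"]))
    return out


# Hypothetical input record and configuration values.
record = {"applicationNumber": "TL-001", "owner": {"mobileNumber": "9999999999"}}
index_mapping = {"Data": {"tradelicense": {}, "owner": {}}}
field_mapping = [
    {"inJsonPath": "$.applicationNumber", "outJsonPath": "$.Data.tradelicense.applicationNumber"},
    {"inJsonPath": "$.owner.mobileNumber", "outJsonPath": "$.Data.owner.mobileNumber"},
]

doc = build_index_doc(mask_fields(record, ["$.owner.mobileNumber"]), index_mapping, field_mapping)
```

Here the masked copy of the input is mapped into a skeleton whose top-level key is `Data`, while the original record is left unchanged.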
Step 1: Write the configuration as per your requirement. The structure of the config file is explained later in the same doc.
Step 2: Check the config file in to a remote location, preferably GitHub. Currently, the files for the dev environment are checked into this folder: https://github.com/egovernments/configs/tree/DEV/egov-indexer
Step 3: Provide the absolute path of the checked-in file to the DevOps team. They will add it to the file-read path of egov-indexer by updating the environment manifest file, ensuring it is read at the time of the application's startup.
Step 4: Run the egov-indexer app. Since it is a consumer, it starts listening to the configured topics and indexes the data.
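What happens in Step 4 can be sketched as a lookup from topic to the definitions configured for it. The sketch below assumes the config files have already been parsed into dictionaries; the function names, the topic name and the index names are hypothetical, and `index_fn` stands in for whatever actually writes to Elasticsearch.

```python
from collections import defaultdict


def build_topic_lookup(module_configs):
    """Group the definitions of all module configs by the topic they listen on."""
    lookup = defaultdict(list)
    for config in module_configs:
        for definition in config.get("mappings", []):
            lookup[definition["topic"]].append(definition)
    return dict(lookup)


def handle_record(lookup, topic, record, index_fn):
    """Call index_fn once per index configured for the topic; return the number of calls."""
    count = 0
    for definition in lookup.get(topic, []):
        for index in definition.get("indexes", []):
            index_fn(index["name"], record)
            count += 1
    return count


# Hypothetical, already-parsed module configuration.
tl_config = {
    "serviceName": "Trade License",
    "mappings": [
        {
            "topic": "save-tl-tradelicense",
            "configKey": "INDEX",
            "indexes": [{"name": "tlindex-v1"}, {"name": "tl-summary-v1"}],
        }
    ],
}

sent = []
lookup = build_topic_lookup([tl_config])
handled = handle_record(lookup, "save-tl-tradelicense", {"id": "a1"}, lambda name, rec: sent.append(name))
ignored = handle_record(lookup, "some-other-topic", {"id": "b2"}, lambda name, rec: sent.append(name))
```

A record on a configured topic is indexed once per entry under `indexes`; a record on a topic with no definition is ignored.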
Click here to access the indexer configuration details.
a) POST /{key}/_index
Receives data and indexes it. The index config files must contain a mapping whose topic is {key}.
b) POST /_reindex
Migrates data from one index to another.
c) POST /_legacyindex
Runs the LegacyIndex job to index data from the DB. The request body must specify the URL of the service that the indexer calls to fetch the data.
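As a rough illustration of endpoint (a), the sketch below builds a POST request for `/{key}/_index` without sending it. The base URL, the topic used as `{key}` and the record are assumptions; the request bodies for `/_reindex` and `/_legacyindex` are defined by the service's API contract and are not reproduced here.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8080"  # hypothetical host and port


def build_index_request(key, record):
    """Build (but do not send) a POST to /{key}/_index carrying the record as JSON."""
    body = json.dumps(record).encode("utf-8")
    return Request(
        f"{BASE_URL}/{key}/_index",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The key must match a topic present in the index config files.
record = {"Licenses": [{"id": "a1", "tenantId": "pb.amritsar"}]}
req = build_index_request("save-tl-tradelicense", record)
```

Passing `req` to `urllib.request.urlopen` would send it to a running service.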
For legacy indexing, and for LiveIndex of collection-service records, kafka-connect handles part of the work of pushing records to Elasticsearch. For more details, please refer to the document mentioned in the document list.
The Indexer uses one config file per module to store all the configurations pertaining to that module. It reads multiple such files at start-up to support indexing for all the configured modules. In the config, we define the source and destination Elasticsearch index names, custom mappings for data transformation, and mappings for data enrichment.
Below is the sample configuration for indexing TL application creation data into Elasticsearch.
The table below lists the key configuration variables.
Variable Name | Description |
---|---|
serviceName | Name of the module to which this configuration belongs. |
summary | Summary of the module. |
version | Version of the configuration. |
mappings | List of definitions within the module. Every definition corresponds to one index requirement. Every object received on the Kafka queue can be used to create multiple indexes, and each of these indexes needs its own configuration; all such configurations belonging to one topic form one entry in the mappings list. The keys listed below together form one definition, and multiple such definitions are part of this mappings key. |
topic | The topic on which the data must be received to activate this particular configuration. |
configKey | Identifies the type of job this config is for. Values: INDEX (LiveIndex), REINDEX (Reindex), LEGACYINDEX (LegacyIndex). |
indexes | Holds multiple index configurations for the data received on a particular topic. Multiple indexes, each based on a different requirement, can be created from the same object. |
name | Index name in Elasticsearch. (The index is created if no index with this name exists.) |
type | Document type within that index to which the index JSON has to go. (Elasticsearch uses the structure index/type/docId to locate any document within index/type with id = docId.) |
id | Takes comma-separated JSONPaths. Each JSONPath is applied to the record received on the queue; the values obtained are appended and used as the ID of the record. |
isBulk | Boolean key that identifies whether the JSON received on the queue is from a bulk API, that is, whether the JSON contains a list at the top level. |
jsonPath | Used when indexing only a part of the input JSON, and when indexing a custom JSON whose values are to be fetched from this part of the input. |
timeStampField | JSONPath of the field in the input that can be used to obtain the timestamp of the input. |
fieldsToBeMasked | A list of JSONPaths of the input fields to be masked in the index. |
customJsonMapping | Used when building an entirely different object from the input JSON on the queue. |
indexMapping | A skeleton/mapping of the JSON that is to be indexed. This JSON must always contain a key called "Data" at the top level, and the custom mapping begins within this key. This is only a convention to simplify dashboarding on Kibana when data from multiple indexes has to be fetched for a single dashboard. |
fieldMapping | Contains a list of configurations. Each configuration contains keys that identify the field of the input JSON to be mapped to a field of the index JSON defined under the key 'indexMapping' in the config. |
inJsonPath | JSONPath of the field in the input. |
outJsonPath | JSONPath of the field in the index JSON. |
externalUriMapping | Contains a list of configurations. Each configuration contains keys that identify the field of the input JSON to be enriched using APIs from external services. The configuration for those APIs is also part of this. |
path | URI of the API to be used. (It should be a POST /_search API.) |
queryParam | Configuration of the query params to be used for the API call. It is a comma-separated list of key-value pairs, where the key is the parameter name as per the API contract and the value is the JSONPath of the field to be equated against this parameter. |
apiRequest | Request body of the API. (Since only _search APIs are used, it should contain only RequestInfo.) |
uriResponseMapping | Contains a list of configurations. Each configuration contains two keys: one is a JSONPath that identifies the field in the response, and the other is a JSONPath that maps the response field to a field of the index JSON defined under the key 'indexMapping'. |
mdmsMapping | Contains a list of configurations. Each configuration contains keys that identify the field of the input JSON to be denormalized using APIs from the MDMS service. The configuration for those MDMS APIs is also part of this. |
path | URI of the API to be used. (It should be a POST /_search API.) |
moduleName | Module name from MDMS. |
masterName | Master name from MDMS. |
tenantId | Tenant ID to be used. |
filter | Filter to be applied to the data to be fetched. |
filterMapping | Maps fields of the input JSON to variables in the filter. |
variable | Variable in the filter. |
valueJsonpath | JSONPath of the input field to be mapped to the variable. |
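How the id, isBulk and jsonPath keys might work together can be sketched as follows. This is an illustration under stated assumptions, not the indexer's code: the helper names and the sample payload are hypothetical, only plain dotted paths are supported, the id values are concatenated without a separator, and isBulk is assumed to apply to the part of the input selected by jsonPath.

```python
def _keys(path):
    # Simplification: only plain '$.a.b.c' paths; no wildcards, filters or array indexes.
    return [k for k in path.lstrip("$").split(".") if k]


def get_path(doc, path):
    """Read the value at a simplified dotted path."""
    node = doc
    for key in _keys(path):
        node = node[key]
    return node


def build_id(record, id_paths):
    """Apply each comma-separated path to the record and append the values (no separator assumed)."""
    return "".join(str(get_path(record, p.strip())) for p in id_paths.split(","))


def extract_documents(payload, index_config):
    """Return (id, document) pairs for one entry of the indexes list."""
    json_path = index_config.get("jsonPath")
    part = get_path(payload, json_path) if json_path else payload
    # Assumption: with isBulk the selected part is a list of records.
    items = part if index_config.get("isBulk") else [part]
    return [(build_id(item, index_config["id"]), item) for item in items]


# Hypothetical payload and index configuration.
payload = {
    "Licenses": [
        {"id": "a1", "tenantId": "pb.amritsar"},
        {"id": "b2", "tenantId": "pb.amritsar"},
    ]
}
index_config = {"name": "tlindex-v1", "id": "$.id,$.tenantId", "isBulk": True, "jsonPath": "$.Licenses"}

docs = extract_documents(payload, index_config)
ids = [doc_id for doc_id, _ in docs]
```

Each record in the bulk payload gets its own ID built from the configured paths, so re-sending the same record would target the same document.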