[Elasticsearch] How to know if indexing operations are overloading the cluster.
Indexing pressure on the nodes can increase because of external operations performed by users or internal operations performed by the Elasticsearch cluster itself, such as recovery and cross-cluster replication.
If the cluster is heavily overloaded with indexing work, other operations such as search, cluster coordination and background processing will be severely affected.
To avoid this problem, Elasticsearch internally monitors the memory consumed by in-flight indexing requests. When it reaches the limit, which defaults to 10% of the heap (the indexing_pressure.memory.limit setting), new indexing requests are rejected. Unless the client retries a rejected request, the data it carried is lost.
To prevent this, it is recommended to monitor where your cluster nodes currently stand in relation to this limit.
As of version 7.9, the _nodes/stats API returns this information for each node in a very simple way.
  "indexing_pressure" : {
    "memory" : {
      "current" : {
        "combined_coordinating_and_primary_in_bytes" : 0,
        "coordinating_in_bytes" : 0,
        "primary_in_bytes" : 0,
        "replica_in_bytes" : 0,
        "all_in_bytes" : 0
      },
      "total" : {
        "combined_coordinating_and_primary_in_bytes" : 12171,
        "coordinating_in_bytes" : 12171,
        "primary_in_bytes" : 13995,
        "replica_in_bytes" : 0,
        "all_in_bytes" : 12171,
        "coordinating_rejections" : 0,
        "primary_rejections" : 0,
        "replica_rejections" : 0
      },
      "limit_in_bytes" : 53687091
    }
  }
}
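A minimal Python sketch of how this section could be retrieved, using only the standard library. The base URL http://localhost:9200, the function names and the absence of authentication are assumptions made for illustration; a secured cluster would also need credentials and TLS settings.

```python
import json
import urllib.request


def build_stats_url(base_url: str) -> str:
    """Build the node stats URL restricted to the indexing_pressure metric."""
    return base_url.rstrip("/") + "/_nodes/stats/indexing_pressure"


def fetch_indexing_pressure(base_url: str = "http://localhost:9200") -> dict:
    """Return {node_id: indexing_pressure section} for every node in the cluster.

    Assumes an unauthenticated cluster reachable at base_url (hypothetical default).
    """
    with urllib.request.urlopen(build_stats_url(base_url), timeout=10) as resp:
        body = json.load(resp)
    # The response groups per-node statistics under the top-level "nodes" key.
    return {
        node_id: node["indexing_pressure"]
        for node_id, node in body["nodes"].items()
    }
```

Calling fetch_indexing_pressure() against a running cluster should yield one entry per node, each shaped like the fragment shown above.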
In indexing_pressure.memory.limit_in_bytes we can check the memory limit, in bytes, for write operations. In the sample above it is 53687091 bytes, which corresponds to 10% of a 512 MB heap.
In indexing_pressure.memory.current.all_in_bytes we can check how many bytes of memory all indexing phases together were using at the time the API was called.
Dividing the memory in use (current.all_in_bytes) by the limit (limit_in_bytes) gives the fraction of the limit currently consumed; multiplied by 100, it is the percentage used. This information is extremely valuable for knowing whether indexing work is overloading the cluster.
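The calculation can be sketched in Python as follows. The function name is hypothetical, and the input is assumed to be the indexing_pressure section of a single node, already parsed into a dictionary.

```python
def indexing_pressure_usage_percent(indexing_pressure: dict) -> float:
    """Percentage of the indexing memory limit in use on one node."""
    memory = indexing_pressure["memory"]
    limit = memory["limit_in_bytes"]
    used = memory["current"]["all_in_bytes"]
    # Guard against a zero limit to avoid a division error.
    return 100.0 * used / limit if limit else 0.0


# Values from the sample response above: nothing is in flight, so usage is 0%.
sample = {
    "memory": {
        "current": {"all_in_bytes": 0},
        "limit_in_bytes": 53687091,
    }
}
print(indexing_pressure_usage_percent(sample))  # 0.0
```

A monitoring job could evaluate this for every node on a schedule and raise an alert when the value stays above a threshold of your choosing.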
It is also worth watching the following properties.
indexing_pressure.memory.total.coordinating_rejections, total number of requests rejected in the coordinating phase since the node was started.
indexing_pressure.memory.total.primary_rejections, total number of requests rejected in the primary indexing phase since the node was started.
indexing_pressure.memory.total.replica_rejections, total number of requests rejected in the replication phase since the node was started.
If the API shows any rejections, it means that at some point since the node was started indexing requests were refused, and data may have been lost if the clients did not retry them. From there, a more detailed analysis is needed; possibly more nodes should be added to the cluster.
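Because these counters are cumulative since the node started, comparing two samples shows whether rejections are still occurring. Below is a minimal Python sketch of that comparison; the function names are hypothetical, and each input is assumed to be the indexing_pressure section of the same node taken at two different times.

```python
REJECTION_KEYS = (
    "coordinating_rejections",
    "primary_rejections",
    "replica_rejections",
)


def rejection_counts(indexing_pressure: dict) -> dict:
    """Extract the cumulative rejection counters of one node."""
    total = indexing_pressure["memory"]["total"]
    return {key: total.get(key, 0) for key in REJECTION_KEYS}


def new_rejections(previous: dict, current: dict) -> dict:
    """Rejections that happened between two samples of the same node.

    A node restart resets the counters, so negative differences are
    clamped to zero instead of being reported.
    """
    return {
        key: max(current[key] - previous[key], 0)
        for key in REJECTION_KEYS
    }


earlier = rejection_counts({"memory": {"total": {
    "coordinating_rejections": 0, "primary_rejections": 0, "replica_rejections": 0}}})
later = rejection_counts({"memory": {"total": {
    "coordinating_rejections": 3, "primary_rejections": 1, "replica_rejections": 0}}})
print(new_rejections(earlier, later))
```

Any non-zero value in the result indicates that requests were rejected during the interval between the two samples.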
At the moment it is only possible to track the memory cost of indexing, but future versions are expected to make it possible to track the CPU cost as well.
You can follow the development progress of this feature here.
You can read more about indexing_pressure here.