Adventure Tale: Upgrading an Elasticsearch Cluster with 1.35 billion documents
From 6.2 to 8.6.2
Here at Frontpage, Elasticsearch (ES) was running a pretty outdated version. A lot had changed in ES since then, ranging from license changes to API changes.
I went through the overall changelog to figure out what had been deprecated, and still missed some of it. We were using our own custom Alpine image of Elasticsearch.
One of the major changes between the version we were running and the one we wanted to upgrade to was how cluster creation, bootstrapping and node discovery work. Another was the removal of mapping types (_doc), which meant reindexing all documents on version 7.17.9.
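For context, one concrete piece of that discovery change: 7.x+ drops the old discovery.zen.* settings in favour of discovery.seed_hosts and cluster.initial_master_nodes. The snippet below is only a rough sketch of what has to go into elasticsearch.yml; the path and hostnames are made-up examples, not our actual config.
cat >> /usr/share/elasticsearch/config/elasticsearch.yml <<'EOF'
# 7.x+ discovery: seed hosts replace discovery.zen.ping.unicast.hosts
discovery.seed_hosts: ["es-master-0.es-master-headless", "es-master-1.es-master-headless", "es-master-2.es-master-headless"]
# only needed on the very first bootstrap of a brand-new cluster
cluster.initial_master_nodes: ["es-master-0", "es-master-1", "es-master-2"]
EOF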
My previous experience with Elasticsearch was limited to writing small queries and fixing queries on Amazon OpenSearch. I did have an in-depth understanding of the architecture, which was a real plus when optimizing the reindexing process.
It took around 5 days to:
create images for the following versions on Alpine:
6.8.23
7.0.0
7.17.9
8.0.0
8.6.2
Due to an Alpine bug, the nodes were not able to discover the StatefulSet pods, so we had to rebuild the images on Ubuntu.
Figure out how to reindex at around 150k requests per second
Preparing for Upgrade
curl to restrict shard allocation to primaries (avoiding shard thrashing) while upgrading
curl --location --globoff --request PUT 'http://{{url}}/_cluster/settings' \
--header 'Content-Type: application/json' \
--data '{
"persistent": {
"cluster.routing.rebalance.enable": "all",
"cluster.routing.allocation.enable": "primaries"
}
}'
Curl to make sure no dangling indices are created
curl --location --globoff --request PUT '{{url}}/*/_settings' \
--header 'Content-Type: application/json' \
--data '{
"index": {
"blocks.read_only": true
}
}'
curl --location --globoff --request POST 'http://{{url}}/_flush' \
--header 'Content-Type: application/json' \
--data ''
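Once the upgrade is done, these settings have to be rolled back, otherwise replicas never get allocated and the indices stay read-only. This isn't from our runbook, just the standard way to reset them (same {{url}} placeholder):
curl --location --globoff --request PUT 'http://{{url}}/_cluster/settings' \
--header 'Content-Type: application/json' \
--data '{
"persistent": {
"cluster.routing.allocation.enable": null
}
}'
curl --location --globoff --request PUT '{{url}}/*/_settings' \
--header 'Content-Type: application/json' \
--data '{
"index": {
"blocks.read_only": false
}
}'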
Tweaks for Faster Indexing
We had 2 data nodes, so having an even number of shards with rebalancing enabled helped spread the load evenly between the nodes.
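To sanity-check that the shards really were spread evenly across the two data nodes, the standard _cat APIs are enough; something like:
curl --location --globoff 'http://{{url}}/_cat/allocation?v'
curl --location --globoff 'http://{{url}}/_cat/shards?v'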
curl for index templates
We created an index template for each index with custom settings. We found that mmapfs works much faster for us than hybridfs. refresh_interval can be -1, but keeping it at 30s lets us see how many documents have been indexed so far. The other settings are there to reduce the number of merges and to avoid spending too much time on translog fsyncs.
curl --location --globoff --request PUT '{{url}}/_template/index_template' \
--header 'Content-Type: application/json' \
--data '{
"order": 0,
"index_patterns": [
"toy_index-*"
],
"settings": {
"index.merge.policy.max_merge_at_once": 10,
"index.merge.scheduler.max_thread_count": 10,
"index.merge.scheduler.max_merge_count": 25,
"index.merge.policy.floor_segment": "200mb",
"index.merge.policy.segments_per_tier": 55,
"index.merge.policy.max_merged_segment": "10gb",
"index.translog.durability": "async",
"index.warmer.enabled": "false",
"index": {
"refresh_interval": "30s",
"store": {
"type": "mmapfs"
},
"number_of_shards": "6",
"number_of_replicas": "0",
"blocks": {
"write": "false"
}
}
},
"aliases": {}
}'
curl for reindexing
Adding scroll, slices and rps.
curl --location --globoff 'http://{{url}}/_reindex?slices=12&wait_for_completion=false&scroll=30m&requests_per_second=100000' \
--header 'Content-Type: application/json' \
--data '{
"source": {
"index": "toy_index"
},
"dest": {
"index": "toy_index-reindexed"
}
}'
Finding out task id for reindexing
curl --location --globoff 'http://{{url}}/_tasks/?detailed=true&actions=indices%3Adata%2Fwrite%2Freindex'
Increasing rps for a task id
There is a _rethrottle API to change the rps of an already-running task
curl --location --globoff --request POST 'http://{{url}}/_reindex/{parent_task_id}/_rethrottle?requests_per_second=1200000' \
--header 'Content-Type: application/json' \
--data ''
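To keep an eye on a single reindex task (or kill it if something looks wrong), the tasks API also works with an individual task ID; {task_id} below stands for the node_id:task_number value returned by the listing above.
curl --location --globoff 'http://{{url}}/_tasks/{task_id}'
curl --location --globoff --request POST 'http://{{url}}/_tasks/{task_id}/_cancel' \
--header 'Content-Type: application/json' \
--data ''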
By modifying the .yml config we gave around 2 GB of indexing buffer to each index and 32 GB to the JVM heap. Our nodes were spot instances with 64 GB of memory and 16 cores.
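I'm not reproducing our exact files here; as a rough sketch, the heap goes into jvm.options (or jvm.options.d/ on newer versions) and the indexing buffer into elasticsearch.yml. The paths are the stock ones, the values mirror the numbers above, and note that indices.memory.index_buffer_size is a node-wide pool shared by active shards, so "per index" is only approximate.
# JVM heap (stock override location on 7.7+)
mkdir -p /usr/share/elasticsearch/config/jvm.options.d
printf '%s\n' '-Xms32g' '-Xmx32g' > /usr/share/elasticsearch/config/jvm.options.d/heap.options
# node-wide indexing buffer pool, shared across active shards
echo 'indices.memory.index_buffer_size: 2gb' >> /usr/share/elasticsearch/config/elasticsearch.yml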
Apart from ES-level changes we also modified node-level config:
sudo sysctl -w vm.vfs_cache_pressure=50
sudo sysctl -w vm.swappiness=3
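These sysctl calls don't survive a reboot (and spot nodes get recycled), so it's worth persisting them as well, for example:
printf '%s\n' 'vm.vfs_cache_pressure=50' 'vm.swappiness=3' | sudo tee /etc/sysctl.d/99-elasticsearch.conf
sudo sysctl --system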
Apart from these settings, you have to give a healthy amount of memory to the master nodes in a write-heavy cluster, or even the master can become a chokepoint.
Armed with this experience I was pretty confident that we would be able to migrate easily without incident. And oh boy, was I wrong.
Incidents
Small
missed that the update API had changed in the newer versions of Elasticsearch (see the sketch after this list)
missed a custom string analyzer that was present in one of the indices
missed the [date_histogram] unknown field [interval], did you mean [fixed_interval]? error, which led to failing date_histogram aggregations
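The update API change is part of the typeless-API migration: the mapping type is gone from the path. A minimal before/after (index name and document ID are made up):
# 6.x style (mapping type in the path) — no longer supported
curl --location --globoff --request POST 'http://{{url}}/toy_index/_doc/1/_update' \
--header 'Content-Type: application/json' \
--data '{"doc": {"title": "new title"}}'
# 7.x+ typeless style
curl --location --globoff --request POST 'http://{{url}}/toy_index/_update/1' \
--header 'Content-Type: application/json' \
--data '{"doc": {"title": "new title"}}'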
Cluster Fucked
How it happened
We have a cron-job to downscale some of our pods
The pod downscale led to a node downscale
The Elasticsearch masters were scheduled on those same nodes, which led to 2 out of 3 ES masters being killed at once
Because we had deployed es-master with ephemeral storage instead of persistent volumes, the two killed pods formed a brand-new cluster when they came back up. The cluster UUID changed, and the remaining old master and the data nodes were not able to join the new cluster.
How we fixed it
command:
- /bin/bash
- '-c'
- '--'
args:
- while true; do sleep 30; done;
We added the above to the es-data StatefulSet and removed the liveness probe. Then we sshed into the pod and ran `cd bin`.
https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html
./elasticsearch-node detach-cluster
then started the data node manually with
./run.sh (our custom script to start node)
We did the above for the other data node as well.
To restore the dangling indices
GET /_dangling
POST /_dangling/<index-uuid>?accept_data_loss=true
We created a new master StatefulSet with PVCs attached, then ran the command below for the old masters:
POST /_cluster/voting_config_exclusions?node_names=old-master-0,old-master-1,old-master-2
This elected one of the pods from the new StatefulSet as the master.
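One follow-up worth noting (per the Elasticsearch docs): voting config exclusions are meant to be temporary, so once the new masters are stable they should be cleared, roughly like this:
curl --location --globoff --request DELETE 'http://{{url}}/_cluster/voting_config_exclusions'
# add ?wait_for_removal=false if the excluded nodes are still part of the cluster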
Lessons
Always add a PodDisruptionBudget for your infra pods. ALWAYS.
NEVER ever schedule infra pods alongside your general workload pods (see the sketch below).
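A rough sketch of what those two lessons translate to in Kubernetes terms (the names and label selectors here are hypothetical): a PodDisruptionBudget so no more than one master can be evicted at a time, and a taint so general workloads stay off the infra nodes.
# PodDisruptionBudget: keep at least 2 of the 3 es-master pods available
kubectl create poddisruptionbudget es-master-pdb --selector=app=es-master --min-available=2
# taint the infra nodes so only pods that tolerate the taint get scheduled there
kubectl taint nodes infra-node-1 dedicated=infra:NoSchedule
The ES pods then need a matching toleration (and ideally a nodeSelector or affinity) so they, and only they, land on those nodes.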
Return on Investment
We improved the p95 of our search requests by a huge 300%, bringing it under 1.5 seconds, running on the same infra.