Adventure Tale: Upgrading an Elasticsearch Cluster with 1.35 billion documents
From 6.2 to 8.6.2
Here at Frontpage, Elasticsearch (ES) was running a pretty outdated version. A lot had changed in ES since then, ranging from license changes to API changes.
I went through the overall changelog to figure out what had been deprecated, and still missed some of it. We were using our own custom Alpine image of Elasticsearch.
One of the major changes between the version we were running and the one we wanted to upgrade to was how cluster creation, bootstrapping and node discovery work. Another was the removal of mapping types (_doc), which meant reindexing all documents on version 7.17.9.
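For context, one concrete piece of that discovery change: 7.x+ drops the old discovery.zen.* settings in favour of discovery.seed_hosts and cluster.initial_master_nodes. The snippet below is only a rough sketch of what has to go into elasticsearch.yml; the path and hostnames are made-up examples, not our actual config.
cat >> /usr/share/elasticsearch/config/elasticsearch.yml <<'EOF'
# 7.x+ discovery: seed hosts replace discovery.zen.ping.unicast.hosts
discovery.seed_hosts: ["es-master-0.es-master-headless", "es-master-1.es-master-headless", "es-master-2.es-master-headless"]
# only needed on the very first bootstrap of a brand-new cluster
cluster.initial_master_nodes: ["es-master-0", "es-master-1", "es-master-2"]
EOF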
My previous experience with Elasticsearch was limited to writing small queries and fixing queries on Amazon OpenSearch. I did have an in-depth understanding of the architecture, which was a real plus when optimizing the reindexing process.
It took around 5 days to:
create images for the following versions on Alpine:
6.8.23
7.0.0
7.17.9
8.0.0
8.6.2
Due to an Alpine bug, the nodes were not able to discover the StatefulSet pods, so we had to rebuild the images on Ubuntu.
Figure out how to reindex at around 150k requests per second
Preparing for Upgrade
curl to restrict shard allocation to primaries (avoiding shard thrashing) while upgrading
curl --location --globoff --request PUT 'http://{{url}}/_cluster/settings' \
--header 'Content-Type: application/json' \
--data '{
"persistent": {
"cluster.routing.rebalance.enable": "all",
"cluster.routing.allocation.enable": "primaries"
}
}'
Curl to make sure no dangling indices are created
curl --location --globoff --request PUT '{{url}}/*/_settings' \
--header 'Content-Type: application/json' \
--data '{
"index": {
"blocks.read_only": true
}
}'
curl --location --globoff --request POST 'http://{{url}}/_flush' \
--header 'Content-Type: application/json' \
--data ''
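Once the upgrade is done, these settings have to be rolled back, otherwise replicas never get allocated and the indices stay read-only. This isn't from our runbook, just the standard way to reset them (same {{url}} placeholder):
curl --location --globoff --request PUT 'http://{{url}}/_cluster/settings' \
--header 'Content-Type: application/json' \
--data '{
"persistent": {
"cluster.routing.allocation.enable": null
}
}'
curl --location --globoff --request PUT '{{url}}/*/_settings' \
--header 'Content-Type: application/json' \
--data '{
"index": {
"blocks.read_only": false
}
}'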
Tweaks for Faster Indexing
We had 2 data nodes, so having an even number of shards with rebalancing enabled helped spread the load evenly between the nodes.
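To sanity-check that the shards really were spread evenly across the two data nodes, the standard _cat APIs are enough; something like:
curl --location --globoff 'http://{{url}}/_cat/allocation?v'
curl --location --globoff 'http://{{url}}/_cat/shards?v'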
curl for index templates
We created an index template for each index with custom settings. We found that mmapfs works much faster for us than hybridfs. refresh_interval can be -1, but keeping it at 30s lets us see how many documents have been indexed so far. The other settings are there to reduce the number of merges and to avoid spending too much time on translog fsyncs.
curl --location --globoff --request PUT '{{url}}/_template/index_template' \
--header 'Content-Type: application/json' \
--data '{
"order": 0,
"index_patterns": [
"toy_index-*"
],
"settings": {
"index.merge.policy.max_merge_at_once": 10,
"index.merge.scheduler.max_thread_count": 10,
"index.merge.scheduler.max_merge_count": 25,
"index.merge.policy.floor_segment": "200mb",
"index.merge.policy.segments_per_tier": 55,
"index.merge.policy.max_merged_segment": "10gb",
"index.translog.durability": "async",
"index.warmer.enabled": "false",
"index": {
"refresh_interval": "30s",
"store": {
"type": "mmapfs"
},
"number_of_shards": "6",
"number_of_replicas": "0",
"blocks": {
"write": "false"
}
}
},
"aliases": {}
}'
curl for reindexing
Adding scroll, slices and rps.
curl --location --globoff 'http://{{url}}/_reindex?slices=12&wait_for_completion=false&scroll=30m&requests_per_second=100000' \
--header 'Content-Type: application/json' \
--data '{
"source": {
"index": "toy_index"
},
"dest": {
"index": "toy_index-reindexed"
}
}'
Finding out task id for reindexing
curl --location --globoff 'http://{{url}}/_tasks/?detailed=true&actions=indices%3Adata%2Fwrite%2Freindex'
Increasing rps for a task id
There is a _rethrottle API to change the rps of an already-running task
curl --location --globoff --request POST 'http://{{url}}/_reindex/{parent_task_id}/_rethrottle?requests_per_second=1200000' \
--header 'Content-Type: application/json' \
--data ''
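To keep an eye on a single reindex task (or kill it if something looks wrong), the tasks API also works with an individual task ID; {task_id} below stands for the node_id:task_number value returned by the listing above.
curl --location --globoff 'http://{{url}}/_tasks/{task_id}'
curl --location --globoff --request POST 'http://{{url}}/_tasks/{task_id}/_cancel' \
--header 'Content-Type: application/json' \
--data ''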
By modifying the .yml config we gave around 2 GB of indexing buffer to each index and 32 GB to the JVM heap. Our nodes were spot instances with 64 GB of memory and 16 cores.
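I'm not reproducing our exact files here; as a rough sketch, the heap goes into jvm.options (or jvm.options.d/ on newer versions) and the indexing buffer into elasticsearch.yml. The paths are the stock ones, the values mirror the numbers above, and note that indices.memory.index_buffer_size is a node-wide pool shared by active shards, so "per index" is only approximate.
# JVM heap (stock override location on 7.7+)
mkdir -p /usr/share/elasticsearch/config/jvm.options.d
printf '%s\n' '-Xms32g' '-Xmx32g' > /usr/share/elasticsearch/config/jvm.options.d/heap.options
# node-wide indexing buffer pool, shared across active shards
echo 'indices.memory.index_buffer_size: 2gb' >> /usr/share/elasticsearch/config/elasticsearch.yml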
Apart from ES-level changes we also modified node-level config:
sudo sysctl -w vm.vfs_cache_pressure=50
sudo sysctl -w vm.swappiness=3
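These sysctl calls don't survive a reboot (and spot nodes get recycled), so it's worth persisting them as well, for example:
printf '%s\n' 'vm.vfs_cache_pressure=50' 'vm.swappiness=3' | sudo tee /etc/sysctl.d/99-elasticsearch.conf
sudo sysctl --system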
Apart from these settings, you have to give a healthy amount of memory to the master nodes in a write-heavy cluster, or even the master can become a chokepoint.
Armed with this experience I was pretty confident that we would be able to migrate easily without incident. And oh boy, was I wrong.
Incidents
Small
missed that the update API had changed in the newer versions of Elasticsearch (see the sketch after this list)
missed a custom string analyzer that was present in one of the indices
missed the [date_histogram] unknown field [interval], did you mean [fixed_interval]? error, which led to failing date_histogram aggregations
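The update API change is part of the typeless-API migration: the mapping type is gone from the path. A minimal before/after (index name and document ID are made up):
# 6.x style (mapping type in the path) — no longer supported
curl --location --globoff --request POST 'http://{{url}}/toy_index/_doc/1/_update' \
--header 'Content-Type: application/json' \
--data '{"doc": {"title": "new title"}}'
# 7.x+ typeless style
curl --location --globoff --request POST 'http://{{url}}/toy_index/_update/1' \
--header 'Content-Type: application/json' \
--data '{"doc": {"title": "new title"}}'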
Cluster Fucked
How it happened
We have a cron-job to downscale some of our pods
The pod downscale led to a node downscale
The Elasticsearch masters were scheduled on those same nodes, which led to 2 out of 3 ES masters being killed at once
Because we had deployed es-master with ephemeral storage instead of persistent volumes, the two killed pods formed a brand-new cluster when they came back up. The cluster UUID changed, and the remaining old master and the data nodes were not able to join the new cluster.
How we fixed it
command:
- /bin/bash
- '-c'
- '--'
args:
- while true; do sleep 30; done;
We added the above to the es-data StatefulSet and removed the liveness probe. Then we sshed into the pod and ran `cd bin`.
https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html
./elasticsearch-node detach-cluster
then started the data node manually with
./run.sh (our custom script to start node)
We did the above for the other data node as well.
To restore the dangling indices
GET /_dangling
POST /_dangling/<index-uuid>?accept_data_loss=true
We created a new master StatefulSet with PVCs attached, then ran the command below for the old masters:
POST /_cluster/voting_config_exclusions?node_names=old-master-0,old-master-1,old-master-2
This elected one of the pods from the new StatefulSet as the master.
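One follow-up worth noting (per the Elasticsearch docs): voting config exclusions are meant to be temporary, so once the new masters are stable they should be cleared, roughly like this:
curl --location --globoff --request DELETE 'http://{{url}}/_cluster/voting_config_exclusions'
# add ?wait_for_removal=false if the excluded nodes are still part of the cluster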
Lessons
Always add a PodDisruptionBudget for your infra pods. ALWAYS.
NEVER ever schedule infra pods alongside your general workload pods (see the sketch below).
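A rough sketch of what those two lessons translate to in Kubernetes terms (the names and label selectors here are hypothetical): a PodDisruptionBudget so no more than one master can be evicted at a time, and a taint so general workloads stay off the infra nodes.
# PodDisruptionBudget: keep at least 2 of the 3 es-master pods available
kubectl create poddisruptionbudget es-master-pdb --selector=app=es-master --min-available=2
# taint the infra nodes so only pods that tolerate the taint get scheduled there
kubectl taint nodes infra-node-1 dedicated=infra:NoSchedule
The ES pods then need a matching toleration (and ideally a nodeSelector or affinity) so they, and only they, land on those nodes.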
Return on Investment
We improved the p95 of our search requests by a huge 300%, bringing it under 1.5 seconds, running on the same infra.