In general, there are two different scenarios when it comes to indexing.
Either you have to deal with a constant stream of data, like logs, the Twitter stream, newsfeeds etc., or you have nightly database dumps.
There might also be cases where you have both: nightly database dumps mixed with tiny delta updates throughout the day.
Elasticsearch is built to handle indexing and searching simultaneously with good performance.
There are separate thread pools reserved for searching and indexing, and Elasticsearch will always try to distribute resources fairly between the two types of operations so that the system stays balanced.
What does that mean? That by default Elasticsearch is not optimized for super fast indexing.
You have to tune a few parameters and know a few tricks to speed up indexing.
If you have the constant indexing scenario, you have to ask yourself (or your boss) whether it’s more important to index data quickly or to search the same data in near real time. That very much depends on your use case, your application and the SLAs that you have to fulfill.
It will also depend on the load of your cluster. Maybe you have a fairly small cluster that needs to ingest lots of logs and you don’t have a persistent queue to buffer them. And if you’re in a security use case, for example, you cannot afford to lose a single log event.
In that case you should really focus on fast ingestion and accept that your events become searchable with a slight delay.
Index refresh interval
Why is it important to know how quickly the documents from your event stream need to become searchable?
One of the most important parameters when it comes to tuning indexing performance is the index refresh interval (“index.refresh_interval”).
When you index a document in Elasticsearch it is not immediately searchable. It will be searchable by default after 1 second.
When the index is refreshed a new segment is created in Lucene.
Usually it does not make sense to create a new segment for every single document. So Elasticsearch waits 1 second and then writes one new segment for all documents that have been indexed, updated or deleted in the meantime. That does not sound very long. Exactly, it’s not very long. The more data you want to squeeze in before the index is refreshed and a new segment is written, the longer this refresh interval should be.
For write-heavy use cases, a refresh interval of 30 or 60 seconds is a better choice.
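For example, you could raise the refresh interval via the update index settings API; the index name my-index below is just a placeholder:
PUT my-index/_settings
{
  "refresh_interval": "30s"
}
Setting the value to -1 disables automatic refreshes completely; just don’t forget to set it back once the heavy indexing is done.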
If you want to know more about the anatomy of an indexing operation, have a look here.
Bulk indexing
You should always use the Bulk API instead of sending single indexing requests. The HTTP overhead of sending just 1 document to Elasticsearch and having your client send the request and digest the response is massive. Never use the Index API for multiple documents in production. Always use the Bulk API.
How many documents you should put into 1 bulk request is always worth performance testing. Start with 100 documents and measure the time this takes in your client application.
Then continue with 50 or 200 documents and try to find the sweet spot.
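For reference, a minimal bulk request looks like the sketch below (the index name, IDs and document fields are just placeholder examples). Each action line is followed by the document source on the next line:
POST _bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "message": "first event" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "message": "second event" }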
Parallelize
In most cases Elasticsearch is not the cause of the problem when you observe that you cannot index enough events per second.
In most cases the client application is simply not saturating Elasticsearch’s indexing capacity!
Try to parallelize your indexing requests.
As with finding the perfect bulk size for your project, you need to test and find the ideal number of parallel bulk indexing requests. Try with 2, then with 10, and then narrow down to the sweet spot in between.
Use more primary shards
If you have only 1 primary shard, have tried all of the above and your event rate per second is still below your expectations, you need to increase the number of primary shards. All indexing requests are evenly distributed across the shards by the document routing algorithm, so the more primary shards you add, the less indexing work each individual shard has to perform.
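Keep in mind that the number of primary shards is fixed when an index is created, so you have to set it upfront (or reindex or split into a new index). A minimal sketch, with the index name and shard count as placeholder values:
PUT my-index
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}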
Everything we have talked about so far can be used to speed up both types of indexing: stream and dump.
Now let’s highlight a few techniques that you can use for short-term heavy indexing, like a nightly database dump.
Disable replication
If your nightly dump takes 30 minutes and you know that you have some maintenance time for your system you can briefly disable replication.
Set the “number_of_replicas” of your index to 0 via the update index settings API.
PUT test/_settings
{
"number_of_replicas": 0
}
Once you’re done with the dump, you can re-enable 1 or more replicas.
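To re-enable replication, use the same update index settings API and set the value back to 1 (or however many replicas you need):
PUT test/_settings
{
  "number_of_replicas": 1
}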
Why? When you index with replica shards, each document is first indexed on the primary shard and then again on all replica shards.
If you add a replica shard later, the existing segments are simply copied over via the transport protocol. A segment usually contains more than one document, already digested into all of its data structures, so copying whole segments is preferred over replicating document by document.
That is faster than having each replica node repeat the same indexing work.
That’s about it. The best tips and tricks for tuning indexing speed can be found here.
Of course there are always more performance-boosting tricks and parameters to tune, but these are the most important ones.