Elasticsearch, NFS, and locking issues
As I migrated my ELK stack to Docker backed by an NFS share, my Elasticsearch container started spitting out errors about being unable to obtain a lock.
The error I was getting was the following:
prod_elasticsearch.1.xmwr7padq9d0@docker3 | {"type": "server", "timestamp": "2019-07-13T17:21:21,366+0000", "level": "WARN", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "2f81a7e8e374", "message": "uncaught exception in thread [main]" ,
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:298) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "... 6 more",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "Caused by: java.io.IOException: failed to obtain lock on /usr/share/elasticsearch/data/nodes/0",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:220) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:268) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "... 6 more",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "Caused by: java.io.IOException: No locks available",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:96) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1161) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at java.nio.channels.FileChannel.tryLock(FileChannel.java:1165) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:126) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:213) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:268) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3 | "... 6 more"] }
Searching around, I found a number of suggestions to fix it:
- Check for rogue running Elasticsearch instances (none were running). On each individual Docker server: ps -ef | grep java
- Verify file permissions (they were fine; the node.lock file was recreated): chown -R 1000:1000 /docker/volumes/elasticsearch/data && chmod -R 777 /docker/volumes/elasticsearch/data
- Remove the node.lock file (didn't fix it): rm /docker/volumes/elasticsearch/data/nodes/0/node.lock
- Restart all Docker servers and NFS servers. Still no good.
- Verify the RPC services. They were fine, returning what they should: rpcinfo -u $NFSSERVER status
- Don't use async; force the client to use synchronous I/O. I think this meant adding sync to the mount options. Still failed.
- Make sure the lock daemons were running. I don't remember the service names, but they were.
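One more check that would have been useful in hindsight: confirming which NFS protocol version the client actually negotiated, since the server can hand you a different version than the one you asked for. A minimal sketch (the sample options string is hypothetical) that pulls the vers= option out of a mount's option list, as found in the fourth field of /proc/mounts:

```shell
#!/bin/sh
# Extract the negotiated NFS protocol version from a mount options string,
# i.e. the fourth field of the matching /proc/mounts line.
nfs_version() {
  # $1 = mount options string, e.g. "rw,relatime,vers=3,rsize=..."
  echo "$1" | tr ',' '\n' | sed -n 's/^vers=//p'
}

# On a live host you would feed it the real options, for example:
#   opts=$(awk '$2 == "/docker/volumes/elasticsearch" {print $4}' /proc/mounts)
# Here we demonstrate with a sample (hypothetical) options string:
opts="rw,relatime,vers=3,rsize=1048576,wsize=1048576,hard,addr=192.0.2.10"
nfs_version "$opts"   # prints: 3
```

If this prints 3 when you expected 4, the lock failures below are the likely consequence.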
I even enabled trace logging to see if I could find it in there: https://www.frakkingsweet.com/elasticsearch-and-trace-logging-for-everything/
Unfortunately, that was still no help, but at least I know how to do it now. The logs were spitting out the same thing with nothing relevant before the error.
Early in this journey I had opted to use NFSv3, probably because Windows 10 works better with v3 than v4, which I think has to do with the root binding; I'll look into that later. That turned out to be the problem. NFSv3 doesn't handle locks well: it delegates locking to the separate NLM/statd sidecar services, whereas NFSv4 builds locking into the protocol itself.
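A quick way to reproduce the underlying failure outside of Elasticsearch is to try taking a file lock directly on the share. On a healthy mount flock succeeds; on an NFSv3 mount with unreachable lock daemons it fails with "No locks available", the same ENOLCK the Java stack trace bottoms out in. A sketch (the test path is hypothetical; point it at a file on the NFS-backed volume):

```shell
#!/bin/sh
# Try to take (and immediately release) an exclusive lock on a file.
# On an NFSv3 mount with broken lockd/statd this fails with
# "No locks available" -- the same ENOLCK Elasticsearch hit.
try_lock() {
  lockfile="$1"
  if flock -n "$lockfile" -c true; then
    echo "lock OK on $lockfile"
  else
    echo "lock FAILED on $lockfile"
  fi
}

# Hypothetical path; on a real system use a file on the NFS mount:
try_lock /tmp/locktest.lock
```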
My original bind options in my docker-compose.yml were o: addr=${NFSSERVER},vers=3,rw. My NFS server (a Debian system) already had v4 enabled, so changing vers=3 to vers=4 in the bind options fixed it. But remember: updating the volume options and redeploying the stack does not modify an already existing volume, and volumes don't get removed from the Docker Swarm nodes when they are no longer in use. I needed to go to each individual node and manually delete the volumes using docker volume rm prod_elasticsearchconfig prod_elasticsearchdata prod_elasticsearchlogs.
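That per-node cleanup can be scripted. A hedged sketch (the node hostnames are hypothetical) that, in dry-run mode, only prints the commands it would run over SSH so you can review them before pulling the trigger:

```shell
#!/bin/sh
# Remove the stale NFS-backed volumes on every swarm node so the updated
# driver options take effect on the next stack deploy.
# Node hostnames below are hypothetical; DRY_RUN=1 only prints commands.
NODES="docker1 docker2 docker3"
VOLUMES="prod_elasticsearchconfig prod_elasticsearchdata prod_elasticsearchlogs"
DRY_RUN=1

for node in $NODES; do
  cmd="ssh $node docker volume rm $VOLUMES"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
done
```

Volumes in use by a running task can't be removed, so remove the stack (or scale the service to zero) first.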
My new docker-compose.yml
file is now:
version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    environment:
      - "ES_JAVA_OPTS=-Xms7g -Xmx7g"
      - discovery.type=single-node
    volumes:
      - elasticsearchdata:/usr/share/elasticsearch/data
      - elasticsearchconfig:/usr/share/elasticsearch/config
      - elasticsearchlogs:/usr/share/elasticsearch/logs
    ports:
      - 9200:9200
    networks:
      - ext
volumes:
  elasticsearchdata:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/data"
      o: "addr=${NFSSERVER},vers=4,rw"
  elasticsearchconfig:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/config"
      o: "addr=${NFSSERVER},vers=4,rw"
  elasticsearchlogs:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/logs"
      o: "addr=${NFSSERVER},vers=4,rw"
networks:
  ext:
And that's it.
In my opinion, Elasticsearch in Docker is incompatible with NFSv3, but works just fine with NFSv4.