Elasticsearch, NFS, and locking issues

As I migrated my ELK stack to Docker backed by an NFS share, my Elasticsearch container started spitting out errors about not being able to obtain a lock.

The error I was getting was the following:

prod_elasticsearch.1.xmwr7padq9d0@docker3    | {"type": "server", "timestamp": "2019-07-13T17:21:21,366+0000", "level": "WARN", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "2f81a7e8e374",  "message": "uncaught exception in thread [main]" ,
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:298) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "... 6 more",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "Caused by: java.io.IOException: failed to obtain lock on /usr/share/elasticsearch/data/nodes/0",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:220) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:268) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "... 6 more",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "Caused by: java.io.IOException: No locks available",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:96) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1161) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at java.nio.channels.FileChannel.tryLock(FileChannel.java:1165) ~[?:?]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:126) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:213) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:268) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:271) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.0.jar:7.2.0]",
prod_elasticsearch.1.xmwr7padq9d0@docker3    | "... 6 more"] }

Searching around, I found a number of suggestions to fix it:

  • Check for rogue Elasticsearch instances (none were running)
    On each individual Docker server
    ps -ef | grep java
    
  • Verify file permissions (they were fine, the node.lock was recreated)
    chown -R 1000:1000 /docker/volumes/elasticsearch/data
    chmod -R 777 /docker/volumes/elasticsearch/data
    
  • Remove the node.lock file (didn't fix it)
    rm /docker/volumes/elasticsearch/data/nodes/0/node.lock
    
  • Restart all Docker servers and NFS servers; still no good
  • Verify the RPC services; they were fine, returning what they should
    rpcinfo -u $NFSSERVER status
    
  • Don't use async; force the client to use synchronous IO, which I believe meant adding sync to the mount options. Still failed.
  • Make sure the lock daemons were running. I don't remember the service names, but they were (see the sketch after this list).
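
For reference, here is a rough consolidation of those checks as a shell sketch. The rpc-statd service name, the status/nlockmgr RPC programs, and the exact paths are my assumptions about a typical Debian/NFS setup, so adjust them to your environment:

NFSSERVER=nfs.example.com                      # assumption: your NFS server
ESDATA=/docker/volumes/elasticsearch/data      # adjust to wherever the share is mounted on this node

# Any rogue Elasticsearch/Java processes on this Docker node?
ps -ef | grep [j]ava

# Is the NFS server answering RPC requests for status monitoring and lock management?
rpcinfo -u $NFSSERVER status                   # rpc.statd (NSM)
rpcinfo -u $NFSSERVER nlockmgr                 # lockd (NLM), only used by NFSv3

# Is the client-side status daemon running? (lockd itself is a kernel thread)
systemctl status rpc-statd

# Which NFS version and options is the share actually mounted with?
findmnt -T $ESDATA -o TARGET,SOURCE,FSTYPE,OPTIONS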

I even enabled trace logging to see if I could find anything useful in there: https://www.frakkingsweet.com/elasticsearch-and-trace-logging-for-everything/

Unfortunately, that was still no help, but at least I know how to do it now. The logs were spitting out the same thing with nothing relevant before the error.

Early in this journey I had opted to use NFSv3, probably because Windows 10 works better with v3 than v4 (something to do with the root binding, I think; I'll look into that later). That turned out to be the problem: NFSv3 doesn't handle locks very well.
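
A quick way to see whether locking works on the share is to try taking a lock on it directly from a Docker node. This is only a sketch: the path is the same one used in the commands above, and flock is just an approximation of the fcntl lock that Lucene actually takes, but on a broken NFSv3 mount it fails in the same way:

# Try to take an exclusive lock on a scratch file on the NFS-backed path
cd /docker/volumes/elasticsearch/data
flock -n -x locktest.lock -c 'echo lock acquired'   # expect "No locks available" if NFS locking is broken
rm -f locktest.lock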

My original bind options in my docker-compose.yml were o: addr=${NFSSERVER},vers=3,rw.

My NFS server (a Debian system) already had v4 enabled, so the fix was changing vers=3 to vers=4 in my bind options. But remember: updating the volume options and redeploying the stack does not modify an already existing volume, and volumes don't get removed from the Docker Swarm nodes when they are no longer in use. I needed to go to each individual node and manually delete the volumes using docker volume rm prod_elasticsearchconfig prod_elasticsearchdata prod_elasticsearchlogs before the new options took effect.
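
Roughly, the redeploy sequence looked like this. The stack name prod is an assumption based on the prod_ prefix on the volume names:

# On a manager node: take the stack down so nothing is holding the volumes open
docker stack rm prod

# On EACH swarm node: remove the stale volumes that were created with vers=3
docker volume rm prod_elasticsearchconfig prod_elasticsearchdata prod_elasticsearchlogs

# Back on the manager node: redeploy with the updated vers=4 options
docker stack deploy -c docker-compose.yml prod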

My updated docker-compose.yml file is now:

version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
    environment: 
      - "ES_JAVA_OPTS=-Xms7g -Xmx7g"
      - discovery.type=single-node
    volumes:
      - elasticsearchdata:/usr/share/elasticsearch/data
      - elasticsearchconfig:/usr/share/elasticsearch/config
      - elasticsearchlogs:/usr/share/elasticsearch/logs
    ports:
      - 9200:9200
    networks:
      - ext
volumes:
  elasticsearchdata:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/data"
      o: "addr=${NFSSERVER},vers=4,rw"
  elasticsearchconfig:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/config"
      o: "addr=${NFSSERVER},vers=4,rw"
  elasticsearchlogs:
    driver: local
    driver_opts:
      type: nfs
      device: ":/volumes/elasticsearch/logs"
      o: "addr=${NFSSERVER},vers=4,rw"
networks:
  ext:
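
After redeploying, it's worth confirming on the node running the container that the data volume really came up as NFSv4. Something along these lines should do it (nfsstat ships with nfs-common on Debian; the grep pattern is just the volume name):

# List NFS mounts on the node with their negotiated options (look for vers=4)
nfsstat -m

# Or narrow it down to the Docker-managed volumes
findmnt -t nfs,nfs4 | grep elasticsearch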

And that's it.

In my opinion, Elasticsearch in Docker is incompatible with NFSv3, but works just fine with NFSv4.