This document provides troubleshooting tips and solutions to common issues related to operating the DC/OS Storage Service and integrating it with other components.
How to monitor the DC/OS Storage Service
Grafana dashboards can provide additional insight into the DC/OS Storage Service. Sample dashboards are built into the DC/OS monitoring service (dcos-monitoring), which you can install from the DC/OS catalog. You can download the latest dashboards from the dashboard repository. The dashboards related to the DC/OS Storage Service are prefixed with Storage-.
Additionally, the DC/OS Storage Service generates metrics that can be used to create additional dashboards. All metrics related to the DC/OS Storage Service have a prefix of csidevices_, csilvm_, or dss_.
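When building a custom dashboard, it can help to narrow a long list of metric names down to the storage-related ones first. The following is a minimal sketch that relies only on the three prefixes listed above; the sample metric names are hypothetical and exist only to exercise the filter.

```python
# Prefixes named in this document for DC/OS Storage Service metrics.
DSS_METRIC_PREFIXES = ("csidevices_", "csilvm_", "dss_")


def is_storage_metric(name):
    """Return True if the metric name starts with one of the DSS prefixes."""
    return name.startswith(DSS_METRIC_PREFIXES)


def select_storage_metrics(names):
    """Keep only storage-related metric names, preserving input order."""
    return [name for name in names if is_storage_metric(name)]


if __name__ == "__main__":
    # Hypothetical metric names, not taken from a real cluster.
    sample = [
        "dss_example_total",
        "cpu_usage",
        "csilvm_example_seconds",
        "csidevices_example",
    ]
    for name in select_storage_metrics(sample):
        print(name)
```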
How to get the logs of an ‘lvm’ volume provider
The logs of an lvm volume provider consist of two parts:
- The last N lines of the csilvm volume plugin log can be obtained through the following CLI command (assuming the lvm provider is on node a221eeb3-b9c0-4e92-ae20-1e1d4af25321-S40):
  Or, you can SSH to the node and use journalctl to see the full log:
- The Storage Local Resource Provider (SLRP) log, which is part of the Mesos agent log, records the communications between the Mesos agent and the csilvm volume plugin. It can be retrieved through:
  Or alternatively, SSH to the node and run:
I created a ‘devices’ volume provider but it never comes ‘ONLINE’
If a devices volume provider stays stuck in PENDING, the following CLI command can provide more details:
If you see the Launching CSI plugin on agent message as shown above, check the following items:
- Check if the node of the provider (95f58562-c03f-4e01-808e-9dc3dbf75754-S0 in this example) is reachable from the storage task:
  If it returns JSON like the following, the storage task can reach the node:
  Otherwise, the cluster’s network is not operational, and the network problem needs to be resolved first.
- Check if the service account has all required permissions. The list of required permissions can be found in the install documentation.
- Check the DC/OS Storage Service log for any error message and investigate what caused the error. The following CLI command shows the last N lines of the log:
  For example, if the service account does not have sufficient permissions, you might see Access Denied in the log.
I created an ‘lvm’ volume provider but it never comes ‘ONLINE’
If an lvm volume provider fails to come online, it typically means that the provider cannot be created because some necessary condition is not met. DSS will continue trying to create the provider at regular intervals until it succeeds or you remove the provider using dcos storage provider remove --name=my-provider-1. Check the following items:
- Are the devices specified in the spec.plugin-configuration.devices list present in the list when you run dcos storage device list? Are they on the correct node?
- Is the network operational? Refer to the section I created a ‘devices’ volume provider but it never comes ‘ONLINE’ to test if the node is reachable from the storage task.
- Are the devices mounted or in use by another process on the node?
This troubleshooting example begins with the following provider configuration:
Suppose that creating the provider using the above JSON timed out and now it shows as PENDING when running dcos storage provider list.
First, check whether the device in question actually exists on the node.
The problem is that the provider is configured to use xvdx on agent ...-S40, but there is no such device on that node. Instead, it should use xvdy if it wants to run on node ...-S40.
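The comparison made in this step can be sketched in a few lines. This is an illustrative sketch, not part of the DC/OS tooling: the devices list follows the spec.plugin-configuration.devices path mentioned earlier, while the location of the node field in the provider JSON and the shape of the device inventory (a mapping you would transcribe from dcos storage device list) are assumptions.

```python
def missing_devices(provider, devices_by_node):
    """Return configured devices that are absent from the provider's node.

    provider: dict shaped like the provider JSON. The "spec.node" location is
        an assumption; "spec.plugin-configuration.devices" follows the text.
    devices_by_node: hypothetical mapping of node ID -> set of device names,
        for example transcribed from `dcos storage device list`.
    """
    spec = provider["spec"]
    node = spec["node"]
    configured = spec["plugin-configuration"]["devices"]
    present = devices_by_node.get(node, set())
    return [device for device in configured if device not in present]


# Inventory matching the example: node ...-S40 only has xvdy.
inventory = {"...-S40": {"xvdy"}}

faulty = {
    "name": "my-provider-1",
    "spec": {"node": "...-S40", "plugin-configuration": {"devices": ["xvdx"]}},
}
fixed = {
    "name": "my-provider-1",
    "spec": {"node": "...-S40", "plugin-configuration": {"devices": ["xvdy"]}},
}

print(missing_devices(faulty, inventory))  # ['xvdx']
print(missing_devices(fixed, inventory))   # []
```

An empty result only means the device names line up; it does not rule out the other causes listed in this section, such as a device that is mounted or in use.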
Next, remove the faulty provider.
Then, fix the JSON and submit the following modified configuration to create the provider again.
The command timed out again, even though the configuration is using the correct devices.
The next step is to rule out network connectivity problems in the cluster.
If the above command fails, the next step is to investigate whether the cluster’s network is healthy.
If the attempt to SSH to the node succeeds, move to the next step…
DSS launches a provider by writing a Mesos “Resource Provider Configuration” to a file in /var/lib/dcos/mesos/resource-providers on the node.
Check whether any of the resource provider configurations in that directory relate to the problematic provider:
If none of the resource provider configurations match the provider, DSS did not succeed in instructing Mesos to create the resource provider configuration. Network connectivity problems, insufficient IAM permissions for the DC/OS service account that DSS runs as, or Mesos issues are all good avenues for further investigation.
However, if a resource provider configuration exists and matches the provider, the Mesos agent is attempting to launch a CSI plugin for the provider, and further investigation revolves around figuring out why the launch does not succeed.
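The directory check described above can be sketched as a small script to run on the agent node. It is a rough sketch under one assumption: that a matching configuration file mentions the provider name as plain text. The helper name configs_mentioning is hypothetical; only the directory path comes from this document.

```python
from pathlib import Path

# Directory named in this document; only meaningful on a DC/OS agent node.
RESOURCE_PROVIDER_DIR = Path("/var/lib/dcos/mesos/resource-providers")


def configs_mentioning(provider_name, directory=RESOURCE_PROVIDER_DIR):
    """Return sorted paths of regular files whose text contains provider_name."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    matches = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file (for example, permissions); skip it
        if provider_name in text:
            matches.append(path)
    return matches


if __name__ == "__main__":
    found = configs_mentioning("my-provider-1")
    if not found:
        print("no resource provider configuration mentions the provider")
    for path in found:
        print(path)
```

An empty result corresponds to the first case above (DSS never got Mesos to create the configuration); a non-empty result points toward the plugin launch itself.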
To see the logs generated by the crashing csilvm plugin instance, refer to the section How to get the logs of an ‘lvm’ volume provider.
How to determine the remaining capacity for each volume profile
The following command shows the capacity of each profile on every node:
I issued a ‘volume create’ but the command timed out and the volume stays stuck in ‘PENDING’
You might see the following error message when issuing the dcos storage volume create command:
This means that the DC/OS Storage Service is still processing the request but the CLI has timed out. You can reissue the same command. You can see your operation and track its progress using dcos storage volume list.
For example, when creating a volume with name my-volume-1, it will display as PENDING in the volume list until it has been fully created.
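If you want to script the wait rather than rerun the list command by hand, a polling loop with a deadline is one option. The sketch below is deliberately independent of the CLI: fetch_state is a hypothetical callable that you would implement yourself (for example, by reading the volume's state from the volume list output), and the default timeout and interval are arbitrary.

```python
import time


def wait_until_not_pending(fetch_state, timeout_s=300.0, interval_s=5.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it returns something other than "PENDING".

    fetch_state: hypothetical callable returning the volume's current state.
    Returns the final state, or None if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state != "PENDING":
            return state
        if clock() >= deadline:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated sequence of states instead of a real cluster query.
    states = iter(["PENDING", "PENDING", "ONLINE"])
    result = wait_until_not_pending(lambda: next(states), sleep=lambda _: None)
    print(result)  # ONLINE
```

The sleep and clock parameters are injectable so the loop can be exercised without real delays.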
View the current status of the volume in the status.report field of the volume list command’s JSON output:
If the volume stays stuck in PENDING status, work through the following checks:
- Check if all lvm providers are ONLINE:
  If an lvm provider is not online, it won’t offer any storage pool to the DC/OS Storage Service. Refer to the section I created an ‘lvm’ volume provider but it never comes ‘ONLINE’ for troubleshooting lvm providers.
- Check if there is sufficient capacity for the given profile:
  Refer to the section How to determine the remaining capacity for each volume profile to determine if there is a sufficiently large storage pool for the volume. When nodes are not specified at the time volumes are created, the DC/OS Storage Service can suboptimally allocate space among storage pools. As a result, one or more storage pools may become fragmented. To reduce fragmentation, consider specifying the --node flag when creating volumes. If a storage pool is not shown as expected, check the agent log for further details.
- Examine the Storage-Details Grafana dashboard to look for anomalies in the DC/OS Storage Service.
  Refer to the section How to monitor the DC/OS Storage Service for the Grafana dashboards. Specifically, the Storage-Details dashboard monitors how many offers are processed by the DC/OS Storage Service, as well as other health metrics. If there is anything abnormal, the DC/OS Storage Service log may provide more details:
You can issue a dcos storage volume remove command to cancel an ongoing volume creation if the DC/OS Storage Service has not yet picked an appropriate storage pool.
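To illustrate the fragmentation point from the capacity check above: the total free space across a cluster can exceed the requested size while no single storage pool can hold the volume. The sketch below assumes that a volume must fit within one storage pool, as the capacity check implies; the pool names and sizes are hypothetical.

```python
def pool_that_fits(pools, size):
    """Return the name of the smallest pool with at least `size` free, or None.

    pools: mapping of pool name -> free capacity, in the same unit as size.
    """
    candidates = [(free, name) for name, free in pools.items() if free >= size]
    return min(candidates)[1] if candidates else None


# Hypothetical free capacity per storage pool, in GiB.
pools = {"node-a": 40, "node-b": 35, "node-c": 30}

print(sum(pools.values()))        # 105
print(pool_that_fits(pools, 50))  # None
print(pool_that_fits(pools, 32))  # node-b
```

Here 105 GiB is free in total, yet a 50 GiB request cannot be placed, which is the situation a volume stuck in PENDING for capacity reasons would reflect. Picking the smallest pool that fits is only one possible placement policy and is not a description of how the DC/OS Storage Service chooses pools.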
How to find which task uses my volume
The following command shows the reservation of each volume:
In the above example, test-volume-1 is used by the test-app Marathon app, data-service-volume-1 is used by the beta-hdfs data service, and data-service-volume-2 is used by the beta-elastic data service.
My volume is ‘ONLINE’ but my service does not run
There are several possibilities if a service using the volume is not running:
- No task is ever launched for the service because the volume is not offered to the service. It is possible that the volume has been offered to, and taken by, another task. To determine if another task has taken control of the volume, refer to the section How to find which task uses my volume.
- If the volume has not been taken by any other task but the service still cannot launch the task, check if the task has any placement constraints, and if the volume resides on a node that meets those constraints. If not, recreate the volume on a suitable node using the --node flag.
- The service launched a task but then the task failed with the following message:
  This means that the csilvm volume plugin has a problem mounting the volume. To investigate what leads to the mount failures, refer to the section How to get the logs of an ‘lvm’ volume provider and analyze the volume plugin log and the SLRP log.
After an agent changes its Mesos ID, some pods are missing in my data service
If the agent loses its metadata (e.g., due to the removal of its /var/lib/mesos/slave/meta/slaves/latest symlink) and rejoins the cluster, Mesos will treat it as a new agent and assign a new Mesos ID. As a result, local volumes created on the agent (with the old Mesos ID) become stale:
If this happens, here are the steps to bring the data service back online.
- Recover the devices volume provider on the agent (with the old Mesos ID) that is in RECOVERY:
  Note that the devices-1 provider is now associated with the new Mesos ID after recovery.
- Recover all devices on the agent (with the new Mesos ID) that are in RECOVERY:
- Recover the lvm volume provider on the agent (with the old Mesos ID) that is in RECOVERY:
  The lvm-1 provider is now associated with the new Mesos ID after recovery.
- Remove the stale volume to free up the disk space:
  This step will deprovision the volume and clean up the data it stores to ensure no data leakage.
- Recreate a new volume for the data service:
- Replace the missing pod so the data service will create a new pod instance to restore data back to the new volume:
I issued a ‘volume remove’ but the command timed out and the volume stays stuck in ‘REMOVING’
You might see the following error message when issuing the dcos storage volume remove command:
This means that the DC/OS Storage Service is still processing your request but the CLI has timed out. You can see your operation and track its progress using dcos storage volume list.
If the volume stays stuck in REMOVING, it is possible that the volume is being used by another service. Refer to the section How to find which task uses my volume to find out which service is using the volume.
Normally, once the service is removed, the volume is unreserved, and the DC/OS Storage Service resumes the volume removal once it receives the unreserved volume.
If the volume is not in use and unreserved, but still stuck in REMOVING, examine the Storage-Details Grafana dashboard to look for anomalies in the DC/OS Storage Service. If there is anything abnormal, the DC/OS Storage Service log may provide more details:
I cannot remove an ‘lvm’ volume provider
The DC/OS Storage Service cannot remove an lvm volume provider unless all of its volumes have been removed. Before removing an lvm volume provider, you must remove all of its volumes.