
Troubleshoot seccomp profile configuration in Azure Kubernetes Service

Secure computing (seccomp) is a Linux kernel feature that restricts the system calls (syscalls) that containers can make, which enhances the security of containerized workloads. In Azure Kubernetes Service (AKS), the containerd runtime used by AKS nodes natively supports seccomp. However, enabling a seccomp profile might cause AKS workloads to fail if the profile blocks syscalls that the workloads depend on. This article explains what seccomp profiles are, how they work, and how to troubleshoot them by using the open source project Inspektor Gadget.

Background

Syscalls are the interface that user space programs use to request services from the kernel. Seccomp profiles specify which syscalls are allowed or denied for a specific container. For the default seccomp profile, AKS supports two values:

  • RuntimeDefault: Use the default seccomp profile provided by the container runtime (containerd).
  • Unconfined: All syscalls are allowed.

To enable seccomp on your AKS node pools, see Secure container access to resources using built-in Linux security features. To meet your workload's specific needs, you can also configure a custom profile. For details, see Configure a custom seccomp profile.

When you use seccomp profiles, test and validate their effect on your workloads. Some workloads need fewer syscall restrictions than others, so if a workload requires a syscall that the configured profile doesn't allow, the workload might fail at runtime.

This article shows how to use the open source project Inspektor Gadget to diagnose issues and gain visibility into blocked syscalls.

Symptoms

After you configure AKS workloads to use a seccomp profile, the workloads exit unexpectedly with one of the following errors:

  • permission denied

  • function not implemented

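Prerequisites

  • The Kubernetes command-line tool, kubectl, configured to connect to your AKS cluster.
  • The krew plugin manager for kubectl, which Step 2 uses to install the Inspektor Gadget plugin.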

Troubleshooting checklist

Step 1: Modify your seccomp profile

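Create a custom seccomp profile that matches the one you're troubleshooting, and replace each blocking action in it, such as SCMP_ACT_ERRNO, with SCMP_ACT_LOG. This applies both to the defaultAction field and to the action field of individual syscall rules. With SCMP_ACT_LOG, the kernel logs the syscalls that the profile would otherwise block and then allows them, so the workload keeps running while you collect data.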

Your custom seccomp profile might look like this:

{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
      {
        "names": ["acct",
                "add_key",
                "bpf",
                "clock_adjtime",
                "clock_settime",
                "clone",
                "create_module",
                "delete_module",
                "finit_module",
                "get_kernel_syms",
                "get_mempolicy",
                "init_module",
                "ioperm",
                "iopl",
                "kcmp",
                "kexec_file_load",
                "kexec_load",
                "keyctl",
                "lookup_dcookie",
                "mbind",
                "mount",
                "move_pages",
                "nfsservctl",
                "open_by_handle_at",
                "perf_event_open",
                "personality",
                "pivot_root",
                "process_vm_readv",
                "process_vm_writev",
                "ptrace",
                "query_module",
                "quotactl",
                "reboot",
                "request_key",
                "set_mempolicy",
                "setns",
                "settimeofday",
                "stime",
                "swapon",
                "swapoff",
                "sysfs",
                "_sysctl",
                "umount",
                "umount2",
                "unshare",
                "uselib",
                "userfaultfd",
                "ustat",
                "vm86",
                "vm86old"],
        "action": "SCMP_ACT_LOG"
      }
    ]
}
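If you start from an existing profile file, the action replacement described above can be scripted. The following Python sketch is one way to do it, not an official tool. It assumes the profile uses the JSON layout shown above (defaultAction, syscalls, names, action, plus the optional defaultErrnoRet and errnoRet fields from the OCI runtime specification), and the function name to_logging_profile is hypothetical.

```python
import json
import sys

# Actions that fail the syscall or kill/signal the caller. Rules that use them
# are switched to SCMP_ACT_LOG, which logs the syscall and then allows it.
BLOCKING_ACTIONS = {
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD",
    "SCMP_ACT_TRAP",
}


def to_logging_profile(profile: dict) -> dict:
    """Return a copy of a seccomp profile whose blocking actions log instead."""
    result = json.loads(json.dumps(profile))  # deep copy; the input is left unchanged
    if result.get("defaultAction") in BLOCKING_ACTIONS:
        result["defaultAction"] = "SCMP_ACT_LOG"
        # An errno return value is only meaningful for SCMP_ACT_ERRNO.
        result.pop("defaultErrnoRet", None)
    for rule in result.get("syscalls") or []:
        if rule.get("action") in BLOCKING_ACTIONS:
            rule["action"] = "SCMP_ACT_LOG"
            rule.pop("errnoRet", None)
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python to_log_profile.py original.json > my-profile.json
    with open(sys.argv[1]) as f:
        original = json.load(f)
    json.dump(to_logging_profile(original), sys.stdout, indent=2)
    print()
```

Rules that already use SCMP_ACT_ALLOW are left unchanged. SCMP_ACT_TRACE and SCMP_ACT_NOTIFY are also left as they are in this sketch, because they hand the decision to another process rather than failing the call directly.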

The article Configure a custom seccomp profile shows how you can apply your custom seccomp profile to your AKS cluster. Alternatively, you can follow these steps:

  1. Get the names of the nodes in your AKS cluster by running the following command:

    kubectl get nodes 
    
  2. Use the kubectl debug command to start a debug pod on the node. In the debug pod, make sure the seccomp folder exists and install the tar tool, which kubectl cp requires to copy the profile onto the node in step 4:

    kubectl debug node/<node-name> -it --image=mcr.microsoft.com/azurelinux/base/core:3.0
    root [ / ]# mkdir -p /host/var/lib/kubelet/seccomp
    root [ / ]# tdnf install -y tar
    
  3. Copy the pod name that the kubectl debug command prints. It looks like node-debugger-<node-name>-<random-suffix>. You can also find it by listing the pods in the default namespace.

  4. In another terminal, copy the seccomp profile from your local machine to the node through the debug pod:

    kubectl cp <local-path>/my-profile.json <pod-name>:/host/var/lib/kubelet/seccomp/my-profile.json
    

Note

Repeat the preceding steps for each node in the cluster to ensure that the seccomp profile is available on all nodes where your workload might run.
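A malformed profile file is likely to make container creation fail for any pod that references it, which can look like a different problem. Because the same file is copied to every node, it can help to check it once before you copy it. The following Python sketch performs a few basic structural checks. The function name check_profile is hypothetical, the action list is taken from the OCI runtime specification, and the sketch doesn't verify that syscall names exist on your node's architecture.

```python
import json
import sys

# Seccomp actions defined by the OCI runtime specification.
VALID_ACTIONS = {
    "SCMP_ACT_ALLOW",
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD",
    "SCMP_ACT_LOG",
    "SCMP_ACT_NOTIFY",
    "SCMP_ACT_TRACE",
    "SCMP_ACT_TRAP",
}


def _is_valid_action(value) -> bool:
    # Check the type first so unhashable values (lists, objects) don't raise.
    return isinstance(value, str) and value in VALID_ACTIONS


def check_profile(text: str) -> list:
    """Return a list of structural problems in a seccomp profile (empty if none found)."""
    try:
        profile = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(profile, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    if not _is_valid_action(profile.get("defaultAction")):
        problems.append(f"unknown or missing defaultAction: {profile.get('defaultAction')!r}")

    rules = profile.get("syscalls") or []
    if not isinstance(rules, list):
        return problems + ["'syscalls' must be a list"]
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"syscalls[{i}]: rule must be a JSON object")
            continue
        names = rule.get("names")
        if not (isinstance(names, list) and names and all(isinstance(n, str) for n in names)):
            problems.append(f"syscalls[{i}]: 'names' must be a non-empty list of strings")
        if not _is_valid_action(rule.get("action")):
            problems.append(f"syscalls[{i}]: unknown or missing action: {rule.get('action')!r}")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_profile.py my-profile.json
    with open(sys.argv[1]) as f:
        issues = check_profile(f.read())
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```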

Next, set the seccompProfile field of the target pod to the custom profile. The localhostProfile path is relative to the kubelet's seccomp directory, /var/lib/kubelet/seccomp, on the node. For example:

apiVersion: v1
kind: Pod
metadata:
  name: default-pod
  labels:
    app: default-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: my-profile.json
  containers:
  - name: test-container
    image: docker.io/library/nginx:latest

Step 2: Install Inspektor Gadget

Inspektor Gadget provides insights into the syscalls that your containers make. To use it, run the following commands to install the gadget kubectl plugin on your local machine and deploy Inspektor Gadget into the cluster:

kubectl krew install gadget
kubectl gadget deploy

For more information, see How to install Inspektor Gadget in an AKS cluster.

Step 3: Run the audit_seccomp gadget

With Inspektor Gadget installed, start the audit_seccomp gadget using the kubectl gadget run command:

kubectl gadget run audit_seccomp

Step 4: Analyze blocked syscalls

In another terminal, deploy your workload by running the kubectl apply -f command. While the workload runs, the audit_seccomp gadget reports the syscalls that the profile logged, that is, the syscalls that the original profile would block, along with their associated pods, containers, and processes. Use this information to identify the root causes of workload failures.

For example, if you run the default-pod pod defined earlier with the my-profile.json profile, the output looks similar to the following:

K8S.NODE                           K8S.NAMESPACE K8S.PODNAME  K8S.CONTAINERNAME  COMM                 PID      TID  CODE             SYSCALL
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     docker-entrypoi  3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     docker-entrypoi  3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     docker-entrypoi  3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     docker-entrypoi  3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     docker-entrypoi  3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996628  3996628  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996628  3996628  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996632  3996632  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996632  3996632  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996632  3996632  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996628  3996628  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     10-listen-on-ip  3996628  3996628  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     20-envsubst-on-  3996639  3996639  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     20-envsubst-on-  3996639  3996639  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     20-envsubst-on-  3996641  3996641  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     30-tune-worker-  3996643  3996643  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     nginx            3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
aks-nodepool1-38695788-vmss000002  default       default-pod  test-container     nginx            3996610  3996610  SECCOMP_RET_LOG  SYS_CLONE
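For longer captures, a short script can reduce output like the preceding example to the distinct syscalls per container. The following Python sketch assumes you saved the gadget output to a text file (for example, by redirecting the command's standard output) and that the columns match the example above. It takes the first four columns and the last column by position, so a COMM value that contains spaces doesn't shift the result. The function name summarize_audit_output is hypothetical.

```python
import sys
from collections import Counter


def summarize_audit_output(text: str) -> Counter:
    """Count logged syscalls per (namespace, pod, container, syscall)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything too short to be a data row
        # (NODE NAMESPACE POD CONTAINER COMM PID TID CODE SYSCALL = at least 9 columns).
        if len(fields) < 9 or fields[0] == "K8S.NODE":
            continue
        # Take the first four columns and the last one by position, so a COMM
        # value containing spaces doesn't shift the syscall column.
        _node, namespace, pod, container = fields[:4]
        counts[(namespace, pod, container, fields[-1])] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_audit.py audit.txt
    with open(sys.argv[1]) as f:
        summary = summarize_audit_output(f.read())
    for (namespace, pod, container, syscall), n in summary.most_common():
        print(f"{namespace}/{pod}/{container}: {syscall} x{n}")
```

Applied to the 18 data rows in the example above, this sketch collapses them into a single entry: SYS_CLONE for default/default-pod/test-container.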

The output indicates that test-container makes the SYS_CLONE (clone) syscall, which the original seccomp profile blocks. With this information, you can decide whether to permit the listed syscalls in your container. If you do, remove them from the blocked syscalls in your seccomp profile so that the workload no longer fails.
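Once you've decided which syscalls to permit, you can remove them from the blocking rules of your original profile. The following Python sketch assumes a deny-list profile like the example in Step 1, where defaultAction is SCMP_ACT_ALLOW. For an allowlist profile whose defaultAction blocks, removing a name from a blocking rule isn't enough; you would instead add it to an SCMP_ACT_ALLOW rule. The sketch also assumes the gadget labels syscalls as SYS_<NAME>, as in the example output. Both function names are hypothetical.

```python
import json


def gadget_name_to_profile_name(name: str) -> str:
    """Map a gadget syscall label such as 'SYS_CLONE' to a profile name such as 'clone'."""
    if name.startswith("SYS_"):
        name = name[len("SYS_"):]
    return name.lower()


def allow_syscalls(profile: dict, allowed: set) -> dict:
    """Return a copy of a deny-list profile with `allowed` removed from its blocking rules."""
    result = json.loads(json.dumps(profile))  # deep copy; the input is left unchanged
    kept_rules = []
    for rule in result.get("syscalls") or []:
        # Only edit non-ALLOW rules that use the "names" list.
        if rule.get("action") != "SCMP_ACT_ALLOW" and "names" in rule:
            rule["names"] = [n for n in rule["names"] if n not in allowed]
            if not rule["names"]:
                continue  # drop rules that no longer name any syscall
        kept_rules.append(rule)
    result["syscalls"] = kept_rules
    return result
```

For example, allow_syscalls(profile, {gadget_name_to_profile_name("SYS_CLONE")}) returns a profile in which clone is no longer listed in any blocking rule. Dropping rules that end up empty avoids leaving a rule with an empty names list behind.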

Here are some commonly blocked syscalls to watch out for. A more comprehensive list is available in Significant syscalls blocked by default profile.

Blocked syscall | Consideration
--- | ---
clock_settime or clock_adjtime | These syscalls set or adjust the system clock. If your workload needs to adjust the clock, for example for time synchronization, ensure they aren't blocked.
add_key or keyctl | If your workload requires key management, blocking these syscalls prevents containers from using the kernel keyring, which retains security data, authentication keys, encryption keys, and other data within the kernel.
clone | When blocked, this syscall can't be used to create new namespaces, except with CLONE_NEWUSER. Workloads that depend on creating new namespaces might be affected.
io_uring | The io_uring syscalls are blocked in the containerd 2.0 default profile but aren't blocked in the containerd 1.7 profile.
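To see which of these syscalls a given profile blocks, you can check the profile directly. The following Python sketch is an approximation: it treats a syscall as blocked if any rule with an action other than SCMP_ACT_ALLOW or SCMP_ACT_LOG names it, or, for a profile whose defaultAction blocks, if no allowing rule names it. It ignores argument filters, so a syscall such as clone that a profile restricts only for certain flags is reported as allowed if an allowing rule names it. The set and function names are hypothetical, and io_uring is expanded into its three syscalls: io_uring_setup, io_uring_enter, and io_uring_register.

```python
# Syscalls from the preceding table; io_uring is expanded into its three syscalls.
WATCHED_SYSCALLS = {
    "clock_settime", "clock_adjtime",
    "add_key", "keyctl",
    "clone",
    "io_uring_setup", "io_uring_enter", "io_uring_register",
}

# Actions under which the syscall still runs.
NON_BLOCKING_ACTIONS = {"SCMP_ACT_ALLOW", "SCMP_ACT_LOG"}


def watched_syscalls_blocked(profile: dict) -> set:
    """Approximate which watched syscalls the profile blocks (argument filters are ignored)."""
    allowed, blocked = set(), set()
    for rule in profile.get("syscalls") or []:
        names = set(rule.get("names") or [])
        if rule.get("action") in NON_BLOCKING_ACTIONS:
            allowed |= names
        else:
            blocked |= names
    result = WATCHED_SYSCALLS & blocked
    if profile.get("defaultAction") not in NON_BLOCKING_ACTIONS:
        # Allowlist profile: a syscall that no rule allows falls through to the blocking default.
        result |= WATCHED_SYSCALLS - allowed
    return result
```

With the logging profile from Step 1, this sketch reports nothing as blocked, because SCMP_ACT_LOG lets the syscalls run.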

Next steps

If your workloads have issues because of blocked syscalls, consider using a custom seccomp profile tailored to your application's needs. The Inspektor Gadget advise_seccomp gadget can help you generate such a profile from the syscalls that your workload actually makes.

Testing and refining seccomp profiles helps keep AKS workloads both secure and reliable. For further assistance, see Secure computing.

For more information, see Secure container access to resources using built-in Linux security features.

Third-party information disclaimer

The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.