Tackling misconfigurations with Datree.io

We are going to learn how to combat K8s misconfigurations with Datree.io and make policy checks easy and automated before deploying to production!🤔


Before we discuss how we solved our misconfiguration problems, you should know what Datree does.


What is Datree?

Datree is a CLI tool that validates your Kubernetes YAML files against their schema and scans them for misconfigurations. It also lets you create customized policies and rules, either through a SaaS dashboard or via a policy-as-code feature, which helps with the security and stability of your Kubernetes configuration. You can include Datree's policy check in your CI/CD pipeline and run it locally before every commit, well before anything is deployed to production. The CLI tool is open source, which allows it to be supported by the Kubernetes community.
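
To give a quick sense of the workflow, this is roughly what a local run looks like. The install one-liner is the one from Datree's documentation at the time of writing, and the manifest path is just a placeholder:

# Install the Datree CLI
curl https://get.datree.io | /bin/bash

# Scan a manifest for misconfigurations before committing it
datree test path/to/deployment.yaml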

What is Policy-as-code?

Policy-as-code, similar to Infrastructure-as-code, is the concept of using declarative code to replace actions that would otherwise require a user interface. By representing policies in code, proven software development best practices such as version control, collaboration, and automation can be adopted. Once Policy-as-code (PaC) mode is enabled, the only way to change the policies in your account is by publishing a YAML configuration file (policy.yaml) with the defined policies.

You can use this feature by turning it on in your profile settings and downloading the policy.yaml file.


Why did we choose Datree?

Our project, built alongside my team-mates Snehomoy Maitra⭐ and Danil Vagapov⭐, was centered around DevOps production best practices. We used a Python Flask API along with an SQLite database to store the error logs generated at deployment time. This is where Datree comes into the picture: it analyzes the YAML files in their entirety and runs the custom policy checks automatically, acting as the CI step for our K8s pipeline. It also provides a descriptive error message along with the exact point of failure, so we could store those logs in our database with very little modification, which made the whole flow more user-friendly. The custom policy integration was practically seamless: we just had to specify our custom policies in policy.yaml (with pass.yaml and fail.yaml as test files), and Datree automatically picked up the K8s manifest files passed to it as arguments.
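
For illustration, a check like this can be wired into CI with a job that installs the CLI and runs datree test against the manifests. The sketch below uses GitHub Actions purely as an example; the workflow file name, manifest path and DATREE_TOKEN secret are assumptions rather than our exact setup:

# .github/workflows/policy-check.yaml (hypothetical path)
name: datree-policy-check
on: [push, pull_request]

jobs:
  policy-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install the Datree CLI
        run: curl https://get.datree.io | /bin/bash
      - name: Run the policy check on the K8s manifests
        env:
          DATREE_TOKEN: ${{ secrets.DATREE_TOKEN }} # account token so results show up in the dashboard
        run: datree test k8s/*.yaml # hypothetical manifest location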

Rules of Datree policies and customization

We wanted to add some custom policies on top of the built-in Datree policies to further improve the stability and security of our Kubernetes manifests. But before that, here is the default state of the custom rules, that is, the policy.yaml downloaded from Datree before you customize it:

apiVersion: v1
customRules: null
policies:
  - name: Dev_Policy
    rules:
      # - identifier: CONTAINERS_MISSING_IMAGE_VALUE_VERSION
      #   messageOnFailure: Incorrect value for key `image` - specify an image version to avoid unpleasant "version surprises" in the future
      # - identifier: CONTAINERS_MISSING_MEMORY_REQUEST_KEY
      #   messageOnFailure: Missing property object `requests.memory` - value should be within the accepted boundaries recommended by the organization
      - identifier: CONTAINERS_MISSING_CPU_REQUEST_KEY
        messageOnFailure: Missing property object `requests.cpu` - value should be within the accepted boundaries recommended by the organization
      # - identifier: CONTAINERS_MISSING_MEMORY_LIMIT_KEY
      #   messageOnFailure: Missing property object `limits.memory` - value should be within the accepted boundaries recommended by the organization
      # - identifier: CONTAINERS_MISSING_CPU_LIMIT_KEY
      #   messageOnFailure: Missing property object `limits.cpu` - value should be within the accepted boundaries recommended by the organization
      # - identifier: INGRESS_INCORRECT_HOST_VALUE_PERMISSIVE
      #   messageOnFailure: Incorrect value for key `host` - specify host instead of using a wildcard character ("*")
      # - identifier: SERVICE_INCORRECT_TYPE_VALUE_NODEPORT
      #   messageOnFailure: Incorrect value for key `type` - `NodePort` will open a port on all nodes where it can be reached by the network external to the cluster
      # - identifier: CRONJOB_INVALID_SCHEDULE_VALUE
      #   messageOnFailure: 'Incorrect value for key `schedule` - the (cron) schedule expressions is not valid and, therefore, will not work as expected'
      # - identifier: WORKLOAD_INVALID_LABELS_VALUE
      #   messageOnFailure: Incorrect value for key(s) under `labels` - the values syntax is not valid so the Kubernetes engine will not accept it
      # - identifier: WORKLOAD_INCORRECT_RESTARTPOLICY_VALUE_ALWAYS
      #   messageOnFailure: Incorrect value for key `restartPolicy` - any other value than `Always` is not supported by this resource
      # - identifier: HPA_MISSING_MINREPLICAS_KEY
      #   messageOnFailure: Missing property object `minReplicas` - the value should be within the accepted boundaries recommended by the organization
      # - identifier: HPA_MISSING_MAXREPLICAS_KEY
      #   messageOnFailure: Missing property object `maxReplicas` - the value should be within the accepted boundaries recommended by the organization

As you can see, we have a list of rules under the policies key. You can add a custom rule by creating a customRules key that includes the following information (a minimal example follows the list):

  • name [OPTIONAL] - this will be the title of the failed policy rule
  • identifier - a unique ID to associate with a policy
  • defaultMessageOnFailure - this message is used when the messageOnFailure field of a particular rule in policy.yaml is empty
  • schema - a custom rule logic written in JSON Schema (as YAML)
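
For illustration, here is what a minimal custom rule using those fields could look like. The identifier and the rule itself (requiring an `owner` label) are hypothetical and only meant to show the structure:

customRules:
  - identifier: CUSTOM_EXAMPLE_OWNER_LABEL
    name: Ensure every resource has an owner label [CUSTOM RULE]
    defaultMessageOnFailure: Add an `owner` label under `metadata.labels`
    schema:
      properties:
        metadata:
          properties:
            labels:
              required:
                - owner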

Now, to change the policies in your account, you need to update the policy configuration YAML file and publish it:

datree publish policy.yaml

Once a new policy configuration file is published, it will override the existing policies set up in your account.

Now, for our custom policy.yaml, every custom policy rule that you see is part of Kubernetes production best practices. The custom policy rules are as follows✅:

  • CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
  • CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
  • CUSTOM_CONTAINERS_MISSING_READINESSPROBE
  • CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
  • CUSTOM_CONTAINERS_MIN_REPLICAS
  • CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
  • CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
  • CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
  • CUSTOM_CONTAINERS_TECHNICAL_LABELS
  • CUSTOM_CONTAINERS_BUSINESS_LABELS
  • CUSTOM_CONTAINERS_SECURITY_LABELS
  • CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA

This results in the following policy.yaml:

apiVersion: v1
policies:
  - name: production_best_practices
    isDefault: true
    rules:
      - identifier: CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_MISSING_READINESSPROBE
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_MIN_REPLICAS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_TECHNICAL_LABELS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_BUSINESS_LABELS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_SECURITY_LABELS
        messageOnFailure: ''
      - identifier: CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA
        messageOnFailure: ''

customRules:
  ## METADATA.OWNERREFERENCES == REQUIRED
  - identifier: CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
    name: Ensure each Pod has an owner (ReplicaSet, StatefulSet or DaemonSet) [CUSTOM RULE]
    defaultMessageOnFailure: Delete the standalone Pod
    schema: 
      if:
        properties:
          kind:
            enum:
              - Pod
      then:
        properties:
          metadata:
            properties:
              ownerReferences:
                properties:
                  kind:
                    enum:
                      - ReplicaSet
                      - StatefulSet
                      - DaemonSet
            required:
              - ownerReferences
  ## SPEC.ITEMS.LIVENESSPROBE == REQUIRED
  - identifier: CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
    name: Ensure each container has a configured liveness probe [CUSTOM RULE]
    defaultMessageOnFailure: Add liveness probe
    schema: 
      definitions:
        specContainers:
          properties:
            spec:
              properties:
                containers:
                  items:
                    required:
                      - livenessProbe
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## SPEC.ITEMS.READINESSPROBE == REQUIRED
  - identifier: CUSTOM_CONTAINERS_MISSING_READINESSPROBE
    name: Ensure each container has a configured readiness probe [CUSTOM RULE]
    defaultMessageOnFailure: Add readinessProbe
    schema: 
      definitions:
        specContainers:
          properties:
            spec:
              properties:
                containers:
                  items:
                    required:
                      - readinessProbe
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## SPEC.ITEMS.IMAGE.TAG != LATEST|EMPTY
  - identifier: CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
    name: Ensure each container image has a pinned (tag) version [CUSTOM RULE]
    defaultMessageOnFailure: Set image version
    schema: 
      definitions:
        specContainers:
          properties:
            spec:
              properties:
                containers:
                  type: array
                  items:
                    properties:
                      image:
                        type: string
                        pattern: ^(?=.*[:|@](?=.+)(?!latest)).*$
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## SPEC.REPLICAS > 1
  - identifier: CUSTOM_CONTAINERS_MIN_REPLICAS
    name: Ensure Deployment or StatefulSet has replicas set between 2-10 [CUSTOM RULE]
    defaultMessageOnFailure: Running 2 or more replicas will increase the availability of the service
    schema:
      if:
        properties:
          kind:
            enum:
              - Deployment
              - StatefulSet
      then:
        properties:
          spec:
            properties:
              replicas:
                minimum: 2
                maximum: 10
            required:
              - replicas
  ## SPEC.AFFINITY.PODANTIAFFINITY == REQUIRED
  - identifier: CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
    name: Ensure each container has a podAntiAffinity [CUSTOM RULE]
    defaultMessageOnFailure: You should apply anti-affinity rules to your Deployments and StatefulSets so that Pods are spread across all the nodes of your cluster.
    schema: 
      if:
        properties:
          kind:
            enum:
              - Deployment
              - StatefulSet
      then:
        properties:
          spec:
            properties:
              affinity:
                required:
                  - podAntiAffinity
  ## SPEC.CONTAINERS.RESOURCES [REQUESTS, LIMITS]
  - identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
    name: Ensure each container has a requests and limits resources [CUSTOM RULE]
    defaultMessageOnFailure: An unlimited number of Pods, if schedulable on any node, can lead to resource overcommitment and potential node (and kubelet) crashes.
    schema: 
      definitions:
        specContainers:
          properties:
            spec:
              properties:
                containers:
                  items:
                    properties:
                      resources:
                        required:
                          - requests
                          - limits
                    required:
                      - resources
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## SPEC.CONTAINERS.RESOURCES.REQUESTS.CPU <= 1000m
  - identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
    name: Ensure each container has a configured requests CPU <= 1000m [CUSTOM RULE]
    defaultMessageOnFailure: Unless you have computationally intensive jobs, it is recommended to set the request to 1 CPU or below.
    schema: 
      definitions:
        specContainers:
          properties:
            spec:
              properties:
                containers:
                  items:
                    properties:
                      resources:
                        properties:
                          requests:
                            properties:
                              cpu:
                                resourceMaximum: 1000m
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## *.METADATA.LABELS == REQUIRED ALL [name, instance, version, component, part-of, managed-by]
  - identifier: CUSTOM_CONTAINERS_TECHNICAL_LABELS
    name: Ensure each container has technical labels defined [CUSTOM RULE]
    defaultMessageOnFailure: Those labels [name, instance, version, component, part-of, managed-by] are recommended by the official documentation.
    schema: 
      definitions:
        specContainers:
          properties:
            metadata:
              properties:
                labels:
                  required:
                    - app.kubernetes.io/name
                    - app.kubernetes.io/instance
                    - app.kubernetes.io/version
                    - app.kubernetes.io/component
                    - app.kubernetes.io/part-of
                    - app.kubernetes.io/managed-by
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## *.METADATA.LABELS == REQUIRED ALL [owner, project, business-unit]
  - identifier: CUSTOM_CONTAINERS_BUSINESS_LABELS
    name: Ensure each container has business labels defined [CUSTOM RULE]
    defaultMessageOnFailure: You can explore labels and tagging for resources on the AWS tagging strategy page.
    schema: 
      definitions:
        specContainers:
          properties:
            metadata:
              properties:
                labels:
                  required:
                    - owner
                    - project
                    - business-unit
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## *.METADATA.LABELS == REQUIRED ALL [confidentiality, compliance]
  - identifier: CUSTOM_CONTAINERS_SECURITY_LABELS
    name: Ensure each container has security labels defined [CUSTOM RULE]
    defaultMessageOnFailure: You can explore labels and tagging for resources on the AWS tagging strategy page.
    schema: 
      definitions:
        specContainers:
          properties:
            metadata:
              properties:
                labels:
                  required:
                    - confidentiality
                    - compliance
      allOf:
        - $ref: '#/definitions/specContainers'
      additionalProperties:
        $ref: '#'
      items:
        $ref: '#'
  ## APIVERSION != [*beta*, *alpha* ]
  - identifier: CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA
    name: Ensure restricted access to alpha or beta features [CUSTOM RULE]
    defaultMessageOnFailure: Alpha and beta Kubernetes features are in active development and may have limitations or bugs that result in security vulnerabilities.
    schema: 
      properties:
        apiVersion:
          type: string
          pattern: ^((?!alpha|beta).)*$

Make changes in the test files

To check whether the custom policy rules are working, we created two YAML files: pass.yaml and fail.yaml.

For pass.yaml:

$ vi ~/.datree/pass.yaml

This is what the configuration file looks like. In this case, there should not be any rule violations.

apiVersion: v1
kind: Pod
metadata:
  name: pass-policy
  labels:
    app.kubernetes.io/name: pass-policy
    app.kubernetes.io/instance: pass-policy-5fa65d2
    app.kubernetes.io/version: "42"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: payment-gateway
    app.kubernetes.io/managed-by: helm
    owner: payment-team
    project: fraud-detection
    business-unit: "80432"
    confidentiality: official
    compliance: pci
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 3
    readinessProbe:
      httpGet:
        path: /readnessprobe
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 3
    resources:
      limits:
        cpu: 500m
        memory: 4Gi
      requests:
        cpu: 200m
        memory: 2Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  name: test-deploy
  labels:
    app: test-deploy
    app.kubernetes.io/name: test-deploy
    app.kubernetes.io/instance: test-deploy-5fa65d2
    app.kubernetes.io/version: "42"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: test-deploy
    app.kubernetes.io/managed-by: kubectl
    owner: payment-team
    project: fraud-detection
    business-unit: "80432"
    confidentiality: official
    compliance: pci
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-deploy
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test-deploy
        app.kubernetes.io/name: test-deploy
        app.kubernetes.io/instance: test-deploy-5fa65d2
        app.kubernetes.io/version: "42"
        app.kubernetes.io/component: api
        app.kubernetes.io/part-of: test-deploy
        app.kubernetes.io/managed-by: kubectl
        owner: payment-team
        project: fraud-detection
        business-unit: "80432"
        confidentiality: official
        compliance: pci
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - image: nginx:1.14.2
        name: nginx
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: /readnessprobe
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  labels:
    app: nginx
    app.kubernetes.io/name: test-deploy
    app.kubernetes.io/instance: test-deploy-5fa65d2
    app.kubernetes.io/version: "42"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: test-deploy
    app.kubernetes.io/managed-by: kubectl
    owner: payment-team
    project: fraud-detection
    business-unit: "80432"
    confidentiality: official
    compliance: pci
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
        app.kubernetes.io/name: test-deploy
        app.kubernetes.io/instance: test-deploy-5fa65d2
        app.kubernetes.io/version: "42"
        app.kubernetes.io/component: api
        app.kubernetes.io/part-of: test-deploy
        app.kubernetes.io/managed-by: kubectl
        owner: payment-team
        project: fraud-detection
        business-unit: "80432"
        confidentiality: official
        compliance: pci
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 1
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: /readnessprobe
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

For fail.yaml:

$ vi ~/.datree/fail.yaml

This is what the configuration file looks like. In this case, there are some rule violations.

apiVersion: apps/v1
kind: Pod
metadata:
  name: fail-environment-label
  labels:
    environment: qa
    app: test
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: test-deploy
  name: test-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deploy
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test-deploy
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
---

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
  labels:
    app: nginx
    app.kubernetes.io/name: test-deploy
    app.kubernetes.io/instance: test-deploy-5fa65d2
    confidentiality: official
    compliance: pci
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
        app.kubernetes.io/name: test-deploy
        app.kubernetes.io/instance: test-deploy-5fa65d2
        confidentiality: official
        compliance: pci
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        readinessProbe:
          httpGet:
            path: /readnessprobe
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 3
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi
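
With the custom policy published, both files can be checked locally. pass.yaml should come back clean, while fail.yaml should report the violated rules along with their failure messages (the exact output format depends on the CLI version):

$ datree test ~/.datree/pass.yaml
$ datree test ~/.datree/fail.yaml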

You can also review the policy checks through the Datree dashboard, where you can filter by your custom rules and open the history page for a detailed list of previous runs.

Conclusion

A lot of our project execution depended on Datree and the custom policies we built. Thanks to Datree, finding misconfigurations in our Kubernetes manifest files before pushing to production became much easier.

The takeaway from this blog is to experiment with your own custom policies and to try out policy-as-code, as it is a great declarative way to represent your policies.

Co-edited by: Snehomoy Maitra

Resources

  • Datree
      • Docs
      • GitHub
      • Tutorial
  • Our project (DevTool API)
      • Devpost
      • GitHub
      • Custom Policies