Tackling misconfigurations with Datree.io
We are going to learn how to combat K8s misconfigurations with Datree.io and make policy checks easy and automated before deploying to production!🤔
Before we discuss how we solved our misconfiguration problems, you should know what Datree does.
What is Datree?
Datree is a CLI tool that lets you validate your Kubernetes YAML files against the Kubernetes schema and scan them for misconfigurations. It also lets you create customized policies and rules using a SaaS dashboard or via a policy-as-code feature, which helps with the security and stability of your Kubernetes configuration. You can include Datree's policy check in your CI/CD pipeline and also run it locally before every commit, before anything is deployed to production. The CLI tool is open source, so it can be supported by the Kubernetes community.
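For example, a local check before every commit can be as simple as installing the CLI and pointing it at a manifest. This is only a minimal sketch: the manifest path below is a placeholder, and the install one-liner and command flags should be verified against the current Datree documentation.

$ curl https://get.datree.io | /bin/bash
$ datree test ./k8s/deployment.yaml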
What is Policy-as-code?
Policy-as-code, similar to Infrastructure-as-code, is the concept of using declarative code to replace actions that would otherwise require a user interface. By representing policies in code, proven software development best practices can be adopted, such as version control, collaboration, and automation. The way it works is that once Policy-as-code (PaC) mode is enabled, the only way to change the policies in your account is by publishing a YAML configuration file (policy.yaml) with the defined policies.
You can use this feature by turning it on from your profile settings and downloading the policies.yaml file.
Why did we choose Datree?
Our project, built alongside my teammates Snehomoy Maitra⭐ and Danil Vagapov⭐, was centered around DevOps production best practices. We used a Python Flask API along with an SQLite database to store the error logs generated during deployment.
This is where Datree comes into the picture: it analyzes the YAML files in their entirety and automatically runs the custom policy checks, acting as the CI gate for our K8s pipeline. It also provides a descriptive error message along with the exact point of failure, so it was much easier for us to store those logs in our database with very little modification, which made the results more user-friendly. The custom policy integration was pretty seamless, as we just had to specify our custom policies inside the policy.yaml, pass.yaml and fail.yaml files, and the K8s manifest files were automatically passed to Datree as arguments. A sketch of how this can be wired together is shown below.
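As a rough illustration of that flow (not the project's exact pipeline: the manifest glob, Flask endpoint and port are hypothetical, and the --output flag should be checked against your CLI version), a CI step could look like this:

# Run the policy check, keep a machine-readable report, and forward it to our Flask log API
datree test ./k8s/*.yaml --output json > datree-report.json || true
curl -X POST -H "Content-Type: application/json" --data @datree-report.json http://localhost:5000/logs

The || true keeps the step from aborting on a failed check so the report is still recorded; whether the pipeline should then stop is a separate gate.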
Rules of Datree policies and customization
We had some additional custom policies that we wanted to add on top of the default Datree policies to further improve the stability and security of our Kubernetes manifests.
But before that, here is the default state of the custom rules, that is, the policy.yaml which is downloaded from Datree before you customize it.
apiVersion: v1
customRules: null
policies:
- name: Dev_Policy
rules:
# - identifier: CONTAINERS_MISSING_IMAGE_VALUE_VERSION
# messageOnFailure: Incorrect value for key `image` - specify an image version to avoid unpleasant "version surprises" in the future
# - identifier: CONTAINERS_MISSING_MEMORY_REQUEST_KEY
# messageOnFailure: Missing property object `requests.memory` - value should be within the accepted boundaries recommended by the organization
- identifier: CONTAINERS_MISSING_CPU_REQUEST_KEY
messageOnFailure: Missing property object `requests.cpu` - value should be within the accepted boundaries recommended by the organization
# - identifier: CONTAINERS_MISSING_MEMORY_LIMIT_KEY
# messageOnFailure: Missing property object `limits.memory` - value should be within the accepted boundaries recommended by the organization
# - identifier: CONTAINERS_MISSING_CPU_LIMIT_KEY
# messageOnFailure: Missing property object `limits.cpu` - value should be within the accepted boundaries recommended by the organization
# - identifier: INGRESS_INCORRECT_HOST_VALUE_PERMISSIVE
# messageOnFailure: Incorrect value for key `host` - specify host instead of using a wildcard character ("*")
# - identifier: SERVICE_INCORRECT_TYPE_VALUE_NODEPORT
# messageOnFailure: Incorrect value for key `type` - `NodePort` will open a port on all nodes where it can be reached by the network external to the cluster
# - identifier: CRONJOB_INVALID_SCHEDULE_VALUE
# messageOnFailure: 'Incorrect value for key `schedule` - the (cron) schedule expressions is not valid and, therefore, will not work as expected'
# - identifier: WORKLOAD_INVALID_LABELS_VALUE
# messageOnFailure: Incorrect value for key(s) under `labels` - the values syntax is not valid so the Kubernetes engine will not accept it
# - identifier: WORKLOAD_INCORRECT_RESTARTPOLICY_VALUE_ALWAYS
# messageOnFailure: Incorrect value for key `restartPolicy` - any other value than `Always` is not supported by this resource
# - identifier: HPA_MISSING_MINREPLICAS_KEY
# messageOnFailure: Missing property object `minReplicas` - the value should be within the accepted boundaries recommended by the organization
# - identifier: HPA_MISSING_MAXREPLICAS_KEY
# messageOnFailure: Missing property object `maxReplicas` - the value should be within the accepted boundaries recommended by the organization
As you can see, we have a list of rules under the policies tag. You can add a custom rule by creating a customRules tag that includes the following information:
- name [OPTIONAL] - this will be the title of the failed policy rule
- identifier - a unique ID to associate with a policy
- defaultMessageOnFailure - this message will be used when the messageOnFailure tag in the policy.yaml file for a particular rule is empty
- schema - the custom rule logic, written in JSON Schema (as YAML)
A minimal example of such an entry is shown below.
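For instance, a small hypothetical entry that only requires every resource to set metadata.namespace could look like this (the identifier and messages are made up for illustration):

customRules:
  - identifier: CUSTOM_WORKLOAD_MISSING_NAMESPACE
    name: Ensure each resource has a namespace [CUSTOM RULE]
    defaultMessageOnFailure: Add metadata.namespace
    schema:
      properties:
        metadata:
          required:
            - namespace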
Now, to change the policies in your account you will need to update the policies configuration YAML file and publish it.
datree publish policy.yaml
Once a new policy configuration file is published, it will override the existing policies set up in your account.
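Because policy.yaml now lives in version control, the publish step can itself be automated. Below is a minimal sketch, assuming the account token has already been configured for the CLI (for example with datree config set token; check the docs for the exact command):

# Re-publish the policy configuration whenever policy.yaml changed in the last commit
if ! git diff --quiet HEAD~1 -- policy.yaml; then
  datree publish policy.yaml
fi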
Now, for our custom policy.yaml, every custom policy rule that you see is part of the Kubernetes production best practices. The custom policy rules are as follows✅:
- CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
- CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
- CUSTOM_CONTAINERS_MISSING_READINESSPROBE
- CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
- CUSTOM_CONTAINERS_MIN_REPLICAS
- CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
- CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
- CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
- CUSTOM_CONTAINERS_TECHNICAL_LABELS
- CUSTOM_CONTAINERS_BUSINESS_LABELS
- CUSTOM_CONTAINERS_SECURITY_LABELS
- CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA
Hence, the policy.yaml:
apiVersion: v1
policies:
- name: production_best_practices
isDefault: true
rules:
- identifier: CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_MISSING_READINESSPROBE
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_MIN_REPLICAS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_TECHNICAL_LABELS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_BUSINESS_LABELS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_SECURITY_LABELS
messageOnFailure: ''
- identifier: CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA
messageOnFailure: ''
customRules:
## METADATA.OWNERREFERENCES == REQUIRED
- identifier: CUSTOM_CONTAINERS_PODS_MISSING_OWNERS
name: Ensure each pod has an owner (ReplicaSet, StatefulSet or DaemonSet) [CUSTOM RULE]
defaultMessageOnFailure: Delete standalone Pod
schema:
if:
properties:
kind:
enum:
- Pod
then:
properties:
metadata:
properties:
ownerReferences:
properties:
kind:
enum:
- ReplicaSet
- StatefulSet
- DaemonSet
required:
- ownerReferences
## SPEC.ITEMS.LIVENESSPROBE == REQUIRED
- identifier: CUSTOM_CONTAINERS_MISSING_LIVENESSPROBE
name: Ensure each container has a configured liveness probe [CUSTOM RULE]
defaultMessageOnFailure: Add liveness probe
schema:
definitions:
specContainers:
properties:
spec:
properties:
containers:
items:
required:
- livenessProbe
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## SPEC.ITEMS.READINESSPROBE == REQUIRED
- identifier: CUSTOM_CONTAINERS_MISSING_READINESSPROBE
name: Ensure each container has a configured readiness probe [CUSTOM RULE]
defaultMessageOnFailure: Add readinessProbe
schema:
definitions:
specContainers:
properties:
spec:
properties:
containers:
items:
required:
- readinessProbe
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## SPEC.ITEMS.IMAGE.TAG != LATEST|EMPTY
- identifier: CUSTOM_CONTAINERS_MISSING_IMAGE_TAG
name: Ensure each container image has a pinned (tag) version [CUSTOM RULE]
defaultMessageOnFailure: Set image version
schema:
definitions:
specContainers:
properties:
spec:
properties:
containers:
type: array
items:
properties:
image:
type: string
pattern: ^(?=.*[:|@](?=.+)(?!latest)).*$
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## SPEC.REPLICAS > 1
- identifier: CUSTOM_CONTAINERS_MIN_REPLICAS
name: Ensure Deployment or StatefulSet has replicas set between 2-10 [CUSTOM RULE]
defaultMessageOnFailure: Running 2 or more replicas will increase the availability of the service
schema:
if:
properties:
kind:
enum:
- Deployment
- StatefulSet
then:
properties:
spec:
properties:
replicas:
minimum: 2
maximum: 10
required:
- replicas
## SPEC.AFFINITY.PODANTIAFFINITY == REQUIRED
- identifier: CUSTOM_CONTAINERS_MISSING_PODANTIAFFINITY
name: Ensure each container has a podAntiAffinity [CUSTOM RULE]
defaultMessageOnFailure: You should apply anti-affinity rules to your Deployments and StatefulSets so that Pods are spread across all the nodes of your cluster.
schema:
if:
properties:
kind:
enum:
- Deployment
- StatefulSet
then:
properties:
spec:
properties:
affinity:
required:
- podAntiAffinity
## SPEC.CONTAINERS.RESOURCES [REQUESTS, LIMITS]
- identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_AND_LIMITS
name: Ensure each container has requests and limits resources defined [CUSTOM RULE]
defaultMessageOnFailure: Without them, an unlimited number of Pods is schedulable on any node, leading to resource overcommitment and potential node (and kubelet) crashes.
schema:
definitions:
specContainers:
properties:
spec:
properties:
containers:
items:
properties:
resources:
required:
- requests
- limits
required:
- resources
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## SPEC.CONTAINERS.RESOURCES.REQUESTS.CPU <= 1000m
- identifier: CUSTOM_CONTAINERS_RESOURCES_REQUESTS_CPU_BELOW_1000M
name: Ensure each container has a configured CPU request <= 1000m [CUSTOM RULE]
defaultMessageOnFailure: Unless you have computational intensive jobs, it is recommended to set the request to 1 CPU or below.
schema:
definitions:
specContainers:
properties:
spec:
properties:
containers:
items:
properties:
resources:
properties:
requests:
properties:
cpu:
resourceMaximum: 1000m
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## *.METADATA.LABELS == REQUIRED ALL [name, instance, version, component, part-of, managed-by]
- identifier: CUSTOM_CONTAINERS_TECHNICAL_LABELS
name: Ensure each container has technical labels defined [CUSTOM RULE]
defaultMessageOnFailure: Those labels [name, instance, version, component, part-of, managed-by] are recommended by the official documentation.
schema:
definitions:
specContainers:
properties:
metadata:
properties:
labels:
required:
- app.kubernetes.io/name
- app.kubernetes.io/instance
- app.kubernetes.io/version
- app.kubernetes.io/component
- app.kubernetes.io/part-of
- app.kubernetes.io/managed-by
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## *.METADATA.LABELS == REQUIRED ALL [owner, project, business-unit]
- identifier: CUSTOM_CONTAINERS_BUSINESS_LABELS
name: Ensure each container has business labels defined [CUSTOM RULE]
defaultMessageOnFailure: You can explore labels and tagging for resources on the AWS tagging strategy page.
schema:
definitions:
specContainers:
properties:
metadata:
properties:
labels:
required:
- owner
- project
- business-unit
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## *.METADATA.LABELS == REQUIRED ALL [confidentiality, compliance]
- identifier: CUSTOM_CONTAINERS_SECURITY_LABELS
name: Ensure each container has security labels defined [CUSTOM RULE]
defaultMessageOnFailure: You can explore labels and tagging for resources on the AWS tagging strategy page.
schema:
definitions:
specContainers:
properties:
metadata:
properties:
labels:
required:
- confidentiality
- compliance
allOf:
- $ref: '#/definitions/specContainers'
additionalProperties:
$ref: '#'
items:
$ref: '#'
## APIVERSION != [*beta*, *alpha* ]
- identifier: CUSTOM_CONTAINERS_RESTRICT_ALPHA_BETA
name: Ensure each container has restricted access to alpha or beta features [CUSTOM RULE]
defaultMessageOnFailure: Alpha and beta Kubernetes features are in active development and may have limitations or bugs that result in security vulnerabilities.
schema:
properties:
apiVersion:
type: string
pattern: ^((?!alpha|beta).)*$
Make changes in the test files
To check whether the custom policy rules are working or not, we created two YAML files: pass.yaml and fail.yaml.
For pass.yaml:
$ vi ~/.datree/pass.yaml
This is what the configuration file looks like. In this case, there would not be any rule violation.
apiVersion: v1
kind: Pod
metadata:
name: pass-policy
labels:
app.kubernetes.io/name: pass-policy
app.kubernetes.io/instance: pass-policy-5fa65d2
app.kubernetes.io/version: "42"
app.kubernetes.io/component: api
app.kubernetes.io/part-of: payment-gateway
app.kubernetes.io/managed-by: helm
owner: payment-team
project: fraud-detection
business-unit: "80432"
confidentiality: official
compliance: pci
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
topologyKey: topology.kubernetes.io/zone
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
readinessProbe:
httpGet:
path: /readnessprobe
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
resources:
limits:
cpu: 500m
memory: 4Gi
requests:
cpu: 200m
memory: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
creationTimestamp: null
name: test-deploy
labels:
app: test-deploy
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
app.kubernetes.io/version: "42"
app.kubernetes.io/component: api
app.kubernetes.io/part-of: test-deploy
app.kubernetes.io/managed-by: kubectl
owner: payment-team
project: fraud-detection
business-unit: "80432"
confidentiality: official
compliance: pci
spec:
replicas: 2
selector:
matchLabels:
app: test-deploy
strategy: {}
template:
metadata:
creationTimestamp: null
labels:
app: test-deploy
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
app.kubernetes.io/version: "42"
app.kubernetes.io/component: api
app.kubernetes.io/part-of: test-deploy
app.kubernetes.io/managed-by: kubectl
owner: payment-team
project: fraud-detection
business-unit: "80432"
confidentiality: official
compliance: pci
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- image: nginx:1.14.2
name: nginx
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 500m
memory: 2Gi
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
readinessProbe:
httpGet:
path: /readnessprobe
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
labels:
app: nginx
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
app.kubernetes.io/version: "42"
app.kubernetes.io/component: api
app.kubernetes.io/part-of: test-deploy
app.kubernetes.io/managed-by: kubectl
owner: payment-team
project: fraud-detection
business-unit: "80432"
confidentiality: official
compliance: pci
spec:
selector:
matchLabels:
app: nginx
serviceName: "nginx"
replicas: 3
template:
metadata:
labels:
app: nginx
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
app.kubernetes.io/version: "42"
app.kubernetes.io/component: api
app.kubernetes.io/part-of: test-deploy
app.kubernetes.io/managed-by: kubectl
owner: payment-team
project: fraud-detection
business-unit: "80432"
confidentiality: official
compliance: pci
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 1
memory: 2Gi
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
readinessProbe:
httpGet:
path: /readnessprobe
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
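With the custom policy published (and marked as the default via isDefault: true), this file can be checked locally; since every rule is satisfied, the policy check is expected to come back clean:

$ datree test ~/.datree/pass.yaml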
For fail.yaml:
$ vi ~/.datree/fail.yaml
This is what the configuration file looks like. In this case, there are some rule violations.
apiVersion: apps/v1
kind: Pod
metadata:
name: fail-environment-label
labels:
environment: qa
app: test
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
creationTimestamp: null
labels:
app: test-deploy
name: test-deploy
spec:
replicas: 1
selector:
matchLabels:
app: test-deploy
strategy: {}
template:
metadata:
creationTimestamp: null
labels:
app: test-deploy
spec:
containers:
- image: nginx
name: nginx
resources: {}
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: web
labels:
app: nginx
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
confidentiality: official
compliance: pci
spec:
selector:
matchLabels:
app: nginx
serviceName: "nginx"
replicas: 3
template:
metadata:
labels:
app: nginx
app.kubernetes.io/name: test-deploy
app.kubernetes.io/instance: test-deploy-5fa65d2
confidentiality: official
compliance: pci
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
readinessProbe:
httpGet:
path: /readnessprobe
port: 8080
initialDelaySeconds: 3
periodSeconds: 3
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
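Running the same check against this file should surface multiple violations (a stand-alone Pod without an owner, missing probes, missing required labels, a single replica, an untagged image, a beta apiVersion, and so on):

$ datree test ~/.datree/fail.yaml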
You can also check the policies through the Datree dashboard, where you can filter by your custom rules and check the history page for a detailed list of previous policy check executions.
Conclusion
A lot of our project execution depended on Datree and the custom policies we created. Finding misconfigurations in the Kubernetes manifest files before pushing to production became much easier for us, thanks to Datree.
The takeaway from this blog is to experiment with your own custom policies and also try out policy-as-code, as it is a great declarative way to represent your policies.
Co-edited by: Snehomoy Maitra
Resources
- Datree
- Our project (DevTool API) Devpost