Kubernetes Services and Iptables
0 Prerequisites and Prework
This post focuses only on how Kubernetes leverages iptables to implement its Service model. Per the official doc Services:
A Kubernetes Service is an abstraction which defines a logical set of Pods and a policy by which to access them - sometimes called a micro-service.
Put simply, a Service represents a TCP or UDP load-balanced service. Because it is load balanced, it must use destination NAT (DNAT) to redirect inbound traffic to back-end pods, and it relies on iptables from the Linux OS to do that job. When a service gets created, kube-proxy (a DaemonSet) injects a set of iptables chains & rules into each agent node. Accessing a Kubernetes service therefore relies on DNAT; additionally, if the source IP is from an external network (not in the Pod IP CIDR), the source IP will also be SNAT-ed.
This post presumes the reader is familiar with iptables; if not, Iptables Tutorial 1.2.2 is a good article that explains iptables in detail and is worth reading.
This post will not cover pod-to-pod communication as it is not in scope; pod-to-pod communication relies on IP routing, and Cluster Networking provides some background:
all containers can communicate with all other containers without NAT
all nodes can communicate with all containers (and vice-versa) without NAT
the IP that a container sees itself as is the same IP that others see it as
Kubernetes has 3 kinds of service types (see Publishing services - service types):
- ClusterIP
- NodePort
- LoadBalancer
In the following sections, we will discuss each service type one by one.
1 Environment
The sample Kubernetes cluster's network setup is below:
- Pod IP CIDR: 10.244.0.0/16
- Pod CIDR assigned to this agent node: 10.244.1.0/24
- Cluster IP CIDR: 10.0.0.0/16
The cluster uses the kubenet plug-in, which uses the bridge (cbr0) and host-local CNI plugins; the agent node's network interfaces should look like this:
#ifconfig -a
cbr0 Link encap:Ethernet HWaddr 82:26:7d:43:26:2c
inet addr:10.244.1.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::8026:7dff:fe43:262c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:6161105 errors:0 dropped:0 overruns:0 frame:0
TX packets:5554960 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1023024859 (1.0 GB) TX bytes:6444977005 (6.4 GB)
...
eth0 Link encap:Ethernet HWaddr 00:0d:3a:a2:8d:f5
inet addr:10.240.0.5 Bcast:10.240.255.255 Mask:255.255.0.0
inet6 addr: fe80::20d:3aff:fea2:8df5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:13829425 errors:0 dropped:0 overruns:0 frame:0
TX packets:10841327 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:10845741411 (10.8 GB) TX bytes:2219795638 (2.2 GB)
...
veth23307616 Link encap:Ethernet HWaddr 5a:e1:28:c8:20:d7
inet6 addr: fe80::58e1:28ff:fec8:20d7/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:668419 errors:0 dropped:0 overruns:0 frame:0
TX packets:700410 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:150126499 (150.1 MB) TX bytes:1593031464 (1.5 GB)
cbr0 is a Linux bridge device and eth0 is an Ethernet device. Every time a pod is created, a virtual Ethernet device (vethxxxxxxxx) is created and connected to the bridge device cbr0. Because all pods' network interfaces are plugged into the same bridge device, all pods on the same agent node can communicate with each other.
#brctl show
bridge name bridge id STP enabled interfaces
cbr0 8000.82267d43262c no veth0abdd6ef
veth218b42ec
veth23307616
veth2beb9e80
veth32de6a73
veth3309ba9e
veth3b498f61
veth41ecf4ee
veth44fcd74a
veth63b06052
veth81cd642e
veth8ba701b0
vethc4eba469
vethea582003
vethf248ae55
2 How kubernetes uses iptables for services
Before explaining each service type's implementation in iptables, we need to understand how Kubernetes plugs itself into iptables.
2.1 Custom chain
To hook into packet filtering and NAT, Kubernetes creates a custom iptables chain KUBE-SERVICES and redirects all PREROUTING and OUTPUT traffic to it, as shown below:
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
...
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
...
The PREROUTING chain handles inbound traffic from the external network as well as inbound traffic from the pod network.
The OUTPUT chain handles outbound traffic to the external network as well as outbound traffic to the pod network.
With the KUBE-SERVICES chain hooked into packet filtering and NAT, Kubernetes can inspect traffic to its services and apply SNAT/DNAT accordingly.
The KUBE-SERVICES chain is used for the ClusterIP and LoadBalancer service types; at the end of the KUBE-SERVICES chain, another custom chain KUBE-NODEPORTS is installed to handle traffic for the NodePort service type, as shown below:
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
Starting from the KUBE-SERVICES custom chain, Kubernetes creates further custom chains that together represent a service, for example:
- KUBE-SERVICES->KUBE-SVC-XXXXXXXXXXXXXXXX->KUBE-SEP-XXXXXXXXXXXXXXXX represents a ClusterIP service
- KUBE-NODEPORTS->KUBE-SVC-XXXXXXXXXXXXXXXX->KUBE-SEP-XXXXXXXXXXXXXXXX represents a NodePort service
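To inspect these chains directly on an agent node, the NAT table can be dumped; for example (a quick check, the exact rules will differ per cluster):
#iptables -t nat -S KUBE-SERVICES
#iptables -t nat -S | grep KUBE-SVC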
2.2 SNAT
If we dump iptables from an agent node, we will find the chains & rules below created for a specific service. They are used for external-network-to-service communication: if the source IP address is not from the Pod IP CIDR, the source IP will be SNAT-ed. For details, refer to Using Source IP.
*nat
...
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
...
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
...
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
For example, if we do a name lookup from the agent node against the Kubernetes DNS service, the source IP is the agent node IP (in our case 10.240.0.5); it is not in the Pod IP CIDR, so it is treated as an external IP.
#dig -b 10.240.0.5 @10.0.0.10 www.microsoft.com
; <<>> DiG 9.10.3-P4-Ubuntu <<>> -b 10.240.0.5 @10.0.0.10 www.microsoft.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46238
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;www.microsoft.com. IN A
;; ANSWER SECTION:
www.microsoft.com. 6 IN CNAME www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net. 6 IN CNAME www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net. 6 IN CNAME e13678.dspb.akamaiedge.net.
e13678.dspb.akamaiedge.net. 6 IN A 23.67.3.108
;; Query time: 2 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
;; WHEN: Wed Feb 06 06:26:00 UTC 2019
;; MSG SIZE rcvd: 351
The source IP address 10.240.0.5 is SNAT-ed to the cbr0 IP address 10.244.1.1, as shown in the tcpdump output below:
#tcpdump -i any port 53 -n -e -S
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
...
06:26:00.243425 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 90: 10.244.1.1.52173 > 10.244.1.58.53: 46238+ [1au] A? www.microsoft.com. (46)
06:26:00.243432 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 90: 10.244.1.1.52173 > 10.244.1.58.53: 46238+ [1au] A? www.microsoft.com. (46)
...
06:26:00.245375 P ca:44:10:89:95:7a ethertype IPv4 (0x0800), length 395: 10.244.1.58.53 > 10.244.1.1.52173: 46238 4/0/1 CNAME www.microsoft.com-c-3.edgekey.net., CNAME www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net., CNAME e13678.dspb.akamaiedge.net., A 23.67.3.108 (351)
06:26:00.245375 In ca:44:10:89:95:7a ethertype IPv4 (0x0800), length 395: 10.244.1.58.53 > 10.240.0.5.52173: 46238 4/0/1 CNAME www.microsoft.com-c-3.edgekey.net., CNAME www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net., CNAME e13678.dspb.akamaiedge.net., A 23.67.3.108 (351)
If we check the netfilter connection tracking table, a corresponding mapping entry is created, similar to the one below:
#conntrack -L | grep 52173
...
udp 17 22 src=10.240.0.5 dst=10.0.0.10 sport=52173 dport=53 src=10.244.1.58 dst=10.244.1.1 sport=53 dport=52173 mark=0 use=1
...
3 ClusterIP
3.1 ClusterIP introduction
Before discussing the ClusterIP service in detail, we will deploy two redis replicas with kubectl apply -f redis.yaml and use this setup to explain 5 flavors of ClusterIP service, creating each one in the following sub-sections:
- ClusterIP service
- ClusterIP service with session affinity
- ClusterIP with external IPs
- ClusterIP service without any endpoints
- Headless service
The sample two-replica deployment file is below:
#redis.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis
        ports:
        - containerPort: 6379
          name: redis
3.2 ClusterIP(redis)
Defining a ClusterIP service is pretty simple; below is a template to create a redis ClusterIP service, simply run kubectl apply -f redis-clusterip.yaml to deploy it.
#redis-clusterip.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
  - port: 6379
  selector:
    app: redis
Kubernetes should create a service called "redis"; in my cluster setup, its cluster IP address is 10.0.19.85 and it has two endpoints pointing to the two redis pods, whose IP addresses are 10.244.1.69 and 10.244.1.70.
#kubectl get service redis
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis ClusterIP 10.0.19.85 <none> 6379/TCP 3d4h
#kubectl get endpoints redis
NAME ENDPOINTS AGE
redis 10.244.1.69:6379,10.244.1.70:6379 3d4h
A ClusterIP service's IP doesn't exist anywhere: although the IP address is in the cluster IP CIDR, it isn't bound to any interface or process. It is a virtual IP address, and Kubernetes registers a DNS A record to associate the service's DNS name with it.
#nslookup redis.default.svc.cluster.local 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: redis.default.svc.cluster.local
Address: 10.0.19.85
A ClusterIP service's IP is accessible from outside the pod network (for example, from the agent node) as well as from the pod network. The magic behind it is that kube-proxy creates iptables chains & rules. For example, below are all the iptables chains & rules for service "redis" (adjusted for readability):
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.19.85/32 -p tcp -m comment --comment "default/redis: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.19.85/32 -p tcp -m comment --comment "default/redis: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-SCFPZ36VFLUNBB47
-A KUBE-SVC-SCFPZ36VFLUNBB47 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-UH5EYFQKYB24RWKN
-A KUBE-SVC-SCFPZ36VFLUNBB47 -j KUBE-SEP-5MXPM55VLN7O52FQ
-A KUBE-SEP-UH5EYFQKYB24RWKN -s 10.244.1.69/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-UH5EYFQKYB24RWKN -p tcp -m tcp -j DNAT --to-destination 10.244.1.69:6379
-A KUBE-SEP-5MXPM55VLN7O52FQ -s 10.244.1.70/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-5MXPM55VLN7O52FQ -p tcp -m tcp -j DNAT --to-destination 10.244.1.70:6379
Starting from chain KUBE-SERVICES, inbound traffic to that service is randomly distributed to endpoint 10.244.1.69:6379 or endpoint 10.244.1.70:6379. This is done by the iptables DNAT rules -A KUBE-SEP-UH5EYFQKYB24RWKN -p tcp -m tcp -j DNAT --to-destination 10.244.1.69:6379 and -A KUBE-SEP-5MXPM55VLN7O52FQ -p tcp -m tcp -j DNAT --to-destination 10.244.1.70:6379.
The load balancing algorithm is provided by the iptables module "statistic"; in the two-replica redis deployment, the "statistic" module randomly chooses one of the two back-end endpoints based on the iptables rules below:
-A KUBE-SVC-SCFPZ36VFLUNBB47 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-UH5EYFQKYB24RWKN
-A KUBE-SVC-SCFPZ36VFLUNBB47 -j KUBE-SEP-5MXPM55VLN7O52FQ
"statistic" module matches packets based on some statistic condition. --mode random option plus --probability p Set the probability for a packet to be randomly matched.
The rules -A KUBE-SEP-UH5EYFQKYB24RWKN -s 10.244.1.69/32 -j KUBE-MARK-MASQ and -A KUBE-SEP-5MXPM55VLN7O52FQ -s 10.244.1.70/32 -j KUBE-MARK-MASQ are used for hairpin NAT.
A pod can connect to a service IP it serves, looping back to itself. In this case, the Linux bridge implementation stops the packet from being processed because it then has the same input and output interfaces.
The solution to this problem is to enable hairpin mode on each port of the virtual bridge to which the pods are connected. kubelet has a config flag --hairpin-mode=hairpin-veth to enable hairpin mode; after enabling it, each veth connected to the bridge should have hairpin on, for example bridge -d link should show the results below:
#bridge -d link
...
5: veth3b498f61 state UP @docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master cbr0 state forwarding priority 32 cost 2
hairpin on guard off root_block off fastleave off learning on flood on
6: veth81cd642e state UP @docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master cbr0 state forwarding priority 32 cost 2
hairpin on guard off root_block off fastleave off learning on flood on
...
And in the kubelet log (journalctl -u kubelet), we should find "hairpinMode" set to "true":
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: I0205 14:37:25.729928 1301 kubenet_linux.go:254] CNI network config set to {
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "cniVersion": "0.1.0",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "name": "kubenet",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "type": "bridge",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "bridge": "cbr0",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "mtu": 1500,
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "addIf": "eth0",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "isGateway": true,
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "ipMasq": false,
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "hairpinMode": true,
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "ipam": {
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "type": "host-local",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "subnet": "10.244.1.0/24",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "gateway": "10.244.1.1",
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: "routes": [
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: { "dst": "0.0.0.0/0" }
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: ]
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: }
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: }
Feb 05 14:37:25 aks-nodepool1-41808012-1 kubelet[1301]: I0205 14:37:25.737835 1301 kubelet_network.go:75] Setting Pod CIDR: -> 10.244.1.0/24
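Hairpin mode can also be checked per bridge port through sysfs; for one of the veth interfaces shown above, the attribute should read 1 when hairpin is enabled (a quick check):
#cat /sys/class/net/veth3b498f61/brport/hairpin_mode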
If we get a shell into one of the redis containers and use redis-cli to connect back to itself through the service, it works without any problems.
#kubectl exec -it redis-56f8fbc4d-2qxsr -- /bin/bash
root@redis-56f8fbc4d-2qxs:/data# redis-cli -h redis.default -p 6379
redis.default:6379> exit
If we capture a network trace on the agent node, we can see that both the source IP and the destination IP are NAT-ed:
#tcpdump -i any port 6379 -n -e -S
05:16:06.516040 P da:78:75:98:84:9f ethertype IPv4 (0x0800), length 76: 10.244.1.69.53480 > 10.0.19.85.6379: Flags [S], seq 1600349012, win 29200, options [mss 1460,sackOK,TS val 2700431537 ecr 0,nop,wscale 7], length 0
05:16:06.516066 Out da:78:75:98:84:9f ethertype IPv4 (0x0800), length 76: 10.244.1.1.53480 > 10.244.1.69.6379: Flags [S], seq 1600349012, win 29200, options [mss 1460,sackOK,TS val 2700431537 ecr 0,nop,wscale 7], length 0
05:16:06.516077 P da:78:75:98:84:9f ethertype IPv4 (0x0800), length 76: 10.244.1.69.6379 > 10.244.1.1.53480: Flags [S.], seq 4204290119, ack 1600349013, win 28960, options [mss 1460,sackOK,TS val 2295287925 ecr 2700431537,nop,wscale 7], length 0
05:16:06.516082 Out da:78:75:98:84:9f ethertype IPv4 (0x0800), length 76: 10.0.19.85.6379 > 10.244.1.69.53480: Flags [S.], seq 4204290119, ack 1600349013, win 28960, options [mss 1460,sackOK,TS val 2295287925 ecr 2700431537,nop,wscale 7], length 0
05:16:06.516091 P da:78:75:98:84:9f ethertype IPv4 (0x0800), length 68: 10.244.1.69.53480 > 10.0.19.85.6379: Flags [.], ack 4204290120, win 229, options [nop,nop,TS val 2700431537 ecr 2295287925], length 0
05:16:06.516095 Out da:78:75:98:84:9f ethertype IPv4 (0x0800), length 68: 10.244.1.1.53480 > 10.244.1.69.6379: Flags [.], ack 4204290120, win 229, options [nop,nop,TS val 2700431537 ecr 2295287925], length 0
The mapping entry created in the netfilter connection tracking table also shows that SNAT & DNAT are applied to the source IP and destination IP:
#conntrack -L | grep 53480
...
tcp 6 86396 ESTABLISHED src=10.244.1.69 dst=10.0.19.85 sport=53480 dport=6379 src=10.244.1.69 dst=10.244.1.1 sport=6379 dport=53480 [ASSURED] mark=0 use=1
...
3.3 ClusterIP with session affinity(redis-sa)
Kubernetes supports ClientIP-based session affinity; session affinity makes requests from the same client always get routed back to the same backend server (in Kubernetes, the same pod).
Refer to Services
Client-IP based session affinity can be selected by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"), and you can set the max session sticky time by setting the field service.spec.sessionAffinityConfig.clientIP.timeoutSeconds if you have already set service.spec.sessionAffinity to "ClientIP" (the default is "10800").
Below is a sample yaml file to create a session affinity redis service, deploy it with kubectl apply -f redis-clusterip-sa.yaml
#redis-clusterip-sa.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-sa
spec:
  sessionAffinity: ClientIP
  ports:
  - port: 6379
  selector:
    app: redis
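The sticky timeout can also be set explicitly via sessionAffinityConfig; for example, the spec above could optionally be extended as follows (a sketch using a 1-hour timeout instead of the default 10800 seconds):
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600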
#kubectl get service redis-sa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-sa ClusterIP 10.0.219.234 <none> 6379/TCP 3d5h
#kubectl get endpoints redis-sa
NAME ENDPOINTS AGE
redis-sa 10.244.1.69:6379,10.244.1.70:6379 3d5h
Behind the scenes, session affinity is implemented by the iptables module "recent"; the "recent" module allows you to dynamically create a list of IP addresses and then match against that list in a few different ways.
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.219.234/32 -p tcp -m comment --comment "default/redis-sa: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.219.234/32 -p tcp -m comment --comment "default/redis-sa: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-YUZPDSCUOF7FG5LD
-A KUBE-SVC-YUZPDSCUOF7FG5LD -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-6MUUJB4K75LGZXHS --mask 255.255.255.255 --rsource -j KUBE-SEP-6MUUJB4K75LGZXHS
-A KUBE-SVC-YUZPDSCUOF7FG5LD -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-F5DCISRHJOTG66JA --mask 255.255.255.255 --rsource -j KUBE-SEP-F5DCISRHJOTG66JA
-A KUBE-SVC-YUZPDSCUOF7FG5LD -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-6MUUJB4K75LGZXHS
-A KUBE-SVC-YUZPDSCUOF7FG5LD -j KUBE-SEP-F5DCISRHJOTG66JA
-A KUBE-SEP-6MUUJB4K75LGZXHS -s 10.244.1.69/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-6MUUJB4K75LGZXHS -p tcp -m recent --set --name KUBE-SEP-6MUUJB4K75LGZXHS --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.1.69:6379
-A KUBE-SEP-F5DCISRHJOTG66JA -s 10.244.1.70/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-F5DCISRHJOTG66JA -p tcp -m recent --set --name KUBE-SEP-F5DCISRHJOTG66JA --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.1.70:6379
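The "recent" module keeps its state under /proc/net/xt_recent/, one file per list name, so the client IPs currently tracked for an endpoint can be inspected directly on the agent node (a quick check; the file only appears once the lists above have been created and hit):
#cat /proc/net/xt_recent/KUBE-SEP-6MUUJB4K75LGZXHS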
3.4 ClusterIP with external IPs(redis-externalip)
If there are external IPs that route to one or more cluster nodes, Kubernetes services can be exposed on those externalIPs. Traffic that ingresses into the cluster with the external IP (as destination IP), on the service port, will be routed to one of the service endpoints.
Below is the sample yaml file for a service with "externalIPs"; deploy it with kubectl apply -f redis-externalip.yaml
#redis-externalip.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-externalip
spec:
  ports:
  - port: 6379
  selector:
    app: redis
  externalIPs:
  - 10.240.0.5
The complete set of iptables chains & rules is below:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.100.145/32 -p tcp -m comment --comment "default/redis-externalip: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.100.145/32 -p tcp -m comment --comment "default/redis-externalip: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-PBGWSN7UU5334HUX
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-PBGWSN7UU5334HUX
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -m addrtype --dst-type LOCAL -j KUBE-SVC-PBGWSN7UU5334HUX
-A KUBE-SVC-PBGWSN7UU5334HUX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ADZDP3BM3UNKJ5NH
-A KUBE-SVC-PBGWSN7UU5334HUX -j KUBE-SEP-3ZL3WNQOXRYCSRYM
-A KUBE-SEP-ADZDP3BM3UNKJ5NH -s 10.244.1.76/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-ADZDP3BM3UNKJ5NH -p tcp -m tcp -j DNAT --to-destination 10.244.1.76:6379
-A KUBE-SEP-3ZL3WNQOXRYCSRYM -s 10.244.1.77/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-3ZL3WNQOXRYCSRYM -p tcp -m tcp -j DNAT --to-destination 10.244.1.77:6379
On top of a regular service's iptables chains & rules, kube-proxy creates a set of additional rules in the KUBE-SERVICES chain, shown below:
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-PBGWSN7UU5334HUX
-A KUBE-SERVICES -d 10.240.0.5/32 -p tcp -m comment --comment "default/redis-externalip: external IP" -m tcp --dport 6379 -m addrtype --dst-type LOCAL -j KUBE-SVC-PBGWSN7UU5334HUX
It leverages the physdev match to identify traffic coming in from the physical Ethernet device (rather than a bridge port) and to perform NAT accordingly.
physdev
This module matches on the bridge port input and output devices enslaved to a bridge device. This module is a part of the infrastructure that enables a transparent bridging IP firewall and is only useful for kernel versions above version 2.5.44.
--physdev-is-in
Matches if the packet has entered through a bridge interface.
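As a side note, for bridged pod traffic to be visible to these iptables rules (including the physdev match) at all, the br_netfilter module must be loaded and the bridge-nf-call-iptables sysctl enabled; on the agent node this can be verified with:
#lsmod | grep br_netfilter
#sysctl net.bridge.bridge-nf-call-iptables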
If we try to connect to the external IP with the redis client, we should connect successfully:
#redis-cli -h 10.240.0.5 -p 6379
10.240.0.5:6379> exit
Capturing a network trace shows how the traffic gets relayed:
#tcpdump -i any port 6379 -n -e -S
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
11:52:17.448530 In 70:e4:22:93:bd:bf ethertype IPv4 (0x0800), length 76: 10.188.0.4.43534 > 10.240.0.5.6379: Flags [S], seq 726099076, win 29200, options [mss 1418,sackOK,TS val 4124557645 ecr 0,nop,wscale 7], length 0
11:52:17.448744 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 76: 10.244.1.1.43534 > 10.244.1.77.6379: Flags [S], seq 726099076, win 29200, options [mss 1418,sackOK,TS val 4124557645 ecr 0,nop,wscale 7], length 0
11:52:17.448748 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 76: 10.244.1.1.43534 > 10.244.1.77.6379: Flags [S], seq 726099076, win 29200, options [mss 1418,sackOK,TS val 4124557645 ecr 0,nop,wscale 7], length 0
11:52:17.448761 P 5a:57:73:30:35:17 ethertype IPv4 (0x0800), length 76: 10.244.1.77.6379 > 10.244.1.1.43534: Flags [S.], seq 2175890393, ack 726099077, win 28960, options [mss 1460,sackOK,TS val 358432586 ecr 4124557645,nop,wscale 7], length 0
11:52:17.448767 In 5a:57:73:30:35:17 ethertype IPv4 (0x0800), length 76: 10.244.1.77.6379 > 10.188.0.4.43534: Flags [S.], seq 2175890393, ack 726099077, win 28960, options [mss 1460,sackOK,TS val 358432586 ecr 4124557645,nop,wscale 7], length 0
11:52:17.448773 Out 00:0d:3a:a2:8d:f5 ethertype IPv4 (0x0800), length 76: 10.240.0.5.6379 > 10.188.0.4.43534: Flags [S.], seq 2175890393, ack 726099077, win 28960, options [mss 1460,sackOK,TS val 358432586 ecr 4124557645,nop,wscale 7], length 0
11:52:17.449460 In 70:e4:22:93:bd:bf ethertype IPv4 (0x0800), length 68: 10.188.0.4.43534 > 10.240.0.5.6379: Flags [.], ack 2175890394, win 229, options [nop,nop,TS val 4124557646 ecr 358432586], length 0
11:52:17.449471 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 68: 10.244.1.1.43534 > 10.244.1.77.6379: Flags [.], ack 2175890394, win 229, options [nop,nop,TS val 4124557646 ecr 358432586], length 0
11:52:17.449473 Out 82:26:7d:43:26:2c ethertype IPv4 (0x0800), length 68: 10.244.1.1.43534 > 10.244.1.77.6379: Flags [.], ack 2175890394, win 229, options [nop,nop,TS val 4124557646 ecr 358432586], length 0
The mapping entry created in the netfilter connection tracking table also shows that SNAT & DNAT are applied to the source IP and destination IP:
# conntrack -L|grep 6379
...
tcp 6 86387 ESTABLISHED src=10.188.0.4 dst=10.240.0.5 sport=43534 dport=6379 src=10.244.1.77 dst=10.244.1.1 sport=6379 dport=43534 [ASSURED] mark=0 use=1
Also, kube-proxy reserves a listening port on the host node for the external IP, although it is iptables that does the actual translation:
#lsof -i :6379
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
hyperkube 4397 root 6u IPv4 21849875 0t0 TCP 10.240.0.5:6379 (LISTEN)
3.5 ClusterIP without any endpoints(redis-none)
A ClusterIP service is normally associated with backend pods; it uses a "selector" to select them. If backend pods are found by the selector, Kubernetes creates an Endpoints object mapping to the pods' IP:Port; otherwise, the service will not have any endpoints.
For example, if we deploy a ClusterIP service by kubectl apply -f redis-clusterip-none.yaml
with selector "app: redis-none"
#redis-clusterip-none.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-none
spec:
  ports:
  - port: 6379
  selector:
    app: redis-none
Since no pod has the label "app: redis-none", its ENDPOINTS is <none>:
#kubectl get service redis-none
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-none ClusterIP 10.0.8.126 <none> 6379/TCP 3d5h
#kubectl get endpoints redis-none
NAME ENDPOINTS AGE
redis-none <none> 3d5h
Correspondingly, Kubernetes creates a rule in the KUBE-SERVICES chain to reject inbound traffic targeted at that service with an ICMP port unreachable message:
-A KUBE-SERVICES -d 10.0.8.126/32 -p tcp -m comment --comment "default/redis-none: has no endpoints" -m tcp --dport 6379 -j REJECT --reject-with icmp-port-unreachable
If we try to access that service from the command line with redis-cli -h 10.0.8.126 -p 6379, the captured trace shows that an "ICMP 10.0.8.126 tcp port 6379 unreachable" message is sent back:
#tcpdump -i any port 6379 or icmp -n -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
07:31:08.853390 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 104: 10.240.0.5 > 10.240.0.5: ICMP 10.0.8.126 tcp port 6379 unreachable, length 68
07:31:09.878729 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 104: 10.240.0.5 > 10.240.0.5: ICMP 10.0.8.126 tcp port 6379 unreachable, length 68
Note: the MAC address in the above tcpdump output is 00:00:00:00:00:00, which basically means the packet was never sent out of any network device.
3.6 Headless(redis-headless)
Refer to Headless services:
Sometimes you don't need or want load-balancing and a single service IP. In this case, you can create "headless" services by specifying "None" for the cluster IP (.spec.clusterIP). This option allows developers to reduce coupling to the Kubernetes system by allowing them freedom to do discovery their own way. Applications can still use a self-registration pattern and adapters for other discovery systems could easily be built upon this API.
If we deploy a headless service by kubectl apply -f redis-clusterip-headless.yaml
#redis-clusterip-headless.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-headless
spec:
  clusterIP: None
  ports:
  - port: 6379
  selector:
    app: redis
We can still find that the service and endpoints are created for the headless service in Kubernetes:
#kubectl get service redis-headless
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-headless ClusterIP None <none> 6379/TCP 3d5h
#kubectl get endpoints redis-headless
NAME ENDPOINTS AGE
redis-headless 10.244.1.70:6379,10.244.1.72:6379 3d5h
However, no iptables chains/rules are created for the headless service, and checking the DNS records for the headless service shows that it uses the pods' IP addresses directly:
#nslookup redis-headless.default.svc.cluster.local 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: redis-headless.default.svc.cluster.local
Address: 10.244.1.69
Name: redis-headless.default.svc.cluster.local
Address: 10.244.1.70
Hence, inbound traffic to a headless service goes directly to the pods' IPs and no DNAT is applied.
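A quick way to confirm this is to connect through the headless DNS name from inside a pod (reusing the pod name from the hairpin test above) and then look at the conntrack table on the agent node: because no iptables rules are involved, the reply tuple simply mirrors the original one with no address translation (a sketch; output omitted):
#kubectl exec -it redis-56f8fbc4d-2qxsr -- redis-cli -h redis-headless.default -p 6379
#conntrack -L | grep 6379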
4 NodePort
4.1 NodePort introduction
This section discusses the NodePort service. To demo some NodePort concepts, we will deploy one redis replica with kubectl apply -f redis.yaml to a cluster with two agent nodes, aks-nodepool1-41808012-1 and aks-nodepool1-41808012-2.
The two Kubernetes nodes' settings are below:
#kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-nodepool1-41808012-1 Ready agent 9d v1.12.4 10.240.0.5 <none> Ubuntu 16.04.5 LTS 4.15.0-1036-azure docker://3.0.2
aks-nodepool1-41808012-2 Ready agent 17m v1.12.4 10.240.0.4 <none> Ubuntu 16.04.5 LTS 4.15.0-1035-azure docker://3.0.1
We will discuss 5 flavors of NodePort service by creating each one in the following sub-sections:
- NodePort service
- NodePort service with externalTrafficPolicy: Local
- NodePort service without any endpoints
- NodePort service with session affinity
- NodePort service with externalTrafficPolicy: Local and session affinity
The corresponding redis deployment file is below; deploy it with kubectl apply -f redis.yaml
#redis.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis
        ports:
        - containerPort: 6379
          name: redis
4.2 NodePort(redis-nodeport)
NodePort is a specific service type in Kubernetes; it allocates a port (specified by nodePort in the service yaml file) on each agent node.
the Kubernetes master will allocate a port from a range specified by --service-node-port-range flag (default: 30000-32767), each Node will proxy that port (the same port number on every Node) into your Service
Below is a sample template to deploy a NodePort service; deploying it with kubectl apply -f redis-nodeport.yaml will create the NodePort service on each agent node.
#redis-nodeport.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-nodeport
spec:
  type: NodePort
  ports:
  - nodePort: 30001
    port: 6379
    targetPort: 6379
  selector:
    app: redis
NodePort: Exposes the service on each Node's IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You'll be able to contact the NodePort service, from outside the cluster, by requesting <NodeIP>:<NodePort>.
When creating a NodePort service, a ClusterIP service is implicitly created as well. For example, kubectl get service redis-nodeport shows a CLUSTER-IP for the NodePort service:
#kubectl get service redis-nodeport
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-nodeport NodePort 10.0.118.143 <none> 6379:30001/TCP 107s
#kubectl get endpoints redis-nodeport
NAME ENDPOINTS AGE
redis-nodeport 10.244.0.4:6379 110s
In iptables, we can also find that kube-proxy adds two sets of chains & rules: KUBE-SERVICES for the ClusterIP part and KUBE-NODEPORTS for the NodePort part.
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.118.143/32 -p tcp -m comment --comment "default/redis-nodeport: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.118.143/32 -p tcp -m comment --comment "default/redis-nodeport: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-AEB3C6SA5VDECMX3
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport:" -m tcp --dport 30001 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport:" -m tcp --dport 30001 -j KUBE-SVC-AEB3C6SA5VDECMX3
-A KUBE-SVC-AEB3C6SA5VDECMX3 -j KUBE-SEP-D5FBLP7RKYOSCD7A
-A KUBE-SEP-D5FBLP7RKYOSCD7A -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-D5FBLP7RKYOSCD7A -p tcp -m tcp -j DNAT --to-destination 10.244.0.4:6379
Although iptables is leveraged to redirect inbound traffic for the NodePort service, the kube-proxy process still allocates a listening port on each agent node; this is to make sure no other application can listen on the same port.
#lsof -i :30001
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
hyperkube 4397 root 6u IPv6 2124912 0t0 TCP *:30001 (LISTEN)
#ps -p 4397 -o args
COMMAND
/hyperkube proxy --kubeconfig=/var/lib/kubelet/kubeconfig --cluster-cidr=10.244.0.0/16 --feature-gates=ExperimentalCriticalPodAnnotation=true
Refer to the source code claimNodePort; the purpose of doing that is:
Hold the actual port open, even though we use iptables to redirect it. This ensures that a) it's safe to take and b) that stays true.
A NodePort service can be accessed via the cluster IP as well as via NodeIP:NodePort on each agent node.
#redis-cli -h 10.0.118.143 -p 6379
10.0.118.143:6379> exit
#redis-cli -h 10.240.0.4 -p 30001
10.240.0.4:30001> exit
#redis-cli -h 10.240.0.5 -p 30001
10.240.0.5:30001> exit
This works even though we only have one replica, running on agent node aks-nodepool1-41808012-2:
#kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
...
redis-56f8fbc4d-2rkr2 1/1 Running 0 8m55s 10.244.0.4 aks-nodepool1-41808012-2 <none>
What happens when connecting to aks-nodepool1-41808012-1 (which has no local endpoints) with redis-cli -h 10.240.0.5 -p 30001 is:
- The client (10.188.0.4) sends a TCP SYN packet from port 58306 to the aks-nodepool1-41808012-1 NodePort 10.240.0.5:30001
- The iptables rule -A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport:" -m tcp --dport 30001 -j KUBE-MARK-MASQ marks the packet so the source IP:Port gets SNAT-ed from 10.188.0.4:58306 to 10.240.0.5:58306
- The iptables rule -A KUBE-SEP-D5FBLP7RKYOSCD7A -p tcp -m tcp -j DNAT --to-destination 10.244.0.4:6379 DNATs the destination IP:Port from 10.240.0.5:30001 to 10.244.0.4:6379
- As the destination IP address is now 10.244.0.4, the IP address of pod redis-56f8fbc4d-2rkr2 on aks-nodepool1-41808012-2, aks-nodepool1-41808012-1 forwards this packet to aks-nodepool1-41808012-2, which in turn forwards it to pod redis-56f8fbc4d-2rkr2
- After redis-56f8fbc4d-2rkr2 ACKs the TCP SYN, the SYN-ACK takes the same path in reverse order back to the client (10.188.0.4)
#tcpdump -i any port 30001 or 6379 -n -e -S
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:12:53.319162 In 70:e4:22:93:bd:bf ethertype IPv4 (0x0800), length 76: 10.188.0.4.58306 > 10.240.0.5.30001: Flags [S], seq 2120389110, win 29200, options [mss 1418,sackOK,TS val 4042993518 ecr 0,nop,wscale 7], length 0
13:12:53.319205 Out 00:0d:3a:a2:8d:f5 ethertype IPv4 (0x0800), length 76: 10.240.0.5.58306 > 10.244.0.4.6379: Flags [S], seq 2120389110, win 29200, options [mss 1418,sackOK,TS val 4042993518 ecr 0,nop,wscale 7], length 0
13:12:53.320214 In 70:e4:22:93:bd:bf ethertype IPv4 (0x0800), length 76: 10.244.0.4.6379 > 10.240.0.5.58306: Flags [S.], seq 3889872998, ack 2120389111, win 28960, options [mss 1418,sackOK,TS val 3633445287 ecr 4042993518,nop,wscale 7], length 0
13:12:53.320240 Out 00:0d:3a:a2:8d:f5 ethertype IPv4 (0x0800), length 76: 10.240.0.5.30001 > 10.188.0.4.58306: Flags [S.], seq 3889872998, ack 2120389111, win 28960, options [mss 1418,sackOK,TS val 3633445287 ecr 4042993518,nop,wscale 7], length 0
13:12:53.320864 In 70:e4:22:93:bd:bf ethertype IPv4 (0x0800), length 68: 10.188.0.4.58306 > 10.240.0.5.30001: Flags [.], ack 3889872999, win 229, options [nop,nop,TS val 4042993520 ecr 3633445287], length 0
13:12:53.320876 Out 00:0d:3a:a2:8d:f5 ethertype IPv4 (0x0800), length 68: 10.240.0.5.58306 > 10.244.0.4.6379: Flags [.], ack 3889872999, win 229, options [nop,nop,TS val 4042993520 ecr 3633445287], length 0
The NodePort on aks-nodepool1-41808012-1 acts as a gateway relaying packets to/from the redis pod. The mapping entry in the conntrack table also shows that SNAT/DNAT are applied:
#conntrack -L | grep 30001
tcp 6 86395 ESTABLISHED src=10.188.0.4 dst=10.240.0.5 sport=58306 dport=30001 src=10.244.0.4 dst=10.240.0.5 sport=6379 dport=58306 [ASSURED] mark=0 use=1
conntrack v1.4.3 (conntrack-tools): 859 flow entries have been shown.
4.3 NodePort with "externalTrafficPolicy: Local"(redis-nodeport-local)
Refer to Using Source IP
Kubernetes has a feature to preserve the client source IP (check here for feature availability). Setting service.spec.externalTrafficPolicy to the value Local will only proxy requests to local endpoints, never forwarding traffic to other nodes and thereby preserving the original source IP address. If there are no local endpoints, packets sent to the node are dropped, so you can rely on the correct source-ip in any packet processing rules you might apply to a packet that makes it through to the endpoint.
Using "externalTrafficPolicy: Local" will preserve source IP and drop packets from agent node has no local endpoint.
For example, if we deploy a NodePort service by kubectl apply -f redis-nodeport-local.yaml
#redis-nodeport-local.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-nodeport-local
spec:
  externalTrafficPolicy: Local
  type: NodePort
  ports:
  - nodePort: 30002
    port: 6379
  selector:
    app: redis
The endpoint of redis-nodeport-local points to pod IP:Port 10.244.0.4:6379, which exists on aks-nodepool1-41808012-2:
#kubectl get service redis-nodeport-local
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-nodeport-local NodePort 10.0.178.235 <none> 6379:30002/TCP 11h
#kubectl get endpoints redis-nodeport-local
NAME ENDPOINTS AGE
redis-nodeport-local 10.244.0.4:6379 11h
If we check the iptables chains & rules for service "redis-nodeport-local" on agent node aks-nodepool1-41808012-1, as there is no local endpoint, the rule -A KUBE-XLB-5ETJGCYWCRYR2EKE -m comment --comment "default/redis-nodeport-local: has no local endpoints" -j KUBE-MARK-DROP drops any packets sent to the NodePort service of agent node aks-nodepool1-41808012-1.
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.178.235/32 -p tcp -m comment --comment "default/redis-nodeport-local: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.178.235/32 -p tcp -m comment --comment "default/redis-nodeport-local: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-5ETJGCYWCRYR2EKE
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/redis-nodeport-local:" -m tcp --dport 30002 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-local:" -m tcp --dport 30002 -j KUBE-XLB-5ETJGCYWCRYR2EKE
-A KUBE-XLB-5ETJGCYWCRYR2EKE -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-5ETJGCYWCRYR2EKE
-A KUBE-XLB-5ETJGCYWCRYR2EKE -m comment --comment "default/redis-nodeport-local: has no local endpoints" -j KUBE-MARK-DROP
-A KUBE-SVC-5ETJGCYWCRYR2EKE -j KUBE-SEP-OV7T5FIR3LL6QKS4
-A KUBE-SEP-OV7T5FIR3LL6QKS4 -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-OV7T5FIR3LL6QKS4 -p tcp -m tcp -j DNAT --to-destination 10.244.0.4:6379
On the contrary, since the redis pod is running on agent node aks-nodepool1-41808012-2, that node has a local endpoint, and hence the iptables rule -A KUBE-XLB-5ETJGCYWCRYR2EKE -m comment --comment "Balancing rule 0 for default/redis-nodeport-local:" -j KUBE-SEP-OV7T5FIR3LL6QKS4 DNATs inbound traffic to the local endpoint.
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.178.235/32 -p tcp -m comment --comment "default/redis-nodeport-local: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.178.235/32 -p tcp -m comment --comment "default/redis-nodeport-local: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-5ETJGCYWCRYR2EKE
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/redis-nodeport-local:" -m tcp --dport 30002 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-local:" -m tcp --dport 30002 -j KUBE-XLB-5ETJGCYWCRYR2EKE
-A KUBE-XLB-5ETJGCYWCRYR2EKE -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-5ETJGCYWCRYR2EKE
-A KUBE-XLB-5ETJGCYWCRYR2EKE -m comment --comment "Balancing rule 0 for default/redis-nodeport-local:" -j KUBE-SEP-OV7T5FIR3LL6QKS4
-A KUBE-SVC-5ETJGCYWCRYR2EKE -j KUBE-SEP-OV7T5FIR3LL6QKS4
-A KUBE-SEP-OV7T5FIR3LL6QKS4 -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-OV7T5FIR3LL6QKS4 -p tcp -m tcp -j DNAT --to-destination 10.244.0.4:6379
4.4 NodePort without any endpoints(redis-nodeport-none)
NodePort without any endpoints is similar to ClusterIP without any endpoints. For example, if we deploy a NodePort service with kubectl apply -f redis-nodeport-none.yaml
#redis-nodeport-none.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-nodeport-none
spec:
  type: NodePort
  ports:
  - nodePort: 30003
    port: 6379
  selector:
    app: redis-none
It still creates the ClusterIP and NodePort services, but without any ENDPOINTS:
#kubectl get service redis-nodeport-none
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-nodeport-none NodePort 10.0.109.24 <none> 6379:30003/TCP 16h
#kubectl get endpoints redis-nodeport-none
NAME ENDPOINTS AGE
redis-nodeport-none <none> 16h
Two iptables rules are created: the KUBE-EXTERNAL-SERVICES rule below is for the NodePort service and responds with "icmp port unreachable" to NodeIP:NodePort, while the KUBE-SERVICES rule is for the ClusterIP service and responds with "icmp port unreachable" to ClusterIP:Port.
-A KUBE-EXTERNAL-SERVICES -p tcp -m comment --comment "default/redis-nodeport-none: has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 30003 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 10.0.109.24/32 -p tcp -m comment --comment "default/redis-nodeport-none: has no endpoints" -m tcp --dport 6379 -j REJECT --reject-with icmp-port-unreachable
If we telnet to NodeIP:NodePort:
#telnet 10.240.0.5 30003
Trying 10.240.0.5...
telnet: Unable to connect to remote host: Connection refused
We will get "ICMP 10.240.0.5 tcp port 30003 unreachable" ICMP message.
#tcpdump -i any icmp -n -e -S
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
06:50:52.356251 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 104: 10.240.0.5 > 10.240.0.5: ICMP 10.240.0.5 tcp port 30003 unreachable, length 68
06:50:53.366638 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 104: 10.240.0.5 > 10.240.0.5: ICMP 10.240.0.5 tcp port 30003 unreachable, length 68
4.5 NodePort with session affinity(redis-nodeport-sa)
NodePort session affinity is similar to ClusterIP session affinity. For example, if we deploy a session-affinity service with kubectl apply -f redis-nodeport-sa.yaml
#redis-nodeport-sa.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-nodeport-sa
spec:
  type: NodePort
  sessionAffinity: ClientIP
  ports:
  - nodePort: 30004
    port: 6379
  selector:
    app: redis
Two NodePort rules are added to the KUBE-NODEPORTS chain:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.96.18/32 -p tcp -m comment --comment "default/redis-nodeport-sa: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.96.18/32 -p tcp -m comment --comment "default/redis-nodeport-sa: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-SZ6FWS64RW2BDF54
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-sa:" -m tcp --dport 30004 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-sa:" -m tcp --dport 30004 -j KUBE-SVC-SZ6FWS64RW2BDF54
-A KUBE-SVC-SZ6FWS64RW2BDF54 -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-LPBZDS5VEUCT3TQ6 --mask 255.255.255.255 --rsource -j KUBE-SEP-LPBZDS5VEUCT3TQ6
-A KUBE-SVC-SZ6FWS64RW2BDF54 -j KUBE-SEP-LPBZDS5VEUCT3TQ6
-A KUBE-SEP-LPBZDS5VEUCT3TQ6 -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-LPBZDS5VEUCT3TQ6 -p tcp -m recent --set --name KUBE-SEP-LPBZDS5VEUCT3TQ6 --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.4:6379
4.6 NodePort with "externalTrafficPolicy: Local" and session affinity(redis-nodeport-local-sa)
A NodePort service with externalTrafficPolicy: Local and session affinity is a combination of the NodePort service with session affinity and the NodePort service with externalTrafficPolicy: Local:
apiVersion: v1
kind: Service
metadata:
  name: redis-nodeport-local-sa
spec:
  type: NodePort
  externalTrafficPolicy: Local
  sessionAffinity: ClientIP
  ports:
  - nodePort: 30005
    port: 6379
  selector:
    app: redis
On the agent node that has no local endpoints, it simply drops requests to NodeIP:NodePort; for example, on VM aks-nodepool1-41808012-1, the iptables rules are:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.199.210/32 -p tcp -m comment --comment "default/redis-nodeport-local-sa: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.199.210/32 -p tcp -m comment --comment "default/redis-nodeport-local-sa: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-INGYPNHWFCYMV7GN
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/redis-nodeport-local-sa:" -m tcp --dport 30005 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-local-sa:" -m tcp --dport 30005 -j KUBE-XLB-INGYPNHWFCYMV7GN
-A KUBE-XLB-INGYPNHWFCYMV7GN -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-INGYPNHWFCYMV7GN
-A KUBE-XLB-INGYPNHWFCYMV7GN -m comment --comment "default/redis-nodeport-local-sa: has no local endpoints" -j KUBE-MARK-DROP
-A KUBE-SVC-INGYPNHWFCYMV7GN -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-4T6FPJ2ZBIJ6D6TT --mask 255.255.255.255 --rsource -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-SVC-INGYPNHWFCYMV7GN -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-SEP-4T6FPJ2ZBIJ6D6TT -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-4T6FPJ2ZBIJ6D6TT -p tcp -m recent --set --name KUBE-SEP-4T6FPJ2ZBIJ6D6TT --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.4:6379
While agent node aks-nodepool1-41808012-2, which has a local endpoint, accepts inbound traffic to NodeIP:NodePort:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.199.210/32 -p tcp -m comment --comment "default/redis-nodeport-local-sa: cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.199.210/32 -p tcp -m comment --comment "default/redis-nodeport-local-sa: cluster IP" -m tcp --dport 6379 -j KUBE-SVC-INGYPNHWFCYMV7GN
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/redis-nodeport-local-sa:" -m tcp --dport 30005 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/redis-nodeport-local-sa:" -m tcp --dport 30005 -j KUBE-XLB-INGYPNHWFCYMV7GN
-A KUBE-XLB-INGYPNHWFCYMV7GN -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-INGYPNHWFCYMV7GN
-A KUBE-XLB-INGYPNHWFCYMV7GN -m comment --comment "default/redis-nodeport-local-sa:" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-4T6FPJ2ZBIJ6D6TT --mask 255.255.255.255 --rsource -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-XLB-INGYPNHWFCYMV7GN -m comment --comment "Balancing rule 0 for default/redis-nodeport-local-sa:" -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-SVC-INGYPNHWFCYMV7GN -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-4T6FPJ2ZBIJ6D6TT --mask 255.255.255.255 --rsource -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-SVC-INGYPNHWFCYMV7GN -j KUBE-SEP-4T6FPJ2ZBIJ6D6TT
-A KUBE-SEP-4T6FPJ2ZBIJ6D6TT -s 10.244.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-4T6FPJ2ZBIJ6D6TT -p tcp -m recent --set --name KUBE-SEP-4T6FPJ2ZBIJ6D6TT --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.4:6379
5 LoadBalancer
5.1 LoadBalancer introduction
Refer to Services; a LoadBalancer service:
Exposes the service externally using a cloud provider’s load balancer. NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.
A LoadBalancer service implicitly includes ClusterIP and NodePort.
Assume we deploy a web frontend with kubectl apply -f vote.yaml together with the 5 flavors of LoadBalancer service below; we will explain each one in the following sub-sections:
- LoadBalancer service
- LoadBalancer service with externalTrafficPolicy: Local
- LoadBalancer service without any endpoints
- LoadBalancer service with session affinity
- LoadBalancer service with externalTrafficPolicy: Local and session affinity
#vote.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: vote
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: vote
    spec:
      containers:
      - name: vote
        image: microsoft/azure-vote-front:v1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 250m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 200Mi
        env:
        - name: REDIS
          value: "redis"
---
apiVersion: v1
kind: Service
metadata:
  name: vote-lb
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: vote
---
apiVersion: v1
kind: Service
metadata:
  name: vote-lb-local
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - port: 80
  selector:
    app: vote
---
apiVersion: v1
kind: Service
metadata:
  name: vote-lb-sa
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP
  ports:
  - port: 80
  selector:
    app: vote
---
apiVersion: v1
kind: Service
metadata:
  name: vote-lb-local-sa
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  sessionAffinity: ClientIP
  ports:
  - port: 80
  selector:
    app: vote
---
apiVersion: v1
kind: Service
metadata:
  name: vote-lb-none
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - port: 80
  selector:
    app: vote-none
5.2 Deployed services and endpoints
The deployed services and endpoints are
#kubectl get service
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
vote-lb LoadBalancer 10.0.96.126 104.215.199.117 80:31278/TCP 17m
vote-lb-local LoadBalancer 10.0.244.84 13.67.34.190 80:30588/TCP 17m
vote-lb-local-sa LoadBalancer 10.0.191.143 104.215.186.73 80:30423/TCP 17m
vote-lb-none LoadBalancer 10.0.174.123 13.67.108.219 80:32095/TCP 17m
vote-lb-sa LoadBalancer 10.0.66.132 104.215.184.203 80:30875/TCP 17m
...
#kubectl get endpoints
NAMESPACE NAME ENDPOINTS AGE
...
vote-lb 10.244.0.6:80 23m
vote-lb-local 10.244.0.6:80 23m
vote-lb-local-sa 10.244.0.6:80 23m
vote-lb-none <none> 23m
vote-lb-sa 10.244.0.6:80 23m
...
5.3 LoadBalancer(vote-lb)
As explained previously, LoadBalancer implicitly includes the ClusterIP and NodePort services; below is the complete list of iptables chains & rules for the LoadBalancer service:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.96.126/32 -p tcp -m comment --comment "default/vote-lb: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.96.126/32 -p tcp -m comment --comment "default/vote-lb: cluster IP" -m tcp --dport 80 -j KUBE-SVC-VXWUQCQV72VDOWKK
-A KUBE-SERVICES -d 104.215.199.117/32 -p tcp -m comment --comment "default/vote-lb: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-VXWUQCQV72VDOWKK
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb:" -m tcp --dport 31278 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb:" -m tcp --dport 31278 -j KUBE-SVC-VXWUQCQV72VDOWKK
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-SVC-VXWUQCQV72VDOWKK
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-SVC-VXWUQCQV72VDOWKK -j KUBE-SEP-UBWWGAV5O3W3X7ZZ
-A KUBE-SEP-UBWWGAV5O3W3X7ZZ -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-UBWWGAV5O3W3X7ZZ -p tcp -m tcp -j DNAT --to-destination 10.244.0.6:80
Besides the ClusterIP and NodePort iptables chains & rules, the chains & rules below specifically target the LoadBalancer service:
-A KUBE-SERVICES -d 104.215.199.117/32 -p tcp -m comment --comment "default/vote-lb: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-VXWUQCQV72VDOWKK
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-SVC-VXWUQCQV72VDOWKK
-A KUBE-FW-VXWUQCQV72VDOWKK -m comment --comment "default/vote-lb: loadbalancer IP" -j KUBE-MARK-DROP
The rule '-A KUBE-SERVICES -d 104.215.199.117/32 -p tcp -m comment --comment "default/vote-lb: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-VXWUQCQV72VDOWKK' requires the destination IP 104.215.199.117 to be preserved in the inbound traffic, otherwise it won't be able to do the DNAT. Cloud providers generally use a floating IP to deliver the public VIP, so the inbound traffic arrives with its destination IP address unchanged.
5.4 LoadBalancer with "externalTrafficPolicy: Local"(vote-lb-local)
It is pretty much like NodePort with "externalTrafficPolicy: Local".
On agent node aks-nodepool1-41808012-1, which has no local endpoints, the iptables chains and rules drop inbound traffic to the LoadBalancer IP: the rule -A KUBE-XLB-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: has no local endpoints" -j KUBE-MARK-DROP marks it for dropping (with -A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-MARK-DROP as a fallback).
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.244.84/32 -p tcp -m comment --comment "default/vote-lb-local: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.244.84/32 -p tcp -m comment --comment "default/vote-lb-local: cluster IP" -m tcp --dport 80 -j KUBE-SVC-HL75EI7U6Y3GOE4U
-A KUBE-SERVICES -d 13.67.34.190/32 -p tcp -m comment --comment "default/vote-lb-local: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HL75EI7U6Y3GOE4U
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/vote-lb-local:" -m tcp --dport 30588 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-local:" -m tcp --dport 30588 -j KUBE-XLB-HL75EI7U6Y3GOE4U
-A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-XLB-HL75EI7U6Y3GOE4U
-A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-HL75EI7U6Y3GOE4U -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-HL75EI7U6Y3GOE4U
-A KUBE-XLB-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: has no local endpoints" -j KUBE-MARK-DROP
-A KUBE-SVC-HL75EI7U6Y3GOE4U -j KUBE-SEP-H2DWUQDTMSNHW2DQ
-A KUBE-SEP-H2DWUQDTMSNHW2DQ -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-H2DWUQDTMSNHW2DQ -p tcp -m tcp -j DNAT --to-destination 10.244.0.6:80
Note: compared with vote-lb's iptables chains & rules, the chains & rules above don't have the -A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-MARK-MASQ rule, which means no SNAT is applied and the source IP address is preserved with the option "externalTrafficPolicy: Local".
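A simple way to verify this is to browse to the vote-lb-local external IP from an external client and then check the conntrack table on aks-nodepool1-41808012-2: unlike the NodePort example in section 4.2, the entry should keep the original client address in both directions because no MASQUERADE mark was set (a sketch; output omitted):
#conntrack -L | grep 13.67.34.190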
The iptables chains and rules on agent node aks-nodepool1-41808012-2, which has a local endpoint, replace the drop rule with -A KUBE-XLB-HL75EI7U6Y3GOE4U -m comment --comment "Balancing rule 0 for default/vote-lb-local:" -j KUBE-SEP-H2DWUQDTMSNHW2DQ:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.244.84/32 -p tcp -m comment --comment "default/vote-lb-local: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.244.84/32 -p tcp -m comment --comment "default/vote-lb-local: cluster IP" -m tcp --dport 80 -j KUBE-SVC-HL75EI7U6Y3GOE4U
-A KUBE-SERVICES -d 13.67.34.190/32 -p tcp -m comment --comment "default/vote-lb-local: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HL75EI7U6Y3GOE4U
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/vote-lb-local:" -m tcp --dport 30588 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-local:" -m tcp --dport 30588 -j KUBE-XLB-HL75EI7U6Y3GOE4U
-A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-XLB-HL75EI7U6Y3GOE4U
-A KUBE-FW-HL75EI7U6Y3GOE4U -m comment --comment "default/vote-lb-local: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-HL75EI7U6Y3GOE4U -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-HL75EI7U6Y3GOE4U
-A KUBE-XLB-HL75EI7U6Y3GOE4U -m comment --comment "Balancing rule 0 for default/vote-lb-local:" -j KUBE-SEP-H2DWUQDTMSNHW2DQ
-A KUBE-SVC-HL75EI7U6Y3GOE4U -j KUBE-SEP-H2DWUQDTMSNHW2DQ
-A KUBE-SEP-H2DWUQDTMSNHW2DQ -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-H2DWUQDTMSNHW2DQ -p tcp -m tcp -j DNAT --to-destination 10.244.0.6:80
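Because a node without local endpoints simply drops the traffic, the cloud load balancer has to steer connections away from such nodes. For services with "externalTrafficPolicy: Local", Kubernetes allocates a healthCheckNodePort served by kube-proxy on every node; the load balancer probes it, and only nodes that actually host endpoints report healthy. A quick check might look like the sketch below, where the node IP and port are placeholders.
#kubectl get svc vote-lb-local -o jsonpath='{.spec.healthCheckNodePort}'
#curl -s -o /dev/null -w '%{http_code}\n' http://<node-ip>:<health-check-node-port>/healthz
A node with local endpoints typically answers 200, while a node without them answers 503.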
5.5 LoadBalancer with session affinity(vote-lb-sa)
It uses the same iptables chains & rules as NodePort to implement session affinity, and only adds the LoadBalancer's entrance chain (KUBE-FW-EZQBRNY2ES44QYCG). The session affinity implementation itself is the same as for the NodePort & ClusterIP services.
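The 10800-second window seen in the recent match below corresponds to the service's session affinity timeout (10800 seconds, i.e. 3 hours, is the default). A sketch of how such a service could be declared, again assuming an app=vote pod label:
#kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vote-lb-sa
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # matches --seconds 10800 in the rules below
  selector:
    app: vote                 # assumed pod label
  ports:
  - port: 80
    targetPort: 80
EOF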
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.66.132/32 -p tcp -m comment --comment "default/vote-lb-sa: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.66.132/32 -p tcp -m comment --comment "default/vote-lb-sa: cluster IP" -m tcp --dport 80 -j KUBE-SVC-EZQBRNY2ES44QYCG
-A KUBE-SERVICES -d 104.215.184.203/32 -p tcp -m comment --comment "default/vote-lb-sa: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-EZQBRNY2ES44QYCG
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-sa:" -m tcp --dport 30875 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-sa:" -m tcp --dport 30875 -j KUBE-SVC-EZQBRNY2ES44QYCG
-A KUBE-FW-EZQBRNY2ES44QYCG -m comment --comment "default/vote-lb-sa: loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-EZQBRNY2ES44QYCG -m comment --comment "default/vote-lb-sa: loadbalancer IP" -j KUBE-SVC-EZQBRNY2ES44QYCG
-A KUBE-FW-EZQBRNY2ES44QYCG -m comment --comment "default/vote-lb-sa: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-SVC-EZQBRNY2ES44QYCG -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-VSZY3UJ2EK65XWND --mask 255.255.255.255 --rsource -j KUBE-SEP-VSZY3UJ2EK65XWND
-A KUBE-SVC-EZQBRNY2ES44QYCG -j KUBE-SEP-VSZY3UJ2EK65XWND
-A KUBE-SEP-VSZY3UJ2EK65XWND -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-VSZY3UJ2EK65XWND -p tcp -m recent --set --name KUBE-SEP-VSZY3UJ2EK65XWND --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.6:80
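The recent match keeps its per-source tracking table in procfs under the name given by --name, so the tracked client IPs and their last-seen timestamps can be inspected directly on the node (requires root):
#cat /proc/net/xt_recent/KUBE-SEP-VSZY3UJ2EK65XWND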
5.6 LoadBalancer with "externalTrafficPolicy: Local" and session affinity(vote-lb-local-sa)
It's a combination of LoadBalancer with externalTrafficPolicy: Local and LoadBalancer with session affinity.
The iptables chains & rules on agent node aks-nodepool1-41808012-1, which has no local endpoints, are below
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.191.143/32 -p tcp -m comment --comment "default/vote-lb-local-sa: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.191.143/32 -p tcp -m comment --comment "default/vote-lb-local-sa: cluster IP" -m tcp --dport 80 -j KUBE-SVC-26OY5KYCLRTVV6SV
-A KUBE-SERVICES -d 104.215.186.73/32 -p tcp -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-26OY5KYCLRTVV6SV
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/vote-lb-local-sa:" -m tcp --dport 30423 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-local-sa:" -m tcp --dport 30423 -j KUBE-XLB-26OY5KYCLRTVV6SV
-A KUBE-FW-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -j KUBE-XLB-26OY5KYCLRTVV6SV
-A KUBE-FW-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-26OY5KYCLRTVV6SV -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-26OY5KYCLRTVV6SV
-A KUBE-XLB-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa: has no local endpoints" -j KUBE-MARK-DROP
-A KUBE-SVC-26OY5KYCLRTVV6SV -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-ZRORXIYLQOYHSTET --mask 255.255.255.255 --rsource -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-SVC-26OY5KYCLRTVV6SV -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-SEP-ZRORXIYLQOYHSTET -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-ZRORXIYLQOYHSTET -p tcp -m recent --set --name KUBE-SEP-ZRORXIYLQOYHSTET --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.6:80
The iptables chains and rules on agent node aks-nodepool1-41808012-2, which has local endpoints, are below
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.0.191.143/32 -p tcp -m comment --comment "default/vote-lb-local-sa: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.0.191.143/32 -p tcp -m comment --comment "default/vote-lb-local-sa: cluster IP" -m tcp --dport 80 -j KUBE-SVC-26OY5KYCLRTVV6SV
-A KUBE-SERVICES -d 104.215.186.73/32 -p tcp -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-26OY5KYCLRTVV6SV
-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "default/vote-lb-local-sa:" -m tcp --dport 30423 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/vote-lb-local-sa:" -m tcp --dport 30423 -j KUBE-XLB-26OY5KYCLRTVV6SV
-A KUBE-FW-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -j KUBE-XLB-26OY5KYCLRTVV6SV
-A KUBE-FW-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa: loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-26OY5KYCLRTVV6SV -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-26OY5KYCLRTVV6SV
-A KUBE-XLB-26OY5KYCLRTVV6SV -m comment --comment "default/vote-lb-local-sa:" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-ZRORXIYLQOYHSTET --mask 255.255.255.255 --rsource -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-XLB-26OY5KYCLRTVV6SV -m comment --comment "Balancing rule 0 for default/vote-lb-local-sa:" -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-SVC-26OY5KYCLRTVV6SV -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-ZRORXIYLQOYHSTET --mask 255.255.255.255 --rsource -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-SVC-26OY5KYCLRTVV6SV -j KUBE-SEP-ZRORXIYLQOYHSTET
-A KUBE-SEP-ZRORXIYLQOYHSTET -s 10.244.0.6/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-ZRORXIYLQOYHSTET -p tcp -m recent --set --name KUBE-SEP-ZRORXIYLQOYHSTET --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.244.0.6:80
5.7 LoadBalancer without any endpoints(vote-lb-none)
It is similar to NodePort without any endpoints; the iptables chains & rules are below
-A KUBE-EXTERNAL-SERVICES -p tcp -m comment --comment "default/vote-lb-none: has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 32095 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 10.0.174.123/32 -p tcp -m comment --comment "default/vote-lb-none: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
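The effect of the REJECT rule can be observed from inside the cluster: a connection to the service's cluster IP fails immediately instead of timing out, because the icmp-port-unreachable reply is translated into "connection refused" for TCP clients. A sketch, assuming a throwaway busybox pod may be created:
#kubectl run tmp --rm -it --image=busybox --restart=Never -- wget -qO- -T 3 http://10.0.174.123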
6 Summary
To wrap it up,
- There are 3 types of services in kubernetes: ClusterIP, NodePort and LoadBalancer.
- A LoadBalancer service implicitly includes a NodePort and a ClusterIP service.
- A NodePort service implicitly includes a ClusterIP service.
- Kubernetes injects custom chains into the iptables NAT table to implement its service model: pod-to-service traffic uses DNAT, external-to-service traffic uses SNAT plus DNAT, and a pod connecting to a service that routes back to itself uses hairpin NAT.
- NodePort and externalIPs reserve a listening port on the agent node to avoid conflicts with other applications.
- Session affinity ensures requests from the same client are always routed to the same back-end pod.
- "externalTrafficPolicy: Local" preserves the client source IP (no SNAT) and proxies inbound traffic only to endpoints local to the node.
- If no backend pod matches the service's selector, REJECT rules are inserted into iptables.
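For troubleshooting, the full set of service-related chains on a node can be dumped in one pass, which is usually the quickest way to trace how a particular service IP or node port is handled (a sketch, run as root on the agent node):
#iptables-save -t nat | grep 'KUBE-'
#iptables-save -t filter | grep 'KUBE-'
The nat table holds the DNAT/SNAT chains discussed in this post, while the REJECT rules for services without endpoints live in the filter table.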