Calico pod communication by hand

When we drop the mycni binary and conflist into the node's directories, the pod goes from Pending to Running:

/opt/cni/bin/mycni
/etc/cni/net.d/00-mycni.conflist
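As a minimal sketch (the mycni plugin itself is hypothetical here, and the cniVersion, network name and plugin fields are illustrative assumptions, not taken from the original setup), the conflist that kubelet picks up could look like this:

# a minimal conflist for the hypothetical mycni plugin (illustrative only)
cat > /etc/cni/net.d/00-mycni.conflist <<EOL
{
  "cniVersion": "0.3.1",
  "name": "mycni-network",
  "plugins": [
    { "type": "mycni" }
  ]
}
EOL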

calico

The cluster uses the Calico plugin.

Deployment

containers:
- image: harbor.archeros.cn:443/library/ake/calico/node:v3.18.0-arm64
  env:
  - name: IP_AUTODETECTION_METHOD
    value: interface=data

  - name: CALICO_NETWORKING_BACKEND
    valueFrom:
      configMapKeyRef:
        key: calico_backend
        name: calico-config

  - name: CALICO_IPV4POOL_IPIP
    value: Always
  - name: CALICO_IPV4POOL_VXLAN
    value: Never
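CALICO_NETWORKING_BACKEND is read from the calico-config ConfigMap referenced above. A quick way to check which backend is in use (assuming Calico is installed in the kube-system namespace, which is an assumption, not shown in the original manifest):

# prints the value of the calico_backend key, typically "bird" when BGP routing is in use
kubectl -n kube-system get configmap calico-config -o jsonpath='{.data.calico_backend}'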

The IPPool object

# calicoctl get ippool default-ipv4-ippool -o yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  creationTimestamp: "2023-05-08T02:48:58Z"
  name: default-ipv4-ippool
  resourceVersion: "421083"
  uid: 28cf34ae-d5e4-4efd-b8a2-67ae7917412f
spec:
  blockSize: 26
  cidr: 10.244.0.0/16
  ipipMode: Always
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never
  1. cidr must match kube-controller-manager --cluster-cidr=10.244.0.0/16
  2. when ipipMode and vxlanMode are both Never, Calico runs in plain BGP (unencapsulated) mode; here ipipMode is Always, i.e. IPIP encapsulation. A way to switch modes is sketched below.
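As a sketch (verify the exact syntax against your Calico version's docs), the encapsulation mode can be changed on the live pool with calicoctl patch; setting both modes to Never puts the pool into plain BGP mode:

# switch the default pool to plain BGP (no encapsulation)
calicoctl patch ippool default-ipv4-ippool -p '{"spec": {"ipipMode": "Never", "vxlanMode": "Never"}}'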

Default route inside the pod

Enter the pod:

# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-busybox-68dc44bf88-j7scp 1/1 Running 0 24m 10.10.50.4 master4 <none> <none>
my-busybox-68dc44bf88-qhpss 1/1 Running 0 24m 10.10.50.16 master1 <none> <none>

# kubectl exec -it my-busybox-68dc44bf88-qhpss sh
# ip a
3: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 12:ca:8c:21:5a:c4 brd ff:ff:ff:ff:ff:ff
inet 10.10.50.16/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::10ca:8cff:fe21:5ac4/64 scope link
valid_lft forever preferred_lft forever
Check the routes inside the pod
/ # ip r
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link

Check the ARP table
/ # ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 1 STALE
178.104.227.4 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 0 STALE

The pod interface IP 10.10.50.16/32 is a /32, so it shares no subnet with any other machine and all traffic is forced onto a pure layer-3 routed path. When the pod resolves its default gateway 169.254.1.1, the local ARP table answers with the all-ee gateway MAC (ee:ee:ee:ee:ee:ee). That MAC actually belongs to the host side of the veth pair, so the frame reaches the host-side interface at layer 2.

The pod's default route points to 169.254.1.1, which is not the address of any interface, yet it plays a special role.
Whenever the pod talks to an external address it sends traffic toward the default gateway 169.254.1.1 and issues an ARP request for it. The host-side veth (veth87f321fcac1 here; in a stock Calico setup this is the pod's cali* interface) receives the request and, because proxy ARP is enabled on it, replies with its own MAC address, thereby steering the pod's traffic onto the host-side veth.
Once the traffic reaches the host, it is forwarded according to the host's routing table.

During this routed forwarding the IP header stays unchanged; only the src/dst MACs are rewritten per egress interface, so each hop degenerates into plain layer-2 delivery.
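To watch this proxy-ARP exchange directly, a sketch (the veth name is the one shown in the output below and differs per pod; 8.8.8.8 is just an arbitrary external address):

# on the host: watch ARP on the pod's host-side veth
tcpdump -enli veth87f321fcac1 arp

# inside the pod: forget the cached gateway entry and send traffic to any external address
# (if the busybox ip applet lacks "flush", just wait for the entry to go STALE)
ip neigh flush dev eth0
ping -c 1 8.8.8.8
# the capture should show the request for 169.254.1.1 answered from ee:ee:ee:ee:ee:ee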

The corresponding veth interface on the host side:

17: veth87f321fcac1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-4609f7ab-b55d-077d-c455-3d32c5a7b655
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
# cat /proc/sys/net/ipv4/conf/veth87f321fcac1/proxy_arp
1

With proxy ARP enabled, the interface answers ARP requests on behalf of other addresses: here it replies for the pod's default gateway (169.254.1.1) with its own MAC.

Manually simulating Calico pod-host communication

# Enable IP forwarding so this host acts as a router
cat > /etc/sysctl.d/30-ipforward.conf<<EOL
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
EOL
sysctl -p /etc/sysctl.d/30-ipforward.conf

# host-side setup
ip netns add ns3
ip link add tap3 type veth peer name veth1 netns ns3
ip link set address ee:ee:ee:ee:ee:ee dev tap3   # pin the host-side MAC (the all-ee gateway MAC)
echo 1 > /proc/sys/net/ipv4/conf/tap3/proxy_arp
ip link set tap3 up
ip r a 10.42.1.13 dev tap3                       # host route to the "pod" IP via tap3 (also needed for proxy ARP to answer)

# pod-side setup
ip netns exec ns3 ip addr add 10.42.1.13/32 dev veth1
ip netns exec ns3 ip route add 169.254.1.1 dev veth1             # link route to the gateway
ip netns exec ns3 ip route add default via 169.254.1.1 dev veth1 # default route
ip netns exec ns3 ip neigh add 169.254.1.1 dev veth1 lladdr ee:ee:ee:ee:ee:ee
ip netns exec ns3 ip link set veth1 up

# ping the pod interface from the host
ping 10.42.1.13
PING 10.42.1.13 (10.42.1.13) 56(84) bytes of data.
64 bytes from 10.42.1.13: icmp_seq=1 ttl=64 time=48.7 ms

# ping the host from the pod interface
ip netns exec ns3 ping 178.104.163.26
PING 178.104.163.26 (178.104.163.26) 56(84) bytes of data.
64 bytes from 178.104.163.26: icmp_seq=1 ttl=64 time=0.196 ms
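A quick way to confirm the routing decisions on both sides (a sketch; 178.104.163.26 is the host address used in the ping above):

# from inside the "pod": the host address resolves via the 169.254.1.1 default route
ip netns exec ns3 ip route get 178.104.163.26

# from the host: the pod /32 resolves via tap3
ip route get 10.42.1.13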

Calico node-to-node communication

Node-to-node communication splits into same-network and cross-network cases; the difference is the next-hop address that the tunl0 routes point at (via). A quick way to inspect this on a live node is sketched after the list.

  1. same network: the next hop is the peer node itself
  2. cross network: the next hop is the local interface/address that can reach the peer host
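A sketch, assuming calicoctl is available on the node:

# BGP peers and their session state for this node
calicoctl node status

# routes that bird has programmed into the kernel
ip route show proto bird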

The analysis below walks through the cross-node, cross-network scenario.

Cross-node, cross-network route analysis

Here data is a NIC inside the VM (Kubernetes runs in VMs). Packets leaving this NIC are forwarded by the underlying vrouter control plane, which looks up the correct egress port; when the destination lives on other nodes, the local vrouter forwards the packet to the peer node's vrouter, which then finds the right egress port.

Inside this node's VM:

# ip a s data
4: data: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:b2:76:e8:07 brd ff:ff:ff:ff:ff:ff
inet 178.118.232.7/24 brd 178.118.232.255 scope global noprefixroute data
valid_lft forever preferred_lft forever

# ip r
default via 178.118.230.1 dev eth0 proto dhcp metric 100

// these pod destination blocks live on other nodes; the next hop 178.118.232.x (a helper address, used only to steer routing) sends the packet out tunl0, which performs the IPIP encapsulation, and the via address becomes the outer destination IP of the encapsulated packet
10.244.36.192/26 via 178.118.232.191 dev tunl0 proto bird onlink
10.244.57.64/26 via 178.118.232.35 dev tunl0 proto bird onlink
10.244.136.0/26 via 178.118.232.7 dev tunl0 proto bird onlink
10.244.137.64/26 via 178.118.232.221 dev tunl0 proto bird onlink
10.244.166.128/26 via 178.118.232.198 dev tunl0 proto bird onlink
10.244.175.64/26 via 178.118.232.63 dev tunl0 proto bird onlink
// after IPIP encapsulation on tunl0, the outer packet matches this route and leaves through the local data NIC
178.118.232.0/24 dev data proto kernel scope link src 178.118.232.148 metric 102

blackhole 10.244.180.0/26 proto bird
// all /32 addresses of pods on this host, reachable directly via their cali* interfaces
10.244.180.4 dev cali97a0a28d8e4 scope link
10.244.180.10 dev cali119b29c78fc scope link
10.244.180.15 dev cali15c7619fccc scope link
10.244.180.16 dev cali8ece1094b64 scope link
10.244.180.17 dev cali17c27a1ff33 scope link
10.244.180.20 dev caliaeb7f8a6c25 scope link
10.244.180.27 dev cali90e4a410491 scope link
10.244.180.32 dev cali1abad8afd57 scope link
10.244.180.35 dev cali1bff4958e4e scope link
10.244.180.48 dev cali6c3bcfc264e scope link
10.244.180.55 dev cali1ad0b3407f2 scope link
10.244.180.58 dev calibbb6448fd94 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
178.118.230.0/24 dev eth0 proto kernel scope link src 178.118.230.223 metric 100
178.118.231.0/24 dev eth1 proto kernel scope link src 178.118.231.238 metric 101
178.118.232.0/24 dev data proto kernel scope link src 178.118.232.148 metric 102

# ip a s tunl0
10: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1430 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
inet 10.244.180.11/32 scope global tunl0
valid_lft forever preferred_lft forever

Questions:

  1. Shouldn't the next-hop address (178.118.232.x) be in the same subnet as the egress interface tunl0? In this environment they are not.
    Ans: tunl0 does no ARP, so there is no same-subnet restriction and it accepts anything routed to it. Because of the IPIP encapsulation, the via address becomes the outer destination IP of the encapsulated packet, which is then routed again on this host.

  2. If the next hop is on the same network, a single one should be enough; why does this environment have several (178.118.232.x)?
    Ans: when there are multiple equal-cost paths to a destination, ECMP (Equal Cost Multipath) spreads traffic across them for load balancing and redundancy, so multiple next-hop addresses appear.
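The NOARP flag and the IPIP tunnel attributes of tunl0 can be confirmed from the link details (a sketch; -d prints tunnel-specific information):

ip -d link show tunl0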

Route analysis:

  • scope link: communication between devices on the same link; the route applies only to hosts directly reachable through the dev cali* interface and is never forwarded toward other networks or routers, which keeps the traffic confined to the local link.
  • proto dhcp: the route was learned via DHCP; the host obtained the address (and gateway) for this interface from a DHCP server.
  • proto kernel: routes the Linux kernel installs automatically when an address is configured on an interface, e.g. the connected routes for the docker0 and enp1s0 NICs (the proto tag can also be used to filter the table, as sketched below).
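For reference, the routing table can be filtered by these proto tags:

ip route show proto kernel   # connected routes created when addresses were assigned
ip route show proto bird     # routes programmed by bird/Calico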

tunl0 is the IP-in-IP encapsulation interface; after encapsulation, the packet leaves through the data NIC, which is the SDN vrouter's tap device on the VM side.

From the network-namespace perspective

master2# ip netns list
cni-9cabc924-357e-2fed-0d21-4154a358a505 (id: 11)
cni-d5810fa6-d845-0aee-82d3-7352c89a0574 (id: 3)
cni-c3170d7f-4ae5-c3c0-18f9-06ada858cd94 (id: 0)

# ip netns exec cni-9cabc924-357e-2fed-0d21-4154a358a505 sh
sh-4.2# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if534613: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1430 qdisc noqueue state UP group default
link/ether 6a:4a:f8:11:04:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.180.32/32 brd 10.244.180.32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::684a:f8ff:fe11:4cb/64 scope link
valid_lft forever preferred_lft forever
sh-4.2# exit
exit
[root@master2 ~]# kubectl get pod -A -o wide |grep 10.244.180.32
monitoring prometheus-k8s-1 3/3 Running 1 (151m ago) 151m 10.244.180.32 master2 <none> <none>

SDN vrouter forwarding

After cross-node traffic leaves the VM through data, the local vrouter consults its routing table to determine the egress port; if there is no local match, the packet is forwarded to the vrouters on other nodes, and that is how cross-node communication is completed.

As the name vrouter suggests, it still selects the egress based on routing rules.

Manually simulating cross-node, same-network communication

This relies on bird to distribute routes between the nodes.

bird

BIRD (BIRD Internet Routing Daemon) is routing software for Linux and other Unix-like systems that implements several routing protocols, such as BGP, OSPF and RIP.

BIRD maintains a number of routing tables in memory and updates routing rules by exchanging routing information according to the configured protocols.

Calico uses BGP to exchange container-network routes between the nodes.
The BGP implementation Calico uses is Bird, and here it only acts as a routing-table manager.
Calico's bird build wraps upstream bird and additionally programs routes for the IP-in-IP mode.
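Once bird is running (it is started below with -s /var/run/bird.ctl), its state can be inspected through the control socket; a sketch:

birdcl -s /var/run/bird.ctl show protocols   # state of the BGP mesh sessions
birdcl -s /var/run/bird.ctl show route       # routes bird has learned and exported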

Download and install bird on every node:

docker pull calico/node:v3.11.1
docker run --name calico-temp -d calico/node:v3.11.1 sleep 200
docker cp calico-temp:/usr/bin/bird ./
docker cp calico-temp:/usr/bin/bird6 ./
docker cp calico-temp:/usr/bin/birdcl ./
docker stop calico-temp
docker rm calico-temp
cp bird bird6 birdcl /usr/local/sbin/
chmod +x /usr/local/sbin/bird*

Configure BIRD

mkdir /etc/bird-cfg/

cat > /etc/bird-cfg/bird.cfg << EOL
protocol static {
# IP blocks for this host. Bird will end up generating two routes; this is the blackhole one.
route 10.42.1.0/24 blackhole;
}

# Aggregation of routes on this host; export the block, nothing beneath it.
function calico_aggr ()
{
# Block 10.42.1.0/24 is confirmed
if ( net = 10.42.1.0/24 ) then { accept; }
if ( net ~ 10.42.1.0/24 ) then { reject; }
}


filter calico_export_to_bgp_peers {
calico_aggr();
if ( net ~ 10.42.0.0/16 ) then {
accept;
}
reject;
}

filter calico_kernel_programming {
if ( net ~ 10.42.0.0/16 ) then {
krt_tunnel = "tunl0";
accept;
}
accept;
}

router id 10.211.55.5;

# Configure synchronization between routing tables and kernel.
protocol kernel {
learn; # Learn all alien routes from the kernel
persist; # Don't remove routes on bird shutdown
scan time 2; # Scan kernel routing table every 2 seconds
import all;
export filter calico_kernel_programming; # Default is export none
graceful restart; # Turn on graceful restart to reduce potential flaps in
# routes when reloading BIRD configuration. With a full
# automatic mesh, there is no way to prevent BGP from
# flapping since multiple nodes update their BGP
# configuration at the same time, GR is not guaranteed to
# work correctly in this scenario.
}

# Watch interface up/down events.
protocol device {
debug all;
scan time 2; # Scan interfaces every 2 seconds
}

protocol direct {
debug all;
interface -"tap*", "*"; # Exclude tap* but include everything else.
}

# Template for all BGP clients
template bgp bgp_template {
debug all;
description "Connection to BGP peer";
local as 64512;
multihop;
gateway recursive; # This should be the default, but just in case.
import all; # Import all routes, since we don't know what the upstream
# topology is and therefore have to trust the ToR/RR.
export filter calico_export_to_bgp_peers; # Only want to export routes for workloads.
source address 10.211.55.5; # The local address we use for the TCP connection
add paths on;
graceful restart; # See comment in kernel section about graceful restart.
connect delay time 2;
connect retry time 5;
error wait time 5,30;
}

protocol bgp Mesh_10_211_55_6 from bgp_template {
neighbor 10.211.55.6 as 64512;
}
EOL
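The peer node needs the mirror-image configuration: its own pod block (10.42.2.0/24) in the static blackhole route and in calico_aggr, its own address as router id and source address, and a Mesh_* bgp section pointing back at this node. Before starting the daemon, the file can be syntax-checked; a sketch (-p asks bird to parse the config and exit):

bird -p -c /etc/bird-cfg/bird.cfg && echo "bird.cfg parses OK"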

The per-node configuration follows.

Node 178.104.163.104 configuration

Environment preparation

cat > /etc/sysctl.d/30-ipforward.conf<<EOL
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
EOL
sysctl -p /etc/sysctl.d/30-ipforward.conf

Configure the host-pod communication network

ip netns add ns1
ip netns add ns2

ip link add tap1 type veth peer name veth1 netns ns1
ip link add tap2 type veth peer name veth1 netns ns2

ip l set address ee:ee:ee:ee:ee:ee dev tap1
ip l set address ee:ee:ee:ee:ee:ee dev tap2

echo 1 > /proc/sys/net/ipv4/conf/tap1/proxy_arp
echo 1 > /proc/sys/net/ipv4/conf/tap2/proxy_arp

ip link set tap1 up
ip link set tap2 up

ip r a 10.42.1.11 dev tap1
ip r a 10.42.1.12 dev tap2

ip netns exec ns1 ip addr add 10.42.1.11/32 dev veth1
ip netns exec ns2 ip addr add 10.42.1.12/32 dev veth1

ip netns exec ns1 ip link set veth1 up
ip netns exec ns2 ip link set veth1 up

ip netns exec ns1 ip link set lo up
ip netns exec ns2 ip link set lo up

ip netns exec ns1 ip route add 169.254.1.1 dev veth1
ip netns exec ns2 ip route add 169.254.1.1 dev veth1

ip netns exec ns1 ip route add default via 169.254.1.1 dev veth1
ip netns exec ns2 ip route add default via 169.254.1.1 dev veth1

ip netns exec ns1 ip neigh add 169.254.1.1 dev veth1 lladdr ee:ee:ee:ee:ee:ee
ip netns exec ns2 ip neigh add 169.254.1.1 dev veth1 lladdr ee:ee:ee:ee:ee:ee

Cross-node communication requires knowing routes to the peer node's pods; configuring them by hand does not scale, so Calico uses bird + felix to manage routes across cluster nodes.

modprobe ipip
ip a a 10.42.1.0/32 brd 10.42.1.0 dev tunl0
ip link set tunl0 up
iptables -F
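For reference, the route that bird will install on this node (it shows up in the diff further down, tagged proto bird) is equivalent to adding by hand:

# reach the peer node's pod block via tunl0; onlink is needed because the via address
# is not on tunl0's own subnet
ip route add 10.42.2.0/24 via 178.104.163.28 dev tunl0 onlink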

Node 178.104.163.28 configuration

cat > /etc/sysctl.d/30-ipforward.conf<<EOL
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
EOL
sysctl -p /etc/sysctl.d/30-ipforward.conf
ip netns add ns1
ip netns add ns2

ip link add tap1 type veth peer name veth1 netns ns1
ip link add tap2 type veth peer name veth1 netns ns2

ip l set address ee:ee:ee:ee:ee:ee dev tap1
ip l set address ee:ee:ee:ee:ee:ee dev tap2

echo 1 > /proc/sys/net/ipv4/conf/tap1/proxy_arp
echo 1 > /proc/sys/net/ipv4/conf/tap2/proxy_arp

ip link set tap1 up
ip link set tap2 up

ip r a 10.42.2.11 dev tap1
ip r a 10.42.2.12 dev tap2

ip netns exec ns1 ip addr add 10.42.2.11/32 dev veth1
ip netns exec ns2 ip addr add 10.42.2.12/32 dev veth1

ip netns exec ns1 ip link set veth1 up
ip netns exec ns2 ip link set veth1 up

ip netns exec ns1 ip link set lo up
ip netns exec ns2 ip link set lo up

ip netns exec ns1 ip route add 169.254.1.1 dev veth1
ip netns exec ns2 ip route add 169.254.1.1 dev veth1

ip netns exec ns1 ip route add default via 169.254.1.1 dev veth1
ip netns exec ns2 ip route add default via 169.254.1.1 dev veth1

ip netns exec ns1 ip neigh add 169.254.1.1 dev veth1 lladdr ee:ee:ee:ee:ee:ee
ip netns exec ns2 ip neigh add 169.254.1.1 dev veth1 lladdr ee:ee:ee:ee:ee:ee
modprobe ipip
ip a a 10.42.2.0/32 brd 10.42.2.0 dev tunl0
ip link set tunl0 up
iptables -F

Start bird on 178.104.163.104

# ./bird -R -s /var/run/bird.ctl -d -c /etc/bird-cfg/bird.cfg
bird: device1: Initializing
bird: direct1: Initializing
bird: Mesh_178_104_163_28: Initializing
bird: device1: Starting
bird: device1: Scanning interfaces
bird: device1: Connected to table master
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Mesh_178_104_163_28: Starting
bird: Mesh_178_104_163_28: State changed to start
bird: Graceful restart started
bird: Started
bird: device1: State changed to up
bird: direct1 < interface lo goes up
bird: direct1 < primary address 127.0.0.0/8 on interface lo added
bird: direct1 < interface eth0 goes up
bird: direct1 < primary address 178.104.163.0/24 on interface eth0 added
bird: direct1 > added [best] 178.104.163.0/24 dev eth0
bird: direct1 < interface docker0 goes up
bird: direct1 < primary address 172.17.0.0/16 on interface docker0 added
bird: direct1 > added [best] 172.17.0.0/16 dev docker0
bird: direct1 < interface tunl0 goes up
bird: direct1 < primary address 10.42.1.0/32 on interface tunl0 added
bird: direct1 > added [best] 10.42.1.0/32 dev tunl0
bird: direct1 < interface tap1 created
bird: direct1 < interface tap2 created
bird: direct1 < interface tap3 created
bird: direct1: State changed to up
bird: Mesh_178_104_163_28: Started
bird: Mesh_178_104_163_28: Connect delayed by 2 seconds
bird: Mesh_178_104_163_28: Connecting to 178.104.163.28 from local address 178.104.163.104
bird: device1: Scanning interfaces
// the peer node's bird is not running yet, hence Connection refused
bird: Mesh_178_104_163_28: Connection lost (Connection refused)
bird: Mesh_178_104_163_28: Connect delayed by 2 seconds
bird: Mesh_178_104_163_28: Connecting to 178.104.163.28 from local address 178.104.163.104
bird: device1: Scanning interfaces
bird: Mesh_178_104_163_28: Connection lost (Connection refused)
bird: Mesh_178_104_163_28: Connect delayed by 2 seconds
bird: Mesh_178_104_163_28: Connecting to 178.104.163.28 from local address 178.104.163.104

// the peer node's bird process has now started
bird: device1: Scanning interfaces
bird: Mesh_178_104_163_28: Connected
bird: Mesh_178_104_163_28: Sending OPEN(ver=4,as=64512,hold=240,id=b268a368)
bird: Mesh_178_104_163_28: Got OPEN(as=64512,hold=240,id=b268a31c)
bird: Mesh_178_104_163_28: Sending KEEPALIVE
bird: Mesh_178_104_163_28: Got KEEPALIVE
bird: Mesh_178_104_163_28: BGP session established
bird: Mesh_178_104_163_28: Connected to table master
bird: Mesh_178_104_163_28: State changed to feed
bird: Mesh_178_104_163_28 < filtered out 0.0.0.0/0 via 178.104.163.1 on eth0
bird: Mesh_178_104_163_28 < filtered out 10.42.1.11/32 dev tap1
bird: Mesh_178_104_163_28 < filtered out 10.42.1.12/32 dev tap2
bird: Mesh_178_104_163_28 < filtered out 10.42.1.13/32 dev tap3
bird: Mesh_178_104_163_28 < added 10.42.1.0/24 blackhole
bird: Mesh_178_104_163_28 < filtered out 10.42.1.0/32 dev tunl0
bird: Mesh_178_104_163_28 < filtered out 178.104.163.0/24 dev eth0
bird: Mesh_178_104_163_28 < filtered out 172.17.0.0/16 dev docker0
bird: Mesh_178_104_163_28: State changed to up
bird: Graceful restart done
bird: Mesh_178_104_163_28: Sending UPDATE
bird: Mesh_178_104_163_28: Sending END-OF-RIB
bird: Mesh_178_104_163_28: Got UPDATE
bird: Mesh_178_104_163_28 > added [best] 10.42.2.0/24 via 178.104.163.28 on eth0
bird: Mesh_178_104_163_28 < rejected by protocol 10.42.2.0/24 via 178.104.163.28 on eth0
bird: Mesh_178_104_163_28: Got UPDATE
bird: Mesh_178_104_163_28: Got END-OF-RIB
bird: device1: Scanning interfaces
bird: device1: Scanning interfaces
bird: device1: Scanning interfaces
bird: device1: Scanning interfaces
bird: device1: Scanning interfaces
bird: device1: Scanning interfaces
(the daemon keeps running and scanning interfaces)

Starting bird on 178.104.163.28 produces similar logs.

After bird is running, two extra routes appear on server1. Because the two nodes are on the same network, the next hop 163.28 is server2's own interface; if they were on different networks, the via address would instead be the local address able to reach the peer host (the cross-network case described earlier).

# on 178.104.163.104
# diff r2 r1
< blackhole 10.42.1.0/24 proto bird
6d4
< 10.42.2.0/24 via 178.104.163.28 dev tunl0 proto bird onlink


# on 178.104.163.28
# diff r2 r1
< 10.42.1.0/24 via 178.104.163.104 dev tunl0 proto bird onlink
< blackhole 10.42.2.0/24 proto bird
  1. bird must generate the routes toward the peer node's pods.
  2. tunl0 on both nodes must be configured with the correct IP and broadcast address.

Verification

From node1, ns1 pings a pod IP on node2
# ip netns exec ns1 ping 10.42.2.11
PING 10.42.2.11 (10.42.2.11) 56(84) bytes of data.
64 bytes from 10.42.2.11: icmp_seq=1 ttl=62 time=1.62 ms
64 bytes from 10.42.2.11: icmp_seq=2 ttl=62 time=1.11 ms




From node2, ns1 pings the pod IP on node1
# ip netns exec ns1 ping 10.42.1.11
PING 10.42.1.11 (10.42.1.11) 56(84) bytes of data.
64 bytes from 10.42.1.11: icmp_seq=1 ttl=62 time=0.793 ms
64 bytes from 10.42.1.11: icmp_seq=2 ttl=62 time=1.19 ms

From node1, the ns1 pod pings node2's eth address: no reply
# ip netns exec ns1 ping 10.211.55.6
PING 10.211.55.6 (10.211.55.6) 56(84) bytes of data.
^C

Packets can be captured on tunl0 and on the tap devices, but not on the server NIC (enp0s5). Why?

Ans: tunl0 -> enp0s5 (encapsulation done in kernel space) => enp0s5 (decapsulation done in kernel space) -> tunl0; user space cannot capture these packets on enp0s5.
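One thing worth checking here (a sketch): on the physical NIC the traffic is IPIP-encapsulated, so a plain icmp filter only matches the inner header; filtering on the outer protocol (IP protocol 4) may still reveal the packets if they do traverse enp0s5:

# capture IPIP (IP protocol 4) frames on the physical NIC
tcpdump -enpli enp0s5 'ip proto 4'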

  1. node1 pod <=> node2 pod
  2. node1 pod <=> node1 eth
  3. node1 pod <= node2 eth, but not the reverse
  4. node1 eth => node2 pod, but not the reverse

Pinging node2's eth from a node1 pod gets no reply, while the reverse direction does work:

# ip netns exec ns1 ping 10.211.55.6
PING 10.211.55.6 (10.211.55.6) 56(84) bytes of data.

# ping 10.42.1.11
PING 10.42.1.11 (10.42.1.11) 56(84) bytes of data.
64 bytes from 10.42.1.11: icmp_seq=1 ttl=63 time=1.65 ms

Points to investigate:

node1 pod1 pings node2 pod2
capture on node2's tunl0
# tcpdump icmp -enpli tunl0

00:34:25.645026 ip: 10.42.1.11 > 10.42.2.11: ICMP echo request, id 51584, seq 7, length 64
00:34:25.645761 ip: 10.42.2.11 > 10.42.1.11: ICMP echo reply, id 51584, seq 7, length 64


node1 eth <=> node2 eth
# ping 10.211.55.6

on node2
# tcpdump icmp -enpli enp0s5
00:39:01.232519 00:1c:42:b5:29:64 > 00:1c:42:c9:21:1e, ethertype IPv4 (0x0800), length 98: 10.211.55.5 > 10.211.55.6: ICMP echo request, id 29, seq 1, length 64
00:39:01.232611 00:1c:42:c9:21:1e > 00:1c:42:b5:29:64, ethertype IPv4 (0x0800), length 98: 10.211.55.6 > 10.211.55.5: ICMP echo reply, id 29, seq 1, length 6

The captures show that the two flows differ: the pod-to-pod traffic seen on tunl0 carries the pod IPs (10.42.x.x), while the host-to-host traffic on enp0s5 carries the node IPs (10.211.55.x).

Conclusions

  1. bird generates two routes: a blackhole for the local block and the route toward pods on the other host.
  2. A pod can reach its own host and pods on other hosts.
  3. Every node can reach every pod, but in the other direction a pod can only reach its own host's eth, not other hosts' eth.

ref

  1. https://juejin.cn/post/6844903903117459469
  2. https://blog.mygraphql.com/zh/notes/cloud/calico/
  3. https://www.kancloud.cn/pshizhsysu/calico/2095717