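504 errors when Nginx proxy_pass points at an AWS ALB

Some of our backend services are being containerized. Because of legacy constraints, the existing production gateway and related infrastructure cannot be migrated into the k8s cluster in one step, so as a temporary workaround we use Nginx proxy_pass to forward traffic to an AWS ALB.
After running this setup for a while, however, Nginx started returning 504 errors. The backend services all checked out as healthy, and requests sent directly to the ALB were answered normally, which is what prompted this post.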

Problem description {#问题描述}

Our upstream configuration looks like this:

    location /xxx-service/ {
        proxy_pass http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/xxx-service/;
    }

Reloading Nginx restored normal service, but after a while the same problem came back. The Nginx error log showed the following:

    [error] 297612#297612: *2235585 no live upstreams while connecting to upstream, client: 3.0.xx.183, server: xxx.xxx.xxx, request: "GET /health HTTP/1.1", upstream: "http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/", host: "xxx.xxx.xxx"
    ...
    # after a reload, the error appears again some time later
    [error] 297612#297612: *2235596 no live upstreams while connecting to upstream, client: 210.3.xx.148, server: xxx.xxx.xxx, request: "GET /health HTTP/1.1", upstream: "http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/", host: "xxx.xxx.xxx"

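Troubleshooting {#问题排查}

The key detail in the description above is that service recovered right after nginx -s reload. That suggests the problem lies in Nginx rather than in the gateway application.
Going through the Nginx error log, I noticed that the upstream IP address had changed after the reload. That gave me a direction: the problem is probably in how Nginx resolves the domain name in proxy_pass.

From the documentation and other references:
When stock Nginx uses proxy_pass with an upstream given as a domain name, it performs a single DNS query for that name when the configuration is loaded and caches the resulting record. It does not query again until the next configuration reload or restart.
An AWS ALB, on the other hand, is a managed elastic load balancer whose IP addresses, by default, change from time to time: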
About dynamic change of IP address when using ELB | AWS
Application Load Balancer IP Change Event

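So when the ALB's IP addresses change, Nginx never sees the updated DNS record and keeps sending traffic to the stale addresses instead of the new upstream, which produces the 504 errors.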

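Solution {#解决方案}

With the cause identified, the fix is clear: when the ALB's DNS record changes, Nginx has to be made to pick up the latest record so that it routes traffic correctly.

So do we have to reload Nginx on a schedule? That is hardly elegant, and there are better ways to achieve the same thing:

Resolving dynamically with a variable {#使用变量动态解析}

The Nginx documentation for Module ngx_http_proxy_module contains this passage: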

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

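In other words, the proxy_pass value can contain a variable, in which case Nginx resolves the domain name through the configured resolver with a DNS query.

That makes things easy. We can change the configuration to the following: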

    server {
        listen 80;
        listen 443 ssl;
        server_name xxx.xxx;

        # DNS resolver to use
        resolver 8.8.8.8;

        # define a variable lb_upstream
        set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

        location / {
            # pass the upstream to proxy_pass as a variable
            proxy_pass $lb_upstream;
        }

        # ...
    }

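With this, Nginx is told to resolve the proxy_pass domain dynamically through the resolver. My expectation was that Nginx would issue a DNS query on every request to get the latest record.

To check that guess, I captured DNS traffic on the server running Nginx: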

    # Capture port 53 (DNS) traffic on all interfaces and keep only packets involving our 8.8.8.8 resolver:
    sudo tcpdump -i any -n 'udp port 53 or tcp port 53' | grep '8.8.8.8.53'

With tcpdump running, requests to our Nginx produce output similar to this:

    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes

    14:42:32.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:42:32.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:42:32.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119)
    14:42:32.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

Hmm, that doesn't look right: Nginx did not issue a DNS query for the proxy_pass domain on every request. Then I remembered an option described in the resolver documentation:

By default, nginx caches answers using the TTL value of a response. An optional valid parameter allows overriding it:
resolver 127.0.0.1 [::1]:5353 valid=30s;

In other words, the Nginx resolver honours the DNS TTL by default, and the TTL of an AWS ALB domain name is 60s by default:

    dig gateway-service-alb-xxx.xxx.elb.amazonaws.com @8.8.8.8

    ; <<>> DiG 9.16.1-Ubuntu <<>> gateway-service-alb-xxx.xxx.elb.amazonaws.com @8.8.8.8
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36168
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;gateway-service-alb-xxx.xxx.elb.amazonaws.com. IN A

    ;; ANSWER SECTION:
    gateway-service-alb-xxx.xxx.elb.amazonaws.com. 60 IN A x.x.x.x
    gateway-service-alb-xxx.xxx.elb.amazonaws.com. 60 IN A x.x.x.2x

    ;; Query time: 4 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)
    ;; WHEN: Wed Jan 24 15:56:47 CST 2024
    ;; MSG SIZE rcvd: 130

Let's add the valid parameter to the configuration above and verify again:

    # DNS resolver to use
    resolver 8.8.8.8 valid=1s;

After nginx -s reload, another capture shows that requests spaced 1s apart now each trigger a DNS query, which confirms the guess:

    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes

    14:43:32.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:32.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:32.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119)
    14:43:32.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

    14:43:33.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:33.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:33.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119)
    14:43:33.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

    14:43:34.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:34.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
    14:43:34.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119)
    14:43:34.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

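So do we really need to set valid by hand?
Not really. In this scenario it is enough to follow the DNS TTL of the ALB domain name. Overly frequent DNS queries are not a good thing: they add unnecessary overhead, and the ALB's TTL is not ours to decide anyway.
Only in cases such as DDNS, where the latest record has to be picked up quickly, does it make sense to tune valid manually.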
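
As a rough illustration of that advice, the resolver line can simply omit valid so that the record's own TTL is honoured. The sketch below makes two assumptions that are not part of the original setup: it points at 169.254.169.253, the Amazon-provided DNS address inside a VPC (only useful if Nginx runs in a VPC where that resolver is enabled), and it adds ipv6=off because the captures above show the AAAA queries coming back without an answer.

```nginx
# Sketch only, not the configuration used in this post.
# Assumption: Nginx runs inside a VPC where the Amazon-provided DNS
# (169.254.169.253) is reachable.
# No valid= parameter: cached answers expire according to the record's TTL
# (60s for the ALB name shown above).
# ipv6=off skips the AAAA lookups.
resolver 169.254.169.253 ipv6=off;

# Give up on a lookup after 5s instead of the default 30s
# (value chosen arbitrarily for illustration).
resolver_timeout 5s;
```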

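Problems introduced by using a variable for the upstream {#upstream使用变量带来的问题}

Using a variable solved the DNS resolution problem, but it introduced a new one: when the location is something other than / and the proxy_pass argument is a variable, proxy_pass does not behave quite the way we expect:

proxy_pass without a variable {#proxy-pass-不使用变量}

When proxy_pass does not use a variable and has no trailing /: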

    location /a/ {
        proxy_pass http://127.0.0.1:8080;
    }

When we request nginx/a/b/c, Nginx forwards the request to http://127.0.0.1:8080/a/b/c

When proxy_pass has a trailing /:

    location /a/ {
        proxy_pass http://127.0.0.1:8080/;  # note the trailing /
    }

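When we request nginx/a/b/c, Nginx strips the part of the URI matched by the location, so the request is forwarded to http://127.0.0.1:8080/b/c; the matched /a/ is removed.

proxy_pass with a variable {#proxy-pass-使用变量}

Once we use a variable for the upstream to get dynamic resolution, as described above, the behaviour becomes the following.
When proxy_pass uses a variable without a trailing /: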

    # DNS resolver to use
    resolver 8.8.8.8;

    # define a variable lb_upstream
    set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

    location /a/ {
        proxy_pass $lb_upstream;
    }

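When we request nginx/a/b/c, Nginx forwards the request to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/a/b/c. This matches the behaviour of proxy_pass without a variable and is what we expect.

When proxy_pass uses a variable and the upstream variable ends with a /: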

    # DNS resolver to use
    resolver 8.8.8.8;

    # define a variable lb_upstream
    set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/;  # note the trailing / here

    location /a/ {
        proxy_pass $lb_upstream;
    }

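When we request nginx/a/b/c, Nginx forwards the request straight to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/. That is neither the /b/c we wanted nor /a/b/c; the request simply goes to /.

So how do we get the request forwarded to /b/c as intended? The answer is to leave the trailing / off the variable and use rewrite inside the location instead: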

    # DNS resolver to use
    resolver 8.8.8.8;

    # define a variable lb_upstream
    set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

    location /a/ {
        rewrite ^/a/(.*) /$1 break;
        proxy_pass $lb_upstream;
    }

With this configuration, a request to nginx/a/b/c is forwarded to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/b/c
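
An alternative that is sometimes used instead of rewrite is to capture the remainder of the path in a regular-expression location and build the full URI in proxy_pass by hand. This is only a sketch: the capture name rest is made up for the example, and because a proxy_pass containing variables passes its URI as is, the query string has to be appended explicitly.

```nginx
resolver 8.8.8.8;
set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

# "rest" is a hypothetical capture name holding everything after /a/
location ~ ^/a/(?<rest>.*)$ {
    # /a/b/c?x=1 is forwarded as /b/c?x=1
    proxy_pass $lb_upstream/$rest$is_args$args;
}
```

Keep in mind that regex locations are matched with a different priority than prefix locations, and that the capture is taken from the normalized (decoded) URI, so paths containing encoded characters may not be forwarded byte for byte. The rewrite with break form avoids building the URI manually.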

Other approaches {#其他方法}

Besides the native Nginx approach above, there are plenty of other options:

ngx_http_upstream_dynamic_module {#ngx-http-upstream-dynamic-module}

Alibaba's Tengine implements a dynamic upstream module:
ngx_http_upstream_dynamic_module | Tengine

The 'fail_timeout' parameter specifies how long time tengine considers the DNS server as unavailiable if a DNS query fails for a server in the upstream. In this period of time, all requests comming will follow what 'fallback' specifies.

A configuration like the following is all that is needed:

    upstream backend {
        dynamic_resolve fallback=stale fail_timeout=30s;

        server a.com;
        server b.com;
    }

    server {
        ...

        location / {
            proxy_pass http://backend;
        }
    }

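This module provides a fallback mechanism. If you are already running Tengine, it is a fairly elegant solution.
Note that starting with Tengine 2.3 the module is no longer built in, so on later versions you may need to recompile to include it.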

Using ngx_upstream_jdomain {#使用-ngx-upstream-jdomain}

ngx_upstream_jdomain | Nginx
ngx_upstream_jdomain | Github
By default this module performs a DNS lookup every second.
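
A configuration for this module might look roughly like the following. The jdomain directive and its port and interval parameters are based on the module's README and should be verified against the version you build; the upstream name alb_backend is made up for the example.

```nginx
http {
    # jdomain resolves names through the resolver configured here
    resolver 8.8.8.8;

    # "alb_backend" is a hypothetical upstream name
    upstream alb_backend {
        # re-resolve every 10 seconds instead of the default 1 second
        jdomain gateway-service-alb-xxx.xxx.elb.amazonaws.com port=80 interval=10;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://alb_backend;
        }
    }
}
```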

Using nginx-upstream-dynamic-servers {#使用-nginx-upstream-dynamic-servers}

nginx-upstream-dynamic-servers
This module resolves the name once at first startup and afterwards re-resolves according to the TTL.

Nginx Plus {#Nginx-Plus}

Nginx Plus is the commercial edition and offers dynamic resolution as a built-in feature:
http-load-balancer
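
With Nginx Plus, the usual pattern is an upstream block with a shared memory zone and the resolve parameter on the server line. A minimal sketch, in which the upstream name alb_backend, the 64k zone size and valid=30s are all invented for illustration; recent open-source releases are reported to accept the same parameter, but check the documentation of the version you run:

```nginx
http {
    resolver 8.8.8.8 valid=30s;

    # "alb_backend" and the 64k zone size are illustrative choices
    upstream alb_backend {
        # resolve requires the upstream group to live in shared memory
        zone alb_backend 64k;
        # the name is re-resolved periodically and changes apply without a reload
        server gateway-service-alb-xxx.xxx.elb.amazonaws.com:80 resolve;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://alb_backend;
        }
    }
}
```

Because proxy_pass contains no variable here, the usual prefix-replacement behaviour of a trailing / is unaffected.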

References {#参考文档}

Tengine Github
ngx_http_upstream_dynamic_module | Tengine
resolver
Module ngx_http_proxy_module
Nginx with dynamic upstreams
NGINX proxy_pass to ELB with Variable
