ECS の Daemon サービスで動かしている Fluentd を Logging Driver の接続先にすると Draining 時にコンテナが終了しないことがある問題とその対策

標題のとおり、ECS の daemon サービスで動かしている Fluentd を logging driver の接続先にすると、draining 時に Docker container with fluentd logging driver never stops if it can’t connect to fluentd · Issue #44511 · moby/moby で言及している問題が起きて、Fluentd が復活しない限り、logging driver で Fluentd を利用しているコンテナが無限に生き続けることになります。
本エントリーでは問題の再現方法と対策について解説します。

Issue についての解説

どうしてコンテナが終了しないかについての自分なりの仮説はコメントに書いたとおりです。
次のように 3 つの goroutine が関係していそうです。

  1. goroutine 1: Fluentd のコネクションにログを書き込む際にロガーのロックを取るが、Fluentd が起動していないためロックを抱えたまま何億回も接続を確立しようとリトライする
  2. goroutine 2: コンテナを終了するためにコンテナのロックを取ってから logging driver を close しようとするが、ロガーのロックが取れず待機
  3. goroutine 3: コンテナを強制終了するために PID を取得しようとするが、その際にコンテナのロックが取れず待機

上記の仮説が正しければ、fluentd-async"true" を指定したり、fluentd-max-retries に小さな値を指定したりすれば、一応コンテナを終了させることはできるんじゃないかと思います。

再現手順

次の設定ファイルを使って terraform apply し、ECS インスタンスに 2 つのタスクが配置された後にこのインスタンスを draining すると再現します。

provider "aws" {
  region = "ap-northeast-1"
}


locals {
  namespace = "moby-issue-44511"
}

variable "key_name" {}

data "aws_vpc" "default" {
  default = true
}


#--------------#
# ECS instance #
#--------------#

resource "aws_ecs_cluster" "main" {
  name = local.namespace
}

resource "aws_iam_instance_profile" "AmazonEC2ContainerServiceforEC2Role" {
  name = "AmazonEC2ContainerServiceforEC2Role"
  role = aws_iam_role.AmazonEC2ContainerServiceforEC2Role.name
}

resource "aws_iam_role" "AmazonEC2ContainerServiceforEC2Role" {
  name = "AmazonEC2ContainerServiceforEC2Role"

  assume_role_policy = <<-EOF
  {
    "Version":"2012-10-17",
    "Statement":[
      {
        "Effect":"Allow",
        "Principal": {
          "Service": [
            "ec2.amazonaws.com"
          ]
        },
        "Action":"sts:AssumeRole"
      }
    ]
  }
  EOF
}

resource "aws_iam_role_policy_attachment" "AmazonEC2ContainerServiceforEC2Role" {
  role       = aws_iam_role.AmazonEC2ContainerServiceforEC2Role.id
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}


resource "aws_security_group" "allow_ssh" {
  name   = "${local.namespace}-allow-ssh"
  vpc_id = data.aws_vpc.default.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_launch_template" "main" {
  name = local.namespace

  image_id = "ami-0fcfb0266308303f6"

  instance_type = "t3.micro"

  iam_instance_profile {
    name = aws_iam_instance_profile.AmazonEC2ContainerServiceforEC2Role.name
  }

  key_name = var.key_name

  network_interfaces {
    associate_public_ip_address = true
    security_groups             = [aws_security_group.allow_ssh.id]
  }

  instance_market_options {
    market_type = "spot"
  }

  user_data = base64encode(<<-DATA
  #!/bin/bash

  cat <<'EOF' >> /etc/ecs/ecs.config
  ECS_CLUSTER=${aws_ecs_cluster.main.name}
  ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","syslog","awslogs","fluentd"]
  ECS_LOGLEVEL=debug
  EOF

  echo '{"debug": true}' >/etc/docker/daemon.json
  systemctl restart docker --no-block
  DATA
  )
}

resource "aws_instance" "main" {
  launch_template {
    id      = aws_launch_template.main.id
    version = "$Latest"
  }

  # Set iam_instance_profile explicitly to avoid the bug that will be fixed
  # by https://github.com/hashicorp/terraform-provider-aws/pull/27972.
  iam_instance_profile = aws_iam_instance_profile.AmazonEC2ContainerServiceforEC2Role.name

  tags = {
    Name = local.namespace
  }

  lifecycle {
    ignore_changes = [
      # Although the user data of the launch template is base64-encoded,
      # terraform-provider-aws cannot know that.
      user_data,
      # The latest version at the time the instance is created, not "$Latest" is set
      launch_template["version"],
    ]
  }
}


#---------#
# fluentd #
#---------#

resource "aws_ecs_task_definition" "fluentd" {
  family                = "${local.namespace}-fluentd"
  container_definitions = <<-JSON
  [
    {
      "name": "fluentd",
      "image": "fluent/fluentd:v1.15.3-1.0",
      "essential": true,
      "memoryReservation": 50,
      "portMappings": [
        { "containerPort": 24224, "hostPort": 24224 }
      ]
    }
  ]
  JSON
}

resource "aws_ecs_service" "fluentd" {
  name                = aws_ecs_task_definition.fluentd.family
  cluster             = aws_ecs_cluster.main.name
  task_definition     = aws_ecs_task_definition.fluentd.arn
  scheduling_strategy = "DAEMON"
}


#-------------#
# application #
#-------------#

resource "aws_ecs_task_definition" "hello" {
  family                = "${local.namespace}-application"
  container_definitions = <<-JSON
  [
    {
      "name": "hello",
      "image": "ubuntu",
      "essential": true,
      "command": ["sh", "-c", "while :; do echo \"$(date +%FT%T): hello\"; sleep 1; done"],
      "stopTimeout": 30,
      "logConfiguration": {
        "logDriver": "fluentd",
        "options": {
          "fluentd-address": "${aws_instance.main.private_ip}"
        }
      },
      "memoryReservation": 50
    }
  ]
  JSON
}

resource "aws_ecs_service" "hello" {
  name            = aws_ecs_task_definition.hello.family
  cluster         = aws_ecs_cluster.main.name
  task_definition = aws_ecs_task_definition.hello.arn
  desired_count   = 1

  deployment_minimum_healthy_percent = 0
}

上記の hello コンテナではシグナルハンドリングをしていないので、少なくとも stopTimeout の 30 秒が経つまでコンテナは終了しません。30 秒経つと docker は強制的にコンテナを終了しようとします。
その様子はログからも確認できます。

$ journalctl -u docker -o json | jq -r .MESSAGE
-- snip --
time="2022-11-23T11:02:52.029418846Z" level=debug msg="Calling POST /v1.21/containers/d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15/stop?t=30"
time="2022-11-23T11:02:52.029494540Z" level=debug msg="Sending kill signal 15 to container d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15"
time="2022-11-23T11:02:59.372092034Z" level=debug msg="Running health check for container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd ..."
time="2022-11-23T11:02:59.372551803Z" level=debug msg="starting exec command 888c6ff102d55ba1b8ce00464569eeb670d418c9d14a27fad966fb381ed78de1 in container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd"
time="2022-11-23T11:02:59.374900092Z" level=debug msg="attach: stdout: begin"
time="2022-11-23T11:02:59.374939402Z" level=debug msg="attach: stderr: begin"
time="2022-11-23T11:02:59.378410573Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-added
time="2022-11-23T11:02:59.729173160Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started
time="2022-11-23T11:02:59.758665471Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T11:02:59.758851692Z" level=debug msg="attach: stdout: end"
time="2022-11-23T11:02:59.758874503Z" level=debug msg="attach: stderr: end"
time="2022-11-23T11:02:59.758902425Z" level=debug msg="attach done"
time="2022-11-23T11:02:59.758927682Z" level=debug msg="Health check for container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd done (exitCode=0)"
time="2022-11-23T11:03:00.189947282Z" level=debug msg="Calling POST /v1.21/containers/e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb/stop?t=30"
time="2022-11-23T11:03:00.190024918Z" level=debug msg="Sending kill signal 15 to container e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb"
time="2022-11-23T11:03:01.541963842Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T11:03:01.562363324Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2022-11-23T11:03:01.562396747Z" level=info msg="ignoring event" container=e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2022-11-23T11:03:01.574352549Z" level=debug msg="Revoking external connectivity on endpoint ecs-moby-issue-44511-fluentd-5-fluentd-8490d7a8bce5fac7dd01 (d03613098372f77002e9c8f4c972358e227f03fa4e0875288e2a7fd2c6344e5a)"
time="2022-11-23T11:03:01.574995334Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C DOCKER -p tcp -d 0/0 --dport 24224 -j DNAT --to-destination 172.17.0.2:24224 ! -i docker0]"
time="2022-11-23T11:03:01.575847232Z" level=debug msg="Calling GET /v1.21/containers/e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb/json"
time="2022-11-23T11:03:01.577553743Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -D DOCKER -p tcp -d 0/0 --dport 24224 -j DNAT --to-destination 172.17.0.2:24224 ! -i docker0]"
time="2022-11-23T11:03:01.581373824Z" level=debug msg="/usr/sbin/iptables, [--wait -t filter -C DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 24224 -j ACCEPT]"
time="2022-11-23T11:03:01.588964441Z" level=debug msg="/usr/sbin/iptables, [--wait -t filter -D DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 24224 -j ACCEPT]"
time="2022-11-23T11:03:01.598819481Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 24224 -j MASQUERADE]"
time="2022-11-23T11:03:01.602651989Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -D POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 24224 -j MASQUERADE]"
time="2022-11-23T11:03:01.606894842Z" level=debug msg="DeleteConntrackEntries purged ipv4:1, ipv6:0"
time="2022-11-23T11:03:01.689072479Z" level=debug msg="Releasing addresses for endpoint ecs-moby-issue-44511-fluentd-5-fluentd-8490d7a8bce5fac7dd01's interface on network bridge"
time="2022-11-23T11:03:01.689111084Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.2)"
time="2022-11-23T11:03:01.689137776Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65531, Sequence: (0xe8000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:5"
time="2022-11-23T11:03:01.796276657Z" level=debug msg="Calling GET /v1.21/containers/e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb/json"
time="2022-11-23T11:03:01.797123648Z" level=debug msg="Calling GET /v1.21/containers/e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb/json"
time="2022-11-23T11:03:22.052640413Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15
time="2022-11-23T11:03:22.052683086Z" level=debug msg="Sending kill signal 9 to container d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15"
time="2022-11-23T11:03:22.125742131Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T11:03:22.160118590Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2022-11-23T11:03:22.160152542Z" level=info msg="ignoring event" container=d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
-- snip --

上記のログでは、コンテナ ID = d3030f944843 が hello コンテナで、e6cec48166c6 が fluentd コンテナです。ログから docker が次のように動作したことがわかります。

  1. 2022-11-23T11:02:52.029494540Z: hello コンテナに SIGTERM (signal 15) を送る
  2. 2022-11-23T11:03:00.190024918Z: fluentd コンテナに SIGTERM (signal 15) を送る
  3. 2022-11-23T11:03:22.052683086Z: hello コンテナに SIGKILL (signal 9) を送る

各タスクの詳細は次のとおりです。

% TZ=UTC aws ecs describe-tasks --cluster moby-issue-44511 \
  --tasks ecb4495fd45343e88e5649a9d2f4eac6 f64e3d5e2f654fbeaac060acf07f7871 \
  | jq '[.tasks[] | { group, containerId: .containers[0].runtimeId, desiredStatus, lastStatus, createdAt, stoppingAt, stoppedAt, stoppedReason, containerStoppedReason: .containers[0].reason }]'
[
  {
    "group": "service:moby-issue-44511-application",
    "containerId": "d3030f9448430fc0be31d74cb4651b0f723043b6645ebc53ae0f84345adfaa15",
    "desiredStatus": "STOPPED",
    "lastStatus": "STOPPED",
    "createdAt": "2022-11-23T11:01:43.145000+00:00",
    "stoppingAt": "2022-11-23T11:02:51.646000+00:00",
    "stoppedAt": "2022-11-23T11:15:28.866000+00:00",
    "stoppedReason": "Service moby-issue-44511-application: container instance is in DRAINING state",
    "containerStoppedReason": "DockerTimeoutError: Could not transition to stopped; timed out after waiting 2m30s"
  },
  {
    "group": "service:moby-issue-44511-fluentd",
    "containerId": "e6cec48166c665fe4bffb9dceac44595953afc7fcd5bb727588ba62ff451bbbb",
    "desiredStatus": "STOPPED",
    "lastStatus": "STOPPED",
    "createdAt": "2022-11-23T11:00:49.831000+00:00",
    "stoppingAt": "2022-11-23T11:02:58.827000+00:00",
    "stoppedAt": "2022-11-23T11:03:01.819000+00:00",
    "stoppedReason": "Servicemoby-issue-44511-fluentd: container instance is in DRAINING state",
    "containerStoppedReason": null
  }
]

上記の結果を見ると、fluentd タスクの方が先に終了していることがわかります。また、application タスクの方も無事停止しているように見えますが、実際はコンテナは生きたままです。

[ec2-user@ip-172-31-16-91 ~]$ date
Wed Nov 23 11:48:08 UTC 2022
[ec2-user@ip-172-31-16-91 ~]$ docker container ls
CONTAINER ID   IMAGE                            COMMAND                  CREATED             STATUS                       PORTS     NAMES
d3030f944843   ubuntu                           "sh -c 'while :; do …"   46 minutes ago      Up 46 minutes                          ecs-moby-issue-44511-application-7-hello-8adde2899ef48c97b401
be5a2c16b818   amazon/amazon-ecs-agent:latest   "/agent"                 About an hour ago   Up About an hour (healthy)             ecs-agent

対策

ECS インスタンスを draining した場合に fluentd 以外のタスクが全て終了してから fluentd を停止させれば良いです。
最初は wrapper program を書こうかと思ったんですが、同じ悩みを抱えている方が別の解決策を取っていたのでそのやり方を採用してみます。

ECSのDAEMONをDRAININGで直ぐに停止しないようにした - Sansan Tech Blog

元々の fluentd のタスク定義は次のようになっていましたが

resource "aws_ecs_task_definition" "fluentd" {
  family                = "${local.namespace}-fluentd"
  container_definitions = <<-JSON
  [
    {
      "name": "fluentd",
      "image": "fluent/fluentd:v1.15.3-1.0",
      "essential": true,
      "memoryReservation": 50,
      "portMappings": [
        { "containerPort": 24224, "hostPort": 24224 }
      ]
    }
  ]
  JSON
}

次のように変更します。

resource "aws_ecs_task_definition" "fluentd" {
  family                = "${local.namespace}-fluentd"
  container_definitions = <<-JSON
  [
    {
      "name": "fluentd",
      "image": "fluent/fluentd:v1.15.3-1.0",
      "essential": true,
      "memoryReservation": 50,
      "portMappings": [
        { "containerPort": 24224, "hostPort": 24224 }
      ]
    },
    {
      "name": "ecs-daemon-protector",
      "image": "iamwaneal/ecs-daemon-protector@sha256:6da000b4bf5e6878dfbad80b8cb82f05a514ff7751f3291841568078df247a7c",
      "dependsOn": [
        {
          "containerName": "fluentd",
          "condition": "START"
        }
      ]
    }
  ]
  JSON
}

また、必要な権限を追加します。ロール名が AmazonEC2ContainerServiceforEC2Role のままなのは気持ち悪いですが、検証用途なので気にしないことにします。

resource "aws_iam_role_policy" "ecs_daemon_protector" {
  name = "ecs-daemon-protector"
  role = aws_iam_role.AmazonEC2ContainerServiceforEC2Role.id

  policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ecs:DescribeContainerInstances",
          "ecs:DescribeServices",
          "ecs:DescribeTasks",
          "ecs:ListContainerInstances",
          "ecs:ListServices",
          "ecs:ListTasks"
        ],
        "Resource": "*"
      }
    ]
  }
  EOF
}

次のログは ECS インスタンスを draining にした時のログです。

% journalctl -u docker -o json | jq -r .MESSAGE
-- snip --
time="2022-11-23T16:32:58.600810655Z" level=debug msg="Calling POST /v1.21/containers/55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3/stop?t=30"
time="2022-11-23T16:32:58.600881633Z" level=debug msg="Sending kill signal 15 to container 55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3"
time="2022-11-23T16:33:00.594474625Z" level=debug msg="Calling POST /v1.21/containers/e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5/stop?t=60"
time="2022-11-23T16:33:00.594537016Z" level=debug msg="Sending kill signal 15 to container e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5"
time="2022-11-23T16:33:02.385822631Z" level=debug msg="Calling GET /v1.21/containers/7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef/stats?stream=1"
time="2022-11-23T16:33:03.454707939Z" level=debug msg="Calling GET /v1.21/containers/e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5/stats?stream=1"
time="2022-11-23T16:33:03.990980151Z" level=debug msg="Calling GET /v1.21/containers/55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3/stats?stream=1"
time="2022-11-23T16:33:11.160810958Z" level=debug msg="Running health check for container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd ..."
time="2022-11-23T16:33:11.160942303Z" level=debug msg="starting exec command ed20873b8284e03235eaed7b82810943e9fc27a07574d48a2c314dc1a038a104 in container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd"
time="2022-11-23T16:33:11.163104354Z" level=debug msg="attach: stdout: begin"
time="2022-11-23T16:33:11.163127192Z" level=debug msg="attach: stderr: begin"
time="2022-11-23T16:33:11.179469927Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-added
time="2022-11-23T16:33:11.656820563Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started
time="2022-11-23T16:33:11.689799328Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T16:33:11.689982652Z" level=debug msg="attach: stdout: end"
time="2022-11-23T16:33:11.689999787Z" level=debug msg="attach: stderr: end"
time="2022-11-23T16:33:11.690028133Z" level=debug msg="attach done"
time="2022-11-23T16:33:11.690054723Z" level=debug msg="Health check for container be5a2c16b818ecd5893d810710cf94c6cc24e15b14ac8966f9d8eb401d12d1fd done (exitCode=0)"
time="2022-11-23T16:33:25.581998420Z" level=debug msg="Calling HEAD /_ping"
time="2022-11-23T16:33:28.624113389Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3
time="2022-11-23T16:33:28.624600967Z" level=debug msg="Sending kill signal 9 to container 55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3"
time="2022-11-23T16:33:28.669807373Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T16:33:28.691661030Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2022-11-23T16:33:28.691693014Z" level=info msg="ignoring event" container=55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2022-11-23T16:33:28.710003493Z" level=debug msg="Calling GET /v1.21/containers/55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3/json"
time="2022-11-23T16:33:28.710214466Z" level=debug msg="Revoking external connectivity on endpoint ecs-moby-issue-44511-application-7-hello-8ef4ebb89fb8ddabe801 (af30a24de76c98f322b732eb8f0607fd3e7dbc1081965d77aa596d4033fa6b67)"
time="2022-11-23T16:33:28.710989827Z" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
time="2022-11-23T16:33:28.801219714Z" level=debug msg="Releasing addresses for endpoint ecs-moby-issue-44511-application-7-hello-8ef4ebb89fb8ddabe801's interface on network bridge"
time="2022-11-23T16:33:28.801255110Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.5)"
time="2022-11-23T16:33:28.801281090Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.5 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65529, Sequence: (0xfc000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:6"
time="2022-11-23T16:33:28.869477226Z" level=debug msg="Calling GET /v1.21/containers/55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3/json"
time="2022-11-23T16:33:28.872855536Z" level=debug msg="Calling GET /v1.21/containers/55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3/json"
time="2022-11-23T16:33:31.394374108Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T16:33:31.407158227Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2022-11-23T16:33:31.407270649Z" level=info msg="ignoring event" container=e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2022-11-23T16:33:31.462495321Z" level=debug msg="Calling GET /v1.21/containers/e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5/json"
time="2022-11-23T16:33:31.462960602Z" level=debug msg="Revoking external connectivity on endpoint ecs-moby-issue-44511-fluentd-9-ecs-daemon-protector-8082e0b9e3a0d8890d00 (fdfa544690ae43c81b0d272b36b61fc2a3e84cde6af79cb567b19d552fe5bc4a)"
time="2022-11-23T16:33:31.464711815Z" level=debug msg="DeleteConntrackEntries purged ipv4:1, ipv6:0"
time="2022-11-23T16:33:31.580839512Z" level=debug msg="Releasing addresses for endpoint ecs-moby-issue-44511-fluentd-9-ecs-daemon-protector-8082e0b9e3a0d8890d00's interface on network bridge"
time="2022-11-23T16:33:31.580878245Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.3)"
time="2022-11-23T16:33:31.580905793Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.3 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65530, Sequence: (0xf8000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:6"
time="2022-11-23T16:33:31.671308358Z" level=debug msg="Calling GET /v1.21/containers/e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5/json"
time="2022-11-23T16:33:31.676579087Z" level=debug msg="Calling GET /v1.21/containers/e08afe646cb171a4366381ae7273e8fd13537f1d2a3677bbf0f8c574a9f4f2a5/json"
time="2022-11-23T16:33:31.685018931Z" level=debug msg="Calling POST /v1.21/containers/7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef/stop?t=30"
time="2022-11-23T16:33:31.685095206Z" level=debug msg="Sending kill signal 15 to container 7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef"
time="2022-11-23T16:33:32.837157890Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2022-11-23T16:33:32.863129855Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2022-11-23T16:33:32.863166595Z" level=info msg="ignoring event" container=7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2022-11-23T16:33:32.895880772Z" level=debug msg="Revoking external connectivity on endpoint ecs-moby-issue-44511-fluentd-9-fluentd-d8fafae7fab5a6f85200 (f202415ea19ff26a858bd28f24d9536837c4283dd7dd7c33ea828cfa3df9e01a)"
time="2022-11-23T16:33:32.901781381Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C DOCKER -p tcp -d 0/0 --dport 24224 -j DNAT --to-destination 172.17.0.2:24224 ! -i docker0]"
time="2022-11-23T16:33:32.901855554Z" level=debug msg="Calling GET /v1.21/containers/7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef/json"
time="2022-11-23T16:33:32.904022305Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -D DOCKER -p tcp -d 0/0 --dport 24224 -j DNAT --to-destination 172.17.0.2:24224 ! -i docker0]"
time="2022-11-23T16:33:32.907646778Z" level=debug msg="/usr/sbin/iptables, [--wait -t filter -C DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 24224 -j ACCEPT]"
time="2022-11-23T16:33:32.911546064Z" level=debug msg="/usr/sbin/iptables, [--wait -t filter -D DOCKER ! -i docker0 -o docker0 -p tcp -d 172.17.0.2 --dport 24224 -j ACCEPT]"
time="2022-11-23T16:33:32.918312060Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -C POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 24224 -j MASQUERADE]"
time="2022-11-23T16:33:32.919867917Z" level=debug msg="/usr/sbin/iptables, [--wait -t nat -D POSTROUTING -p tcp -s 172.17.0.2 -d 172.17.0.2 --dport 24224 -j MASQUERADE]"
time="2022-11-23T16:33:32.922380159Z" level=debug msg="DeleteConntrackEntries purged ipv4:1, ipv6:0"
time="2022-11-23T16:33:33.036928666Z" level=debug msg="Releasing addresses for endpoint ecs-moby-issue-44511-fluentd-9-fluentd-d8fafae7fab5a6f85200's interface on network bridge"
time="2022-11-23T16:33:33.036969823Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.2)"
time="2022-11-23T16:33:33.036995247Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65531, Sequence: (0xe8000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:6"
time="2022-11-23T16:33:33.118298509Z" level=debug msg="Calling GET /v1.21/containers/7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef/json"
time="2022-11-23T16:33:33.118306244Z" level=debug msg="Calling GET /v1.21/containers/7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef/json"

上記のログでは、コンテナ ID = 55add885ec06 が hello コンテナで、7d8cbbe6f7e0 が fluentd コンテナ、e08afe646cb1 が ecs-daemon-protector です。ログから docker が次のように動作したことがわかります。

  1. 2022-11-23T16:32:58.600881633Z: hello コンテナに SIGTERM (signal 15) を送る
  2. 2022-11-23T16:33:00.594537016Z: ecs-daemon-protector コンテナに SIGTERM (signal 15) を送る
  3. 2022-11-23T16:33:28.624600967Z: hello コンテナに SIGKILL (signal 9) を送る
  4. 2022-11-23T16:33:31.685095206Z: fluentd コンテナに SIGTERM (signal 15) を送る

各タスクの詳細は次のとおりです。application タスクの後に fluentd タスクが終了しているので期待どおりですね。

% TZ=UTC aws ecs describe-tasks --cluster moby-issue-44511 \
  --tasks 3c9d1bfca2a04095bfc58f52f5154955 45e8d21609834e2da2d6ad0a22d51de3 \
  | jq '[.tasks[] | { group, containerId: .containers[0].runtimeId, desiredStatus, lastStatus, createdAt, stoppingAt, stoppedAt, stoppedReason, containerStoppedReason: .containers[0].reason }]'
[
  {
    "group": "service:moby-issue-44511-application",
    "containerId": "55add885ec06b5da86e1a01302bee4b1c95bdc53a74e5fe268b30a05c86462a3",
    "desiredStatus": "STOPPED",
    "lastStatus": "STOPPED",
    "createdAt": "2022-11-23T16:31:45.322000+00:00",
    "stoppingAt": "2022-11-23T16:32:57.071000+00:00",
    "stoppedAt": "2022-11-23T16:33:28.885000+00:00",
    "stoppedReason": "Service moby-issue-44511-application: container instance is in DRAINING state",
    "containerStoppedReason": null
  },
  {
    "group": "service:moby-issue-44511-fluentd",
    "containerId": "7d8cbbe6f7e056f4fb677633c10d3bfed31749f7563006b7fe5100ece2bf51ef",
    "desiredStatus": "STOPPED",
    "lastStatus": "STOPPED",
    "createdAt": "2022-11-23T16:31:43.798000+00:00",
    "stoppingAt": "2022-11-23T16:32:59.612000+00:00",
    "stoppedAt": "2022-11-23T16:33:33.122000+00:00",
    "stoppedReason": "Servicemoby-issue-44511-fluentd: container instance is in DRAINING state",
    "containerStoppedReason": null
  }
]

補足

前述の対策は [ECS] [request]: Tasks scheduled with the DAEMON strategy should not be stopped when container instance is placed into DRAINING · Issue #128 · aws/containers-roadmap が解消することで不要になる予定だったと思われます。

Improving daemon services in Amazon ECS にも次のように、daemon タスクは最後に終了するようにしたと言及されています。

In November of 2020, we added an update to the ECS scheduler to drain daemon tasks last on a host. This ensures that when an instance is set to drain, ECS will terminate all other tasks before terminating the daemon task.

ところが、どうも次のようなことが今でも起きるようです。

  1. ECS サービスのタスクが DEACTIVATING になる
  2. ECS サービスのタスクが STOPPING になる
    • この時点で stopsignal が送られ、ECS サービスに属していたタスクはサービスから外れたものと認識されるのかもしれない
    • タスク自体はまだ終了したわけではない
  3. Daemon サービスのタスクが STOPPING になる
    • この時点で stopsignal が送られる
    • ECS サービスに属している running タスクがある間は終了しないが、ECS サービスに属していない running タスクのことは考慮しないらしい
  4. Daemon サービスのタスクが STOPPED になる
  5. ECS サービスに属していたタスクが STOPPED になる

DEACTIVATING, STOPPING 等の state については Task lifecycle を参照してください。

regression が起きているのでは?と投げかけている issue があるので、もしかしたらどこかのタイミングで regression があったのかもしれません。
cf. [ECS] [bug]: daemon task should be last to be stopped when instance is draining · Issue #1700 · aws/containers-roadmap