โŒ

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

How to get an alarm when there are no logs for a time period in AWS CloudWatch?

I have a Java application that runs in AWS Elastic Container Service. The application polls a queue periodically. Sometimes there is no response from the queue and the application hangs forever. I have enclosed the methods in try-catch blocks and I log exceptions, but after the hang there are no logs in CloudWatch at all - no exceptions or errors. Is there a way to detect this situation (no logs in CloudWatch), similar to filtering for an error log pattern, so that I can restart the service? Any trick or solution would be appreciated.

public void handleProcess() {
    try {
        while(true) {
            Response response = QueueUitils.pollQueue(); // poll the queue
            QueueUitils.processMessage(response);
            TimeUnit.SECONDS.sleep(WAIT_TIME); // WAIT_TIME = 20
        }
    } catch (Exception e) {
        LOGGER.error("Data Queue operation failed: " + e.getMessage(), e); // include the stack trace, not just the message
        throw e;
    }
}
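
For reference, one way to catch the "no logs at all" case is to alarm on the log group's IncomingLogEvents metric and treat missing data as breaching, since a silent application produces no datapoints at all. A minimal sketch, assuming a hypothetical log group name, SNS topic, and thresholds:

    # Alarm when the log group receives no log events for 3 consecutive 5-minute periods.
    # /ecs/my-service and the SNS topic ARN are placeholders for your actual resources.
    aws cloudwatch put-metric-alarm \
      --alarm-name "no-logs-from-my-service" \
      --namespace "AWS/Logs" \
      --metric-name "IncomingLogEvents" \
      --dimensions Name=LogGroupName,Value=/ecs/my-service \
      --statistic Sum \
      --period 300 \
      --evaluation-periods 3 \
      --threshold 1 \
      --comparison-operator LessThanThreshold \
      --treat-missing-data breaching \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:restart-my-service

The alarm action can then notify a topic or drive automation (for example, an EventBridge rule or a Lambda that forces a new deployment of the service).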

No GPU EC2 instances associated with AWS Batch

I need to set up GPU-backed instances on AWS Batch.

Here's my .yaml file:

  GPULargeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64:
            Fn::Sub: |
              MIME-Version: 1.0
              Content-Type: multipart/mixed; boundary="==BOUNDARY=="

              --==BOUNDARY==
              Content-Type: text/cloud-config; charset="us-ascii"

              runcmd:
                - yum install -y aws-cfn-bootstrap
                - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
                - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
                - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
                - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
                - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
                - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
                - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
                - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
                - /usr/bin/docker-storage-setup
                - yum update -y
                - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
                - /etc/init.d/docker restart

              --==BOUNDARY==--
      LaunchTemplateName: GPULargeLaunchTemplate

  GPULargeBatchComputeEnvironment:
    DependsOn:
      - ComputeRole
      - ComputeInstanceProfile
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ComputeResources:
        ImageId: ami-GPU-optimized-AMI-ID
        AllocationStrategy: BEST_FIT_PROGRESSIVE
        LaunchTemplate:
          LaunchTemplateId:
            Ref: GPULargeLaunchTemplate
          Version:
            Fn::GetAtt:
              - GPULargeLaunchTemplate
              - LatestVersionNumber
        InstanceRole:
          Ref: ComputeInstanceProfile
        InstanceTypes:
          - g4dn.xlarge
        MaxvCpus: 768
        MinvCpus: 1
        SecurityGroupIds:
          - Fn::GetAtt:
              - ComputeSecurityGroup
              - GroupId
        Subnets:
          - Ref: ComputePrivateSubnetA
        Type: EC2
        UpdateToLatestImageVersion: True

  MyGPUBatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment:
            Ref: GPULargeBatchComputeEnvironment
          Order: 1
      Priority: 5
      JobQueueName: MyGPUBatchJobQueue
      State: ENABLED

  MyGPUJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      ContainerProperties:
        Command:
          - "/opt/bin/python3"
          - "/opt/bin/start.py"
          - "--retry_count"
          - "Ref::batchRetryCount"
          - "--retry_limit"
          - "Ref::batchRetryLimit"
        Environment:
          - Name: "Region"
            Value: "us-west-2"
          - Name: "LANG"
            Value: "en_US.UTF-8"
        Image:
          Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
        JobRoleArn:
          Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
        Memory: 16000
        Vcpus: 1
        ResourceRequirements:
          - Type: GPU
            Value: '1'
      JobDefinitionName: MyGPUJobDefinition
      Timeout:
        AttemptDurationSeconds: 500

When I start a job, it is stuck in the RUNNABLE state forever, so I did the following:

  1. When I swapped the instance type to normal CPU types, redeployed the CF stack, and submitted a job, the job ran and succeeded fine, so something must be missing/wrong with the way I use these GPU instance types on AWS Batch;
  2. Then I found this post, so I added an ImageId field in my ComputeEnvironment with a known GPU-optimized AMI, but still no luck;
  3. I did a side-by-side comparison of the jobs between the working CPU AWS Batch and the non-working GPU AWS Batch by running aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2, and found that the containerInstanceArn and taskArn fields are simply missing from the non-working GPU job.
  4. I found that the GPU instance is in the ASG (Auto Scaling Group) created by the Compute Environment, but when I go to ECS and choose this GPU cluster, there are no container instances associated with it, unlike the working CPU one, where the ECS cluster has container instances registered (see the diagnostic sketch below).

Any ideas how to fix this would be greatly appreciated!
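
For completeness, this is roughly how I'm checking whether the GPU instance ever registers with the Batch-managed ECS cluster (the compute environment name and region are the ones from the template; the cluster ARN placeholder has to come from the first command):

    # Find the ECS cluster that Batch manages for this compute environment
    aws batch describe-compute-environments \
      --compute-environments GPULargeBatchComputeEnvironment \
      --region us-west-2 \
      --query 'computeEnvironments[].ecsClusterArn'

    # List container instances registered in that cluster (empty in the GPU case)
    aws ecs list-container-instances \
      --cluster <ecsClusterArn-from-above> \
      --region us-west-2

    # On the instance itself (via SSM or SSH), the ECS agent log usually explains why
    # registration did not happen, and nvidia-smi confirms the GPU drivers are present:
    #   cat /var/log/ecs/ecs-agent.log
    #   nvidia-smi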

Why should I use AWS FireLens when I want to send logs to an Elasticsearch endpoint?

So I was new to ECS Fargate. I was trying to send logs from an ECS Fargate application to an Elasticsearch endpoint. Everyone here seems to be using AWS FireLens with the AWS Fluent Bit image. We already had Filebeat configured from when we were previously running our application on an EC2 instance, but from ECS Fargate it seems we can't use Filebeat. I was not able to find any docs to refer to. I just wanted to know if it's even possible.

Also, do I need to use FireLens if I use Filebeat? Currently it seems FireLens only supports Fluent Bit and Fluentd.

I was using the task definition below, but it was not ingesting logs.

        {
        "family": "fargate-poc",
        "containerDefinitions": [
            {
                "name": "cservice",
                "image": "******.dkr.ecr.us-east-1.amazonaws.com/service:2b1bb47",
                "cpu": 512,
                "portMappings": [
                    {
                        "name": "service-8080-tcp",
                        "containerPort": 8080,
                        "hostPort": 8080,
                        "protocol": "tcp"
                    }
                ],
                "essential": true,
                "environment": [
                    {
                        "name": "name_env",
                        "value": "egggggrgggggf"
                    },
                    {
                        "name": "JAVA_OPTS",
                        "value": "-XshowSettings:vm -Xmx1g -Xms1g"
                    },
                    {
                        "name": "SPRING_PROFILES_ACTIVE",
                        "value": "gggggg"
                    }
                ],
                "mountPoints": [
                    {
                        "sourceVolume": "logs",
                        "containerPath": "/srv/wps-*/logs"
                    }
                ],
                "volumesFrom": [],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-create-group": "true",
                        "awslogs-group": "/ecs/service-poc",
                        "awslogs-region": "us-east-1",
                        "awslogs-stream-prefix": "service"
                    },
                    "secretOptions": []
                }
            },
            {
                "name": "filebeat",
                "image": "*******.dkr.ecr.us-east-1.amazonaws.com/filebeat-non-prod:latest",
                "cpu": 256,
                "memory": 256,
                "portMappings": [],
                "essential": true,
                "environment": [],
"command": [
                "/bin/bash",
                "-c",
                "aws s3 cp s3://ilebeat/filebeat-fargate.yml /etc/filebeat/filebeat.yml && filebeat -e -c /etc/filebeat/filebeat.yml"
            ],
                "mountPoints": [
                    {
                        "sourceVolume": "logs",
                        "containerPath": "/usr/share/filebeat/logs"
                    }
                ],
                "volumesFrom": [
                    {
                        "sourceContainer": "service",
                        "readOnly": false
                    }
                ],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-create-group": "true",
                        "awslogs-group": "/ecs/service-poc",
                        "awslogs-region": "us-east-1",
                        "awslogs-stream-prefix": "filebeat"
                    },
                    "secretOptions": []
                }
            }
        ],
        "taskRoleArn": "arn:aws:iam::******:role/fargate-poc-task-role",
        "executionRoleArn": "arn:aws:iam::****:role/fargate-poc-task-role",
        "networkMode": "awsvpc",
        "volumes": [
            {
                "name": "logs",
                "host": {}
            }
        ],
        "requiresCompatibilities": [
            "FARGATE"
        ],
        "cpu": "1024",
        "memory": "2048"
    }
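
For comparison, the FireLens pattern everyone refers to would replace the filebeat sidecar with a Fluent Bit log router and ship the app container's stdout/stderr straight to Elasticsearch. A rough sketch of the containerDefinitions part, where the Elasticsearch host, index, and the public Fluent Bit image are assumptions rather than values from my setup:

    "containerDefinitions": [
        {
            "name": "cservice",
            "image": "******.dkr.ecr.us-east-1.amazonaws.com/service:2b1bb47",
            "essential": true,
            "logConfiguration": {
                "logDriver": "awsfirelens",
                "options": {
                    "Name": "es",
                    "Host": "my-domain.us-east-1.es.amazonaws.com",
                    "Port": "443",
                    "Index": "fargate-poc",
                    "tls": "On",
                    "Aws_Auth": "On",
                    "Aws_Region": "us-east-1"
                }
            }
        },
        {
            "name": "log_router",
            "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
            "essential": true,
            "firelensConfiguration": {
                "type": "fluentbit"
            },
            "memoryReservation": 50
        }
    ]

Note that this route only picks up what the application writes to stdout/stderr; logs written to files under /srv/wps-*/logs would still need a sidecar (or the app reconfigured to log to stdout).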

Thanks

How to provide the same env var with different values to an ECS container?

I have to provide API_URL as an environment variable to an ECS container. I created a Task Definition and provided the API_URL environment variable as http://production.api.com.

It's working fine, but now we have to create a staging environment and need to pass a different value for API_URL.

I was following the https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html docs.

I get the idea that we have to create different task definitions for production and staging.

Is there any other way to use the same task definition and pass the API_URL value differently?
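
For what it's worth, one route that works with a single task definition is a container override at launch time, although it only applies to standalone tasks started with run-task (a service always uses what is in, or referenced by, the task definition). A minimal sketch with placeholder cluster, task definition, and container names:

    # Run the same task definition, overriding API_URL per environment.
    aws ecs run-task \
      --cluster staging-cluster \
      --task-definition my-app:12 \
      --overrides '{
        "containerOverrides": [
          {
            "name": "app",
            "environment": [
              { "name": "API_URL", "value": "http://staging.api.com" }
            ]
          }
        ]
      }'

The environmentFiles approach from the linked docs is similar in spirit: the values live in an env file in S3 rather than in the image, but the file's ARN is still part of the task definition, so separate task definitions (or revisions) per environment remain the usual pattern for services.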

ECS Service Connect DNS Resolution

I am trying to get ECS Service Connect to work for our services, but for some reason I'm just not able to get it to work. What we have in place:

  • One service (admin) needs to get some data from another service (metrics).
  • Both services are deployed on ECS
  • The containers in both services are using the awsvpc network mode
  • Service Connect has been configured for both services

Service Connect for Metrics Service

Service Connect for Admin Service

Both services are in the same cluster and use the same Service Connect namespace (service-discovery-cluster). This is reflected in the Cloud Map entry for the namespace.

CloudMap entries for the services

So far, all of this was as expected, and the services are accessible via an ALB and working fine. However, I would have expected that the Admin service could call the Metrics service at http://metrics.service-discovery-cluster (the recommended http://service-dns.cluster pattern), but this name cannot be resolved by the HTTP library.

I know I am missing something very basic somewhere but just can't work it out.

Could someone help out, please?

Other details:

  1. I am using Unirest with Java (I don't think that should matter, but still)
  2. All tasks have awsvpc networking mode
  3. Exact error: java.net.UnknownHostException: metrics.service-discovery-cluster: Name or service not known

I tried using ports for the services (e.g., http://metrics.service-discovery-cluster:8084), but that didn't work either.
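
In case the configuration shape matters, this is roughly what I believe the two services need; the cluster name, port name, and port are placeholders for my actual values (the port name has to match a named portMapping in the metrics task definition):

    # Server side (metrics): publish the alias that clients should call.
    aws ecs update-service \
      --cluster my-cluster \
      --service metrics \
      --service-connect-configuration '{
        "enabled": true,
        "namespace": "service-discovery-cluster",
        "services": [
          {
            "portName": "metrics-port",
            "discoveryName": "metrics",
            "clientAliases": [
              { "port": 8084, "dnsName": "metrics.service-discovery-cluster" }
            ]
          }
        ]
      }'

    # Client side (admin): enabling Service Connect with just the namespace adds the
    # Service Connect proxy to the admin tasks, which is what resolves the alias.
    aws ecs update-service \
      --cluster my-cluster \
      --service admin \
      --service-connect-configuration '{
        "enabled": true,
        "namespace": "service-discovery-cluster"
      }'

From what I understand, the alias only resolves from tasks launched after Service Connect was enabled (i.e. after a fresh deployment), and only from inside the namespace; it will not resolve from outside, e.g. from a local machine.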

AWS Fargate ECS - Can I use one container for multiple tasks?

I have a frontend application and a backend which processes data and calculates metrics based on user input from the frontend. The backend is dockerized, and every time a user wants to compute something, a new task is created with overridden container parameters. It can happen that 10 users are computing something at once, so 10 tasks are spawned on the same container. I am not sure if I can do that. Do the tasks share the container, or do they always run it separately? The computation can take up to 2 hours, so I cannot use Lambda. In the documentation I read that tasks should not share resources, but it is not really transparent. Thank you!

I tried multiple different architectures. This seems to be the fastest; I am just worried about whether I can use one container like this across multiple tasks.
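
For clarity, the task-per-computation pattern I mean is roughly this (cluster, subnet, task definition, and command are illustrative); each call launches a separate task, and each task runs its own container(s) from the image, so tasks don't share a running container, only the image and the task definition:

    # Each computation request launches its own task; tasks never share a running
    # container, only the image and the task definition they were created from.
    aws ecs run-task \
      --cluster compute-cluster \
      --launch-type FARGATE \
      --task-definition metrics-backend:7 \
      --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],assignPublicIp=DISABLED}' \
      --overrides '{
        "containerOverrides": [
          {
            "name": "backend",
            "command": ["python", "compute.py", "--job-id", "job-42"]
          }
        ]
      }'

So ten concurrent users simply mean ten independent Fargate tasks, each with its own CPU/memory allocation.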

Can't connect to Fargate task with ECS Execute command even though all permissions are set

I'm having trouble connecting to a Fargate container with the ECS Execute command, and it gives the following error:

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

I've made sure I have the right permissions and setup by using ecs-checker, and I'm connecting to it using the following command:

aws ecs execute-command --cluster {cluster-name} --task {task_id} --container {container name} --interactive --command "/bin/bash"

I've noticed that this usually happens when you don't have the necessary permissions, but as I've pointed out above, I've already checked with ecs-checker.sh, and here is the output from it:

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.13.4 Python/3.11.4 Darwin/22.4.0 source/arm64 prompt/off)
  Session Manager Plugin | OK (1.2.463.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : eu-west-2
Cluster: cluster
Task   : 47e51750712a4e1c832dd996c878f38a
-------------------------------------------------------------
  Cluster Configuration  | Audit Logging Not Configured
  Can I ExecuteCommand?  | arn:aws:iam::290319421751:role/aws-reserved/sso.amazonaws.com/eu-west-2/AWSReservedSSO_PowerUserAccess_01a9cfdb5ba4af7f
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "WebApp"
    ----------
      Init Process Enabled (WebAppTaskDefinition:49)
    ----------
         1. Enabled - "WebApp"
    ----------
      Read-Only Root Filesystem (WebAppTaskDefinition:49)
    ----------
         1. Disabled - "WebApp"
  Task Role Permissions  | arn:aws:iam::290319421751:role/task-role
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-11122233444:
      - com.amazonaws.eu-west-2.monitoring
      - com.amazonaws.eu-west-2.ssmmessages
  Environment Variables  | (WebAppTaskDefinition:49)
       1. container "WebApp"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined

What is weird about this situation is that the service is deployed to 4 environments and it works on all of them except one. They are all the same resources, since the clusters are created from a CloudFormation template. The image deployed is also the same in all 4 environments.

Any ideas on what could cause this?
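
Since the checker output looks clean, the comparison I'm running between the working and the failing environment is along these lines (the cluster name, task ID, and VPC ID are the ones from the output above):

    # The ExecuteCommandAgent inside the task has to be RUNNING; compare across environments.
    aws ecs describe-tasks \
      --cluster cluster \
      --tasks 47e51750712a4e1c832dd996c878f38a \
      --region eu-west-2 \
      --query 'tasks[].containers[].managedAgents'

    # Tasks in private subnets need a working path to ssmmessages (NAT or an interface
    # endpoint); compare the endpoints and the security groups attached to them.
    aws ec2 describe-vpc-endpoints \
      --region eu-west-2 \
      --filters Name=vpc-id,Values=vpc-11122233444 \
      --query 'VpcEndpoints[].[ServiceName,State,Groups[].GroupId]'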

Disable Core File Dumps on Docker Image/Container Hosted on AWS ECS/Fargate

My Docker container runs a Python app (a backend API) that lets users upload various documents, mostly PDFs. I think that due to the PDF/file uploads, the container keeps creating core dump files, as shown below: Screenshot_corefiles

This slows down the container, and eventually the container crashes! I used a similar question asked on Stack Overflow (how-to-disable-core-file-dumps-in-docker-container), but that solution seems to apply to docker run on a local PC. How can I fix this in a production environment?

The container runs ubuntu:22.04.

Below is my Dockerfile config:

FROM python:3.9

RUN mkdir /code

WORKDIR /code

COPY requirements.txt .

RUN pip install -r requirements.txt

# Download the pandoc deb file
RUN apt-get update && apt-get install -y wget
RUN wget https://github.com/jgm/pandoc/releases/download/3.1.2/pandoc-3.1.2-1-amd64.deb

# Install the downloaded deb file
RUN dpkg -i pandoc-3.1.2-1-amd64.deb

COPY . .

CMD ["gunicorn", "-w", "17", "-k", "uvicorn.workers.UvicornWorker", "--timeout", "120", "main:app", "-b", "0.0.0.0:80"]

I also use a task definition to deploy my container:

{
    "taskDefinitionArn": "arn:aws:ecs:us-west-2:$ARN:task-definition/a$task-def:30",
    "containerDefinitions": [
        {
            "name": "$NAME",
            "image": "$ARN.dkr.ecr.us-west-2.amazonaws.com/$IMAGE-NAME:22636912fe7ab73cf3bd23bdb3d88d317d00b272",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "$CONTAINER_NAME-80-tcp",
                    "containerPort": 80,
                    "hostPort": 80,
                    "protocol": "tcp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "environment": [],
            "environmentFiles": [
                {
                    "value": "arn:aws:s3:::$S3_Resource",
                    "type": "s3"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/$LOG_Group",
                    "awslogs-region": "us-west-2",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "family": "$LOG_FAMILY",
    "taskRoleArn": "arn:aws:iam::$ARN:role/ecsTaskExecutionRole",
    "executionRoleArn": "arn:aws:iam::$ARN:role/ecsTaskExecutionRole",
    "networkMode": "awsvpc",
    "revision": 30,
    "volumes": [
        {
            "name": "new",
            "host": {}
        }
    ],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "ecs.capability.env-files.s3"
        },
        {
            "name": "ecs.capability.increased-task-cpu-limit"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "ecs.capability.extensible-ephemeral-storage"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "8192",
    "memory": "24576",
    "ephemeralStorage": {
        "sizeInGiB": 200
    },
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "registeredAt": "2023-07-30T20:49:22.769Z",
    "registeredBy": "arn:aws:sts::$ARN:assumed-role/github/github",
    "tags": []
}

GitHub Actions script:

name: Deploy Document-Management-service To Amazon ECS

on:
  push:
    branches:
      - "main"

env:
  AWS_REGION:                  # set this to preferred AWS region, e.g. us-west-1
  ECR_REPOSITORY:      # set this to your Amazon ECR repository name
  ECS_SERVICE:         # set this to your Amazon ECS service name
  ECS_CLUSTER:         # set this to your Amazon ECS cluster name
  ECS_TASK_DEFINITION: .github/workflows/main-task-definition.json      # set this to the path to your Amazon ECS task definition                                           # file, e.g. .aws/task-definition.json
  CONTAINER_NAME:            # set this to the name of the container in the
                                               # containerDefinitions section of your task definition

permissions:
  id-token: write
  contents: read # This is required for actions/checkout@v2

jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
    environment: production

    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        role-to-assume: ${{ secrets.AWS_ARN }} #AWS ARN With IAM Role
        role-session-name: github
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1

    - name: Build, Push, Tag and Deploy Container to ECR.
      id: build-image
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG --ulimit core=0 .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"
    - name: Fill in the new image ID in the Amazon ECS task definition
      id: task-def
      uses: aws-actions/amazon-ecs-render-task-definition@v1
      with:
        task-definition: ${{ env.ECS_TASK_DEFINITION }}
        container-name: ${{ env.CONTAINER_NAME }}
        image: ${{ steps.build-image.outputs.image }}

    - name: Deploy Amazon ECS task definition
      uses: aws-actions/amazon-ecs-deploy-task-definition@v1
      with:
        task-definition: ${{ steps.task-def.outputs.task-definition }}
        service: ${{ env.ECS_SERVICE }}
        cluster: ${{ env.ECS_CLUSTER }}
        wait-for-service-stability: true

I tried adding --ulimit core=0 to my GitHub Actions script, so the build step looked like this: docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG --ulimit core=0 .

But apparently, as I realized, that flag is meant to be used with a docker run command (and on Fargate I don't control docker run).

So is there any way to disable core file dumps in a production environment?
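
One approach that doesn't depend on docker run flags at all is to lower the core-file soft limit inside the container, since a process can always lower its own soft limit without extra privileges. A minimal sketch against the existing Dockerfile, keeping the gunicorn options unchanged:

    # Disable core dumps from inside the container: "ulimit -c 0" lowers the soft limit
    # before gunicorn starts, so no docker run flags or host settings are needed.
    CMD ["/bin/sh", "-c", "ulimit -c 0 && exec gunicorn -w 17 -k uvicorn.workers.UvicornWorker --timeout 120 main:app -b 0.0.0.0:80"]

ECS container definitions also have a ulimits parameter (name/softLimit/hardLimit); whether a core limit set there is honored on Fargate is something I'd verify against the current docs, but the in-container approach above works regardless of how the task is launched.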

Why is my ECS cluster with AutoScaling group as capacity provider not working?

No Container Instances were found in your capacity provider

I want to use an Auto Scaling group as the capacity provider for an ECS cluster. Even though I just want one container per container instance, I chose awsvpc as the network mode of my task definition. In other templates I create the Auto Scaling group with a launch template (in private subnets with NAT), a load balancer, and a target group.

  • I chose 'ip' as the target type in the TargetGroup because of the awsvpc mode in my task definition,

  • of course, the target group is NOT associated with my Auto Scaling group,

  • I'm using an ECS-optimized AMI,

  • I haven't added user data to my launch template.

Still, when I try to create my service in the cluster, an error shows: 'No Container Instances were found in your capacity provider'.

What could it be? I'm not sure if it has to do with policies, roles, and such.

I've read that some people add user data to the launch template, but I'm not sure that's a solution for me. I want an Auto Scaling group as a capacity provider, not a single server (a minimal user data sketch follows below).
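
For reference, the user data people add is usually just the one line that tells the ECS agent which cluster to register with; without it, instances launched from the template try to join the cluster named default and never appear as container instances in the new cluster. A minimal sketch, with the cluster name as a placeholder:

    #!/bin/bash
    # Register this instance with the intended ECS cluster instead of "default".
    echo "ECS_CLUSTER=my-ecs-cluster" >> /etc/ecs/ecs.config

This doesn't turn the Auto Scaling group into a single server; the launch template applies it to every instance the group launches, so it stays compatible with using the group as a capacity provider.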

โŒ
โŒ