Problem statement
Our customer uses Customizations for AWS Control Tower for account vending. Each new account in a specific organizational unit should get a set of baseline resources, for example, IAM roles, a VPC with all networking components, and an ECS cluster for further application deployment. ECS cluster creation requires a service-linked role, which has to be created explicitly when CloudFormation is used. So a native CloudFormation feature, the DependsOn attribute, was used to enforce a strict order of resource creation.
This is the initial CloudFormation template:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod
Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com
  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn:
      - FargateClusterRole
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]
If the service-linked role did not already exist, the stack sometimes failed. The root cause is the following. CloudFormation sends an API call to AWS to create the service-linked role and receives a successful response. But if we look for the role in the IAM console at that same moment, it is not always displayed yet. It is not obvious, and not everyone knows it, but changes to IAM configuration can take some time to propagate.
As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any change that you make in IAM (or other AWS services), including tags used in attribute-based access control (ABAC), takes time to become visible from all possible endpoints. Some of the delay results from the time it takes to send the data from server to server, from replication zone to replication zone, and from Region to Region around the world. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out.
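The effect described above can be illustrated with a toy model. This is a minimal sketch, not IAM itself: the class names EventuallyConsistentStore and FakeClock are hypothetical, the 15-second propagation delay is an invented figure (not a documented IAM value), and the role name is used only as a label.

```python
class FakeClock:
    """Manually advanced clock, so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class EventuallyConsistentStore:
    """Toy model: a write is acknowledged at once but readable only after a delay."""

    def __init__(self, propagation_delay, clock):
        self.propagation_delay = propagation_delay  # assumption: fixed delay, in seconds
        self.clock = clock                          # callable returning the current time
        self._visible_at = {}                       # role name -> time it becomes readable

    def create_role(self, name):
        self._visible_at[name] = self.clock() + self.propagation_delay
        return "created"  # success is reported before the role can be read back

    def role_exists(self, name):
        visible_at = self._visible_at.get(name)
        return visible_at is not None and self.clock() >= visible_at


clock = FakeClock()
store = EventuallyConsistentStore(propagation_delay=15, clock=clock)

status = store.create_role("AWSServiceRoleForECS")
immediately = store.role_exists("AWSServiceRoleForECS")  # read right after the write
clock.sleep(20)                                          # wait longer than the delay
after_sleep = store.role_exists("AWSServiceRoleForECS")
```

In this model the create call reports success while an immediate read still misses the role, and the same read succeeds once the wait exceeds the propagation delay. A dependent resource created right after the role can run into the same gap.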
So, as a workaround, we had to implement a «sleep» step between the creation of the service-linked role and the ECS cluster itself, to give the changes time to propagate and make the stack deploy reliably.
Proposed solution
Unfortunately, something as simple as a «sleep» delay is not available in CloudFormation as of this writing. So we had a couple of options.
The first idea was to create the service-linked role in an earlier step of account vending, for example, during VPC creation, but this is not a logically clean solution. The service-linked role belongs to the ECS stack, so, ideally, it should be created within it.
The second idea was to use a CloudFormation custom resource backed by a Lambda function, where we can implement whatever we need, including a «sleep» timeout.
This is the new CloudFormation template:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod
Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com
  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn:
      - Delay
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]
  Delay:
    Type: 'Custom::Delay'
    DependsOn:
      - FargateClusterRole
    Properties:
      ServiceToken: !GetAtt DelayFunction.Arn
      TimeToWait: 20
  ### Custom resource for Delay (sleep), that is natively absent in CloudFormation
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "lambda-logs"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource:
                  - "arn:aws:logs:*:*:*"
  DelayFunction:
    Type: 'AWS::Lambda::Function'
    Properties:
      Handler: "index.handler"
      Timeout: 120
      Role: !GetAtt 'LambdaRole.Arn'
      Runtime: python3.10
      Code:
        ZipFile: |
          import json
          import cfnresponse
          import time
          def handler(event, context):
            time_to_wait = int(event['ResourceProperties']['TimeToWait'])
            print('wait started')
            time.sleep(time_to_wait)
            responseData = {}
            responseData['Data'] = "wait complete"
            print("wait completed")
            cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData)
As a result, we have a couple of new blocks in the CloudFormation template that could be replaced by a single parameter if CloudFormation supported it. Such a feature has been requested since 2020, but it is still absent as native CloudFormation functionality. Until it is added, we can work around this limitation with custom Lambda-backed resources.
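The inline handler in the template sleeps on every request type, including Delete, and has no error path, so an exception raised before the response is sent can leave the stack operation waiting until it times out. One possible hardening is sketched below. It is only a sketch under stated assumptions: the names seconds_to_wait, handle and MAX_WAIT_SECONDS are hypothetical, the 100-second cap is chosen only to stay below the function's 120-second Timeout, and the response sender is injected so that the logic can run outside Lambda.

```python
import time

# Assumption: cap the wait so it always fits inside the function's 120-second Timeout.
MAX_WAIT_SECONDS = 100


def seconds_to_wait(event):
    """Return how long to sleep for a given custom resource event."""
    # Removing the Delay resource does not need to wait for anything.
    if event.get("RequestType") == "Delete":
        return 0
    # CloudFormation may pass property values as strings, hence int().
    requested = int(event["ResourceProperties"]["TimeToWait"])
    return max(0, min(requested, MAX_WAIT_SECONDS))


def handle(event, context, send, sleep=time.sleep):
    """Sleep, then always report a status through the injected `send` callable.

    Inside Lambda, `send` would be cfnresponse.send; the strings "SUCCESS" and
    "FAILED" match the values of cfnresponse.SUCCESS and cfnresponse.FAILED.
    """
    try:
        sleep(seconds_to_wait(event))
        send(event, context, "SUCCESS", {"Data": "wait complete"})
    except Exception as exc:
        # Report the failure instead of leaving CloudFormation without a response.
        send(event, context, "FAILED", {"Data": str(exc)})
```

In the template, the Lambda entry point would then reduce to a call such as handle(event, context, cfnresponse.send).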
Conclusion
In this post, we looked at a CloudFormation custom resource as a tool to implement a «sleep» delay between the creation of dependent resources within a stack. Custom resources are a powerful feature that can be used for many other kinds of logic and for interactions with third parties.