How to manage/monitor multiple AWS batch jobs that should all complete successfully before the next step?

I am working on a system that launches multiple AWS Batch jobs, each of which takes between 5 and 15 minutes on average to complete.

I need a mechanism that tells me once all the jobs have completed successfully, or whether any failures have occurred, so that I can proceed to the next step accordingly.
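One way I could imagine doing this (a rough sketch, not a settled design) is to poll the Batch API until every job reaches a terminal state. The `wait_for_jobs` function and the `poll_seconds` default below are my own placeholders; only `describe_jobs` (which accepts up to 100 job IDs per call) is the real boto3 API:

```python
# Sketch: poll AWS Batch until every submitted job reaches a terminal state.
# job_ids would come from earlier submit_job calls (assumed to exist already).
import time

TERMINAL = {"SUCCEEDED", "FAILED"}

def summarize(statuses):
    """Count outcomes from a list of Batch job status strings."""
    return {
        "succeeded": sum(s == "SUCCEEDED" for s in statuses),
        "failed": sum(s == "FAILED" for s in statuses),
        "pending": sum(s not in TERMINAL for s in statuses),
    }

def wait_for_jobs(job_ids, poll_seconds=60):
    """Poll describe_jobs (max 100 IDs per call) until no job is pending."""
    import boto3  # imported here so the pure helper above needs no AWS access
    client = boto3.client("batch")
    while True:
        statuses = []
        for i in range(0, len(job_ids), 100):
            resp = client.describe_jobs(jobs=job_ids[i:i + 100])
            statuses.extend(job["status"] for job in resp["jobs"])
        counts = summarize(statuses)
        if counts["pending"] == 0:
            return counts
        time.sleep(poll_seconds)
```

The returned counts could then feed directly into the failure-threshold check described below.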

I also don't yet have a good strategy for handling errors. For example, how should I deal with the case where only one or a few jobs have failed after a certain number of retries? My first guess is to define a failure threshold that dictates whether or not the overall step in the process (i.e. the collection of AWS Batch jobs) can/should proceed. Something along the lines of

if failed_jobs > failed_job_threshold:
    raise RuntimeError("Too many failed jobs")

The process recreates/repopulates a database table periodically (i.e. the overall job runs each month to recreate/repopulate the table). Any individual batch job failure therefore leaves the table incomplete and requires attention.
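To avoid ever exposing a partially populated table, I've been wondering whether loading into a staging table and swapping it in only after all jobs succeed would be a reasonable pattern. PostgreSQL's DDL is transactional, so the rename swap is atomic (table names here are placeholders):

```sql
-- Sketch: swap in the fully loaded staging table atomically.
BEGIN;
ALTER TABLE monthly_table RENAME TO monthly_table_old;
ALTER TABLE monthly_table_staging RENAME TO monthly_table;
DROP TABLE monthly_table_old;
COMMIT;
```

If the failure threshold is exceeded, the staging table is simply dropped and the previous month's data remains untouched.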

Is there a “best practice” architecture for handling this use case?

My development stack includes Python, Terraform, Bitbucket Pipelines, AWS (Lambda, Batch, SQS, RDS/Aurora, etc.), and PostgreSQL.

Author: James Adams