I am working on a system that involves launching multiple AWS Batch jobs, each of which takes roughly 5 to 15 minutes to complete.
I need a mechanism that lets me know once all of the jobs have completed successfully, or whether any failures have occurred, so that I can proceed to the next step accordingly.
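To make the question concrete, here is a rough sketch of the polling-based approach I could fall back on: submit the jobs with boto3, then poll until every job reaches a terminal state. The queue name, job definition, and parameters are placeholders, not my real configuration.

import time
import boto3

batch = boto3.client("batch")

def submit_jobs(job_inputs):
    # Submit one AWS Batch job per input and collect the job IDs.
    # "monthly-refresh-queue" and "table-refresh-jobdef" are placeholder names.
    job_ids = []
    for i, params in enumerate(job_inputs):
        response = batch.submit_job(
            jobName=f"table-refresh-{i}",
            jobQueue="monthly-refresh-queue",
            jobDefinition="table-refresh-jobdef",
            parameters=params,
        )
        job_ids.append(response["jobId"])
    return job_ids

def wait_for_jobs(job_ids, poll_seconds=60):
    # Poll until every job reaches a terminal state (SUCCEEDED or FAILED),
    # then return the final status of each job keyed by job ID.
    # describe_jobs accepts at most 100 job IDs per call, hence the chunking.
    statuses = {}
    while len(statuses) < len(job_ids):
        pending = [j for j in job_ids if j not in statuses]
        for start in range(0, len(pending), 100):
            chunk = pending[start:start + 100]
            for job in batch.describe_jobs(jobs=chunk)["jobs"]:
                if job["status"] in ("SUCCEEDED", "FAILED"):
                    statuses[job["jobId"]] = job["status"]
        if len(statuses) < len(job_ids):
            time.sleep(poll_seconds)
    return statuses

Polling from a long-running process like this would work, but I suspect an event-driven approach (e.g. AWS Batch job state-change events routed through EventBridge to a Lambda) is closer to best practice, which is part of what I am asking.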
I also don’t yet have a good strategy for handling errors. For example, how should I deal with the case where only one or a few jobs have failed after a certain number of retries? My first guess is to have a failure threshold that dictates whether or not the overall step in the process (i.e. the collection of AWS Batch jobs) can/should proceed, something along the lines of
if failed_jobs > failed_job_threshold: raise RuntimeError("Too many failed jobs")
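Fleshed out a little, and assuming the polling sketch above has returned the final status of each job, the threshold check might look something like this (the default threshold is just a placeholder):

def check_failures(statuses, failed_job_threshold=0):
    # Count failed jobs and decide whether the overall step may proceed.
    failed_jobs = sum(1 for status in statuses.values() if status == "FAILED")
    if failed_jobs > failed_job_threshold:
        raise RuntimeError(
            f"Too many failed jobs: {failed_jobs} of {len(statuses)} "
            f"(threshold: {failed_job_threshold})"
        )
    return failed_jobs

In my case the threshold may effectively need to be zero, for the reason described next.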
The process recreates/repopulates a database table periodically (i.e. the overall job runs each month to rebuild the table), so any individual batch job failure leaves the table incomplete and will require attention.
Is there a “best practice” architecture for handling this use case?
My development landscape includes Python, Terraform, Bitbucket Pipelines, AWS (Lambda, Batch, SQS, RDS/Aurora, etc.), and PostgreSQL.