Standardizing quantification of expression data at Corteva Agriscience with Nextflow and AWS Batch

February 8, 2021 Anand Venkatraman

Authored by Anand Venkatraman, Bioinformatics Associate Research Scientist at Corteva Agriscience, and Srinivasarao Annapareddi, Cloud DevOps Engineer at Corteva Agriscience. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Data analysis in biological research today presents some interesting conundrums and challenges, including a rapidly increasing number and complexity of analytical methods, and many implementations of major algorithms and tools that do not scale well. As a result, reproducing the results of a pipeline or workflow can be challenging given the number of components, each having its own set of parameters, dependencies, supporting files, and installation requirements.

Corteva Agriscience is an agriscience company completely dedicated to agriculture, with the purpose of enriching the lives of those who produce and those who consume by ensuring progress for generations to come. At Corteva Agriscience, expression analysis continues to increase in complexity and scale, and it faces all of the data-analysis challenges described above. This led to the creation of the Standardized Corteva Quantification Pipeline. By codifying best practices for producing standardized quantification data, the pipeline takes a crucial first step toward taming the data chaos of expression analysis while serving the needs of subject matter experts and downstream data management strategies.

Standardized Corteva Quantification Pipeline for Expression: Implementation with Nextflow and AWS Batch

Given these challenges and complexities, there were two possible paths for implementing the Standardized Corteva Quantification Pipeline:

  1. Maintain on-premises infrastructure with a large pool of compute resources always available, knowing that at times 90% of that capacity might sit idle or underutilized, and that in peak seasons demand might still exceed what the infrastructure could deliver.
  2. Spin up and down instances on the cloud on demand.

Keeping in mind how quickly expression data was needed, it became increasingly clear to us that the most viable solution was the ability to spin up and down instances on the cloud on demand. Corteva Agriscience uses AWS Batch for many projects; it is a set of batch management capabilities that enables you to easily and efficiently run hundreds or thousands of batch computing jobs on AWS. We wanted to develop a solution that builds on AWS Batch without duplicating the capabilities and processes it already provides. Nextflow with AWS Batch + Spot Instances fit this scenario perfectly, because Nextflow extends AWS Batch functionality in several important ways.

The team’s decision to use Nextflow with AWS Batch as the solution for standardizing quantification data for expression was primarily based on these four (of many) salient features:

  1. Nextflow removes manual AWS Batch configuration steps by automatically creating the required job definitions and submitting job requests as needed.
  2. Nextflow spins up the required computing instances, scaling up and down the number and composition of the instances to best accommodate the actual workload resource needs at any given point in time.
  3. Nextflow synergizes the auto-scaling ability provided by AWS Batch along with the use of spot instances to bring about huge savings in cost, time, and resources.
  4. Nextflow can reschedule failed jobs automatically, providing a truly fault-tolerant environment.

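To give a sense of how little wiring this takes, the sketch below shows a minimal nextflow.config for the AWS Batch executor. It is illustrative only: the queue name, bucket, and region are placeholders, not our production settings.

nextflow-config

// Minimal sketch of an AWS Batch executor configuration (placeholder values)
process.executor = 'awsbatch'                       // run each task as an AWS Batch job
process.queue    = 'ExecutorQueue'                  // Batch job queue backed by Spot instances
workDir          = 's3://example-bucket/nf-work'    // task inputs and outputs are staged through S3
aws.region       = 'us-east-1'
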
Standardized Corteva Quantification Pipeline for Expression: Bioinformatics tools, AWS compute environment, and architecture

The standardized quantification pipeline for expression, written in Nextflow, uses these bioinformatics software programs: FastQC, BBTools, fastp, Salmon, tximport, and MultiQC. The underlying compute environment on AWS can scale up to 1,024 vCPUs using a combination of r4.8xlarge, r5.8xlarge, r5d.8xlarge, and r5a.8xlarge instance types, depending on the compute or memory needs of each bioinformatics process within the workflow. The architecture implemented with Nextflow and AWS Batch + Amazon EC2 Spot Instances is depicted in Figure 1.

Figure 1: Nextflow + AWS Batch architecture for Standardized Corteva Quantification Pipeline for Expression
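
Each process in the workflow declares its own CPU and memory needs, and AWS Batch uses those requirements to place jobs onto suitable instances from the pool above. The sketch below shows how such declarations can be expressed in the Nextflow configuration; the process names and resource figures are illustrative, not the pipeline's actual settings.

process-resources

// Illustrative per-process resource declarations (names and values are examples)
process {
    withName: 'SALMON_QUANT' {
        cpus   = 16
        memory = '120 GB'    // steers placement toward memory-optimized r4/r5 instances
    }
    withName: 'FASTQC' {
        cpus   = 2
        memory = '8 GB'
    }
}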

Notable parts of the architecture are the Scheduler Batch Node, the Executor Batch Node, and the EC2 launch template used by Batch.

The Scheduler Batch Node is an EC2 instance launched by AWS Batch to run the main Nextflow process, which schedules the workflow's tasks. It is important that this node use On-Demand Instances so that the main scheduling process is not interrupted. To enable this, we created an AWS Batch compute environment and job queue dedicated to scheduler nodes, using a CloudFormation snippet like the following:

scheduler-batch-node

{
    "SchedulerCE": {
        "Type": "AWS::Batch::ComputeEnvironment",
        "Properties": {
            "Type": "MANAGED",
            "ServiceRole": {
                "Ref": "BatchServiceRole"
            },
            "ComputeEnvironmentName": {
                "Ref": "SchedulerComputeEnv"
            },
            "ComputeResources": {
                "MinvCpus": 0, "MaxvCpus": 128,  "DesiredvCpus": 0,
                "SecurityGroupIds": [ { "Ref": "BatchSecurityGroup" } ],
                "Type": "EC2",
                "Subnets": [
                    {
                        "Fn::ImportValue": { "Fn::Sub": "PrivateSubnetOneV1"  }
                    },
                    {
                        "Fn::ImportValue": {"Fn::Sub": "PrivateSubnetTwoV1"   }
                    }
                ],
                "ImageId":      { "Ref": "SchedulerAMI"   },
                "InstanceRole": { Ref": "InstanceRole" },
                "InstanceTypes": [ "optimal"  ],
                "Ec2KeyPair": { "Ref": "KeyName"  }
            },
            "State": "ENABLED"
        }
    },
    "SchedulerQueue": {
        "Type": "AWS::Batch::JobQueue",
        "Properties": {
            "ComputeEnvironmentOrder": [
                {
                    "Order": 1,
                    "ComputeEnvironment": {
                        "Ref": "SchedulerCE"
                    }
                }
            ],
            "State": "ENABLED",
            "Priority": 1,
            "JobQueueName": {
                "Ref": "SchedulerQueue"
            }
        }
    },
    "Scheduler": {
        "Type": "AWS::Batch::JobDefinition",
        "Properties": {
            "Type": "container",
            "JobDefinitionName": {
                "Ref": "SchedulerJobDef"
            },
            "ContainerProperties": {
                "Memory": 1024, "Privileged": true, "JobRoleArn": { "Ref": "JobRoleARN" },
                "ReadonlyRootFilesystem": false,
                "Vcpus": 1,
                "Image": { "Ref": "SchedulerImage"  }
            }
        }
    }
}

Workflow processes run in Executor Batch Nodes, which can run on EC2 Spot Instances. Again, we created a dedicated compute environment and job queue just for executor nodes, using a CloudFormation snippet like the following:

executor-batch-node

{
    "ExecutorCE": {
        "Type": "AWS::Batch::ComputeEnvironment",
        "Properties": {
            "Type": "MANAGED",
            "ServiceRole": {
                "Ref": "BatchServiceRole"
            },
            "ComputeEnvironmentName": {
                "Ref": "ExecutorComputeEnv"
            },
            "ComputeResources": {
                "SpotIamFleetRole": {
                    "Ref": "SpotFleetRole"
                },
                "BidPercentage": 80,
                "MinvCpus": 0, "MaxvCpus": 2000,  "DesiredvCpus": 0,
                "SecurityGroupIds": [
                    {
                        "Ref": "BatchSecurityGroup"
                    }
                ],
                "Type": "spot",
                "Subnets": [
                    {
                        "Fn::ImportValue": {
                            "Fn::Sub": "PrivateSubnetOneV1"
                        }
                    },
                    {
                        "Fn::ImportValue": {
                            "Fn::Sub": "PrivateSubnetFiveV1"
                        }
                    }
                ],
                "ImageId": {
                    "Ref": "imageID"
                },
                "InstanceRole": {
                    "Ref": "InstanceRole"
                },
                "InstanceTypes": [
                "r5.large", "r4.large", "r5.8xlarge", "r4.8xlarge", “r5a.8xlarge", "r5d.8xlarge"
                ],
                "Ec2KeyPair": {
                    "Ref": "KeyName"
                }
            },
            "State": "ENABLED"
        }
    },
    "ExecutorQueue": {
        "Type": "AWS::Batch::JobQueue",
        "Properties": {
            "ComputeEnvironmentOrder": [
                {
                    "Order": 1,  "ComputeEnvironment": { "Ref": "ExecutorCE"  }
                }
            ],
            "State": "ENABLED",
            "Priority": 1,
            "JobQueueName": {
                "Ref": "ExecutorQueue"
            }
        }
    }
}
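
Because the executor nodes run on Spot capacity that can be reclaimed, it is worth pairing this queue with a retry policy in the workflow configuration so interrupted tasks are resubmitted rather than failing the run. The following is a hedged sketch; the exit codes and retry count are illustrative choices, not prescriptive values.

spot-retry-config

// Illustrative: retry tasks that were interrupted (for example, when a Spot instance is reclaimed)
process {
    errorStrategy = { task.exitStatus in [143, 137] ? 'retry' : 'terminate' }
    maxRetries    = 3
}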

Executor Batch Nodes also need some minimal provisioning to work with Nextflow; in particular, the AWS CLI must be available on each host so that Nextflow can stage task data to and from Amazon S3. For this, we used a custom launch template like the following, which installs Miniconda and the AWS CLI at instance launch:

ec2_batch_template

{
    "Resources": {
        "EC2LaunchTemplate": {
            "Type": "AWS::EC2::LaunchTemplate",
            "Properties": {
                "LaunchTemplateName": {
                    "Fn::Join": [
                        "-",
                        [
                            {
                                "Ref": "EC2LaunchtemplateName"
                            },
                            {
                                "Fn::Select": [
                                    2,
                                    {
                                        "Fn::Split": [
                                            "/",
                                            {
                                                "Ref": "AWS::StackId"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    ]
                },
                "LaunchTemplateData": {
                    "InstanceMarketOptions": {
                        "MarketType": "spot"
                    },
                    "BlockDeviceMappings": [
                        {
                            "Ebs": {
                                "DeleteOnTermination": true, "VolumeSize": 80,
                                "VolumeType": "gp2"
                            },
                            "DeviceName": "/dev/xvda"
                        }
                    ],
                    "UserData": {
                        "Fn::Base64": {
                            "Fn::Sub": "MIME-Version: 1.0\nContent-Type: multipart/mixed; boundary=\"==BOUNDARY==\"\n\n--==BOUNDARY==\nContent-Type: text/cloud-config; charset=\"us-ascii\"\n\npackages:\n- wget\n\nruncmd:\n- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh\n- bash /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh -b -f -p /home/ec2-user/miniconda\n- /home/ec2-user/miniconda/bin/conda install -c conda-forge -y awscli\n- rm /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh\n\n--==BOUNDARY==--\n"
                        }
                    }
                }
            }
        }
    },
    "Outputs": {
        "LaunchTemplateId": {
            "Description": "NFS EC2 Launch Template ID for AWS Batch use ",
            "Value": {
                "Ref": "EC2LaunchTemplate"
            }, "Export": {"Name": "batch-ec2-launchtemplate"} 
      } 
    }
}
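
The AWS CLI installed under /home/ec2-user/miniconda by the UserData above is what Nextflow invokes on each executor host to stage task inputs and outputs through Amazon S3. A sketch of the matching Nextflow configuration entry, assuming that install path, looks like this:

aws-cli-path-config

// Point Nextflow at the AWS CLI installed by the launch template's UserData
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'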

Conclusions

The Standardized Corteva Quantification Pipeline implemented with Nextflow and AWS Batch + EC2 Spot Instances has resulted in significant savings in time and resources, as one can spin up a multitude of data analytics jobs on an as-needed basis without maintaining always-on infrastructure. The pipeline has given Corteva researchers worldwide a best-practices platform to quantify expression data at scale, letting them focus on data analysis rather than on computational scalability and on-premises infrastructure costs.

After deployment, this pipeline continues to be heavily used to generate large datasets in a time-critical manner. Data generated by the pipeline feeds into an in-house expression data repository. The importance of this pipeline is underscored by the fact that it serves as a crucial component for researchers developing innovative ways to assess regulatory and genic regions of genomes.

The Nextflow + AWS Batch solution we developed for the Standardized Corteva Quantification Pipeline will serve as a prototype for developing similar computationally intensive genomics pipelines (e.g., annotation, assembly) of varying complexities that are well positioned to be orchestrated in Nextflow. These efforts will help researchers answer important biological and computational questions, while enabling fault tolerance, automation, and reproducibility in the pipelines being deployed. The prototype will also serve as a guide for migrating legacy bioinformatics workflows to Nextflow, which will help overcome maintenance and reproducibility issues in those pipelines.

Next Steps

To learn more about the technologies in this blog, see Nextflow on AWS Batch and AWS Biotech Blueprint with Nextflow.
