Exact Sciences operates laboratories across the world that produce data critical to the analysis and diagnostics used to classify cancer modalities, treatments, and therapeutics. The laboratories generate large data sets from on-premises genomic sequencing devices that must be sent to the cloud for processing. Once in the cloud, we process the data to perform research or determine patient results. This created a number of pain points: traditional data transfer workflows proved too inflexible to scale with our growth across many laboratory locations, and they required custom solutions to integrate with secondary processes.
As we modernized our processes, the new solution needed to provide scalability and near real-time data transfer to support rapid expansion and pop-up labs, including the large volumes of data they produce.
We needed to simplify management and reduce on-premises infrastructure and processing, which takes too long. Speed was essential, because we need to accelerate data migration to the cloud to decrease turnaround time. Lastly, it was important that our solution eliminate custom solutions for transfer and notification, increasing operational efficiency through integration.
In this blog post, we share our solution for real-time transfer of lab data, which we built using native AWS technologies to scale and adapt to our increasing laboratory needs. Our solution uses AWS Storage Gateway S3 File Gateways and Amazon Simple Storage Service (Amazon S3) to facilitate rapid, ad hoc laboratory expansion, process data in real time, notify downstream consumers (pipelines), and catalogue the data for long-term research initiatives.
The NGS Data Lake is powered by AWS Storage Gateway as the platform for data ingestion and notification. It uses Amazon DynamoDB, AWS Lambda functions, and Amazon Simple Notification Service (Amazon SNS) for event processing and notifications, and Amazon S3 for long-term data storage.
AWS Storage Gateway
Storage Gateway hardware appliances are placed on-site in laboratories in close proximity to the sequencing platforms. Each Storage Gateway has one or more SMB file shares, each dedicated to a specific sequencer and linked to an Amazon S3 bucket for data storage. The file share is mounted to the sequencing platform, and data is written in real time during sequencing and transferred immediately to the cloud. Each sequencing platform has unique data requirements, which can be reduced to a rate of data production (e.g. Gb/hour). From that rate, we can calculate how long a sequencer could run before filling up the cache if the appliance were unable to clear the cache by uploading data to AWS. We target 3 weeks, so that the sequencers can keep running full time even if there are issues uploading the data. This calculation is used to determine how many individual sequencing platforms a single storage appliance can support. In general, a single Storage Gateway appliance can support multiple sequencers, with a maximum of 10 sequencers per appliance to maintain a 1:1 relationship between sequencer and S3 bucket.
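The sizing calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not our production tooling: the function names, the example cache size, and the example data rate are hypothetical, and it assumes every sequencer on an appliance produces data at the same rate, with the cache size and rate expressed in the same unit.

```python
import math

HOURS_PER_WEEK = 7 * 24


def hours_until_cache_full(cache_size: float, rate_per_hour: float) -> float:
    """Hours one sequencer can run before filling the cache with no uploads."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return cache_size / rate_per_hour


def max_sequencers(cache_size: float, rate_per_hour: float,
                   buffer_weeks: int = 3, share_limit: int = 10) -> int:
    """Sequencers one appliance can hold while keeping the outage buffer.

    buffer_weeks=3 and share_limit=10 come from the text above; the
    assumption of identical sequencers is ours, for illustration only.
    """
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    # Data one sequencer would accumulate over the whole buffer period.
    per_sequencer = rate_per_hour * buffer_weeks * HOURS_PER_WEEK
    by_cache = math.floor(cache_size / per_sequencer)
    return min(by_cache, share_limit)


# Hypothetical numbers: a 16,000-unit cache and 5 units/hour per sequencer.
print(hours_until_cache_full(16000, 5))
print(max_sequencers(16000, 5))
```

With mixed sequencer types, the same idea applies by summing each platform's buffer requirement and comparing the total against the cache size.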
Storage Gateway file shares emit file upload events to Amazon EventBridge. We filter these events, queue them on Amazon SQS, and stream them to AWS Lambda for processing. Our AWS Lambda code recognizes when a new sequencing run has started based on file and directory characteristics and triggers a run started event. All processing events are stored in Amazon DynamoDB and sent to an Amazon SNS topic to notify downstream consumers. To enrich each event, we discover run metadata, including lab location and sequencer platform, by looking up tags on the associated file share. At the end of a sequencing run, the sequencer produces a predetermined run complete file (CopyComplete). We watch for this file as the trigger to emit a CopyComplete upload event and initiate the upload complete validation process.
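The classification step can be illustrated with a small, self-contained sketch. Everything here beyond the CopyComplete marker is an assumption made for the example: the key layout (run folder as the first path component), the tag names, the event dictionary shape, and the in-memory set standing in for the Amazon DynamoDB lookup. It is not the Lambda code itself.

```python
from typing import Optional

# The source names the marker "CopyComplete"; matching on the file name
# prefix is an assumption for this sketch.
RUN_COMPLETE_MARKER = "CopyComplete"


def run_id_from_key(object_key: str) -> Optional[str]:
    """Assume the first path component of the object key is the run folder."""
    parts = object_key.strip("/").split("/")
    return parts[0] if len(parts) > 1 else None


def classify_upload(object_key: str, share_tags: dict,
                    known_runs: set) -> Optional[dict]:
    """Return a run event for an uploaded file, or None if nothing to emit.

    known_runs is a stand-in for a persistent store such as DynamoDB.
    """
    run_id = run_id_from_key(object_key)
    if run_id is None:
        return None  # file is not inside a run folder
    file_name = object_key.rsplit("/", 1)[-1]
    if file_name.startswith(RUN_COMPLETE_MARKER):
        event_type = "copy_complete"
    elif run_id not in known_runs:
        known_runs.add(run_id)  # first file seen for this run
        event_type = "run_started"
    else:
        return None  # ordinary file in a run we already know about
    return {
        "type": event_type,
        "run_id": run_id,
        # Enrichment from file share tags (hypothetical tag names).
        "lab_location": share_tags.get("lab_location"),
        "sequencer_platform": share_tags.get("sequencer_platform"),
    }
```

Keeping the classification free of AWS calls makes it easy to unit test; the surrounding handler would be responsible for reading the queue, persisting the event, and publishing it.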
When the CopyComplete file appears, we confirm that all files in the run have been successfully uploaded to S3 by verifying that the entire folder is empty. This straightforward step is powered by the AWS Storage Gateway NotifyWhenUploaded API, which sends an asynchronous confirmation when the cache is empty. When we receive this notification, we trigger a run complete event, which flows through SNS to consumers. Because the data transfers in real time during the sequencing run, our uploads are often complete within minutes of the sequencing run finishing.
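A minimal sketch of requesting and matching that confirmation is shown below. The call to notify_when_uploaded and its NotificationId response field are part of the boto3 Storage Gateway client; the helper names are hypothetical, and the "notification-id" field on the incoming event is our assumption about the notification payload, so verify it against a real event before relying on it.

```python
def request_upload_confirmation(sgw_client, file_share_arn: str) -> str:
    """Ask the gateway to notify us once pending writes are uploaded.

    sgw_client is expected to be boto3.client("storagegateway"); it is
    passed in so the function can be exercised with a stub in tests.
    """
    response = sgw_client.notify_when_uploaded(FileShareARN=file_share_arn)
    return response["NotificationId"]


def is_matching_confirmation(event_detail: dict, pending_id: str) -> bool:
    """Check whether an incoming notification answers our pending request.

    The "notification-id" key is an assumed field name for this sketch.
    """
    return event_detail.get("notification-id") == pending_id
```

In a flow like ours, the returned ID would be stored alongside the run record when the CopyComplete file is seen, and the run complete event would be published only when a notification carrying the same ID arrives.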
In our data lake, each sequencer has its own Amazon S3 bucket dedicated to its AWS Storage Gateway file share. The entire data lake is provisioned through automation, so we can easily control bucket policies, lifecycle management, encryption, and similar settings. We configure every bucket with inventories to a centralized bucket and with appropriate replication and access logging policies. Our data lake is WORM (write once, read many), so our sequencing data is never modified or deleted. Consumers of the data lake are granted read-only access per their requirements.
Deploying a new file share and Amazon S3 bucket to the data lake requires only updating a configuration document to place a new sequencer ID on an existing S3 File Gateway. The automation provisions a file share on the File Gateway and links it to a new Amazon S3 bucket using the shared data lake bucket configuration settings. During the automation process, all file shares and buckets are catalogued in a separate Amazon DynamoDB table, including the details needed to mount the file share, such as the file share IP address and file share name. Because these resources are virtual, we can easily shift where file shares are deployed to move capacity around as needed. Once the file share has been provisioned, onsite technical staff configure the sequencer to write to it, which concludes the installation process.
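To illustrate the configuration-driven approach, the sketch below compares a desired configuration against a catalogue of existing file shares and returns what still needs to be provisioned. The configuration shape, the gateway and sequencer identifiers, and the bucket naming pattern are all hypothetical; the limit of 10 shares per gateway reflects the per-appliance maximum mentioned earlier.

```python
MAX_SHARES_PER_GATEWAY = 10  # per-appliance maximum described in the text


def plan_new_shares(config: dict, catalog: set) -> list:
    """Return the file shares that must be created to match the config.

    config maps a gateway ID to the list of sequencer IDs placed on it.
    catalog holds (gateway_id, sequencer_id) pairs already provisioned,
    standing in for the DynamoDB catalogue table.
    """
    plan = []
    for gateway_id, sequencer_ids in sorted(config.items()):
        if len(sequencer_ids) > MAX_SHARES_PER_GATEWAY:
            raise ValueError(
                f"{gateway_id} has {len(sequencer_ids)} sequencers; "
                f"limit is {MAX_SHARES_PER_GATEWAY}"
            )
        for sequencer_id in sequencer_ids:
            if (gateway_id, sequencer_id) in catalog:
                continue  # already provisioned, nothing to do
            plan.append({
                "gateway_id": gateway_id,
                "sequencer_id": sequencer_id,
                # Hypothetical naming pattern: one bucket per sequencer.
                "bucket_name": f"ngs-data-lake-{sequencer_id.lower()}",
            })
    return plan
```

Because the function is idempotent with respect to the catalogue, rerunning the automation after a partial failure only plans the shares that are still missing.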
There are two steps for deployment. First, we procure and install AWS Storage Gateway devices if we don’t have enough existing capacity at the site. We have dashboards that show available capacity at each site, so we know whether we require additional hardware. If we do, we can order from our preferred reseller or, in the US, through CDW, and the hardware is shipped directly to the site and racked. Second, once the device is online, we manage the rest through AWS APIs and automation.
Exact Sciences has implemented AWS Storage Gateway as the foundation for the NGS Data Lake on AWS, relying on its flexibility, scalability, ease of management, and native AWS integrations with decoupled services to rapidly scale NGS data transfer solutions across the country. Since our initial deployment, we have uploaded hundreds of sequencing runs (many TB of data) across 4 laboratory locations in 3 different time zones, with a footprint of 9 Storage Gateway physical appliances serving 25 sequencing devices. Our infrastructure and provisioning process have become standardized, and new sequencing platforms are brought online faster than ever. The solution requires a small capital investment but scales to provide robust processing for short-term, time-sensitive workloads and long-term data lake capabilities, all while reducing our on-premises footprint in favor of cloud native services.