Accelerating drug discovery through a knowledge graph

May 31, 2022 Kannan Raman

Biopharma companies have an increasing need to accelerate insights specific to drug discovery by leveraging molecular, manufacturing, lab, and other data sets. A knowledge graph built on AWS and used across the drug discovery value chain could deliver that value.

Overview of the current situation

In 2019, the cost of R&D in pharma was estimated at $83 billion (Congressional Budget Office estimates). The number of new drugs approved for sale has also gone up significantly, an increase of 60% compared to the previous decade. Even so, accelerating drug discovery through digital transformation remains a game-changing opportunity as biopharmaceutical companies race to find therapies in under-served therapeutic areas.

Modernizing research sites to automate drug discovery workflows will allow organizations to realize a Pharma 4.0 vision. This new vision can be focused on outcomes specific to clinical, process development, and manufacturing data aggregation.

Over the last several years, biopharmaceutical companies have collected troves of data, in either siloed or connected formats, residing in workflows specific to research, discovery, and manufacturing processes. These data sources include:

  • In-silico analysis of molecular structures
  • Wet lab analysis during lead identification
  • Data generated during molecule creation in labs or in cell lines, depending on whether they are small or large molecules
  • Clinical trial data, where adverse events specific to a drug can be associated with different factors
  • Systematic documents used in bio-medicine to contextualize experimental data

When a discovery team embarks on new research for a therapy by evaluating candidate molecules, it is very likely that the organization already holds information about many of these molecules in multiple internal data systems.

However, this data is largely unavailable to the discovery team because it is distributed across systems and departments. Data from manufacturing inputs may also not be visible. It can take weeks or months to connect this data and get a full view of all the information available on a particular candidate molecule or any other subject of interest.

The solution

This problem of siloed, disconnected data sources can be solved by creating a knowledge graph. A knowledge graph makes it possible to quickly connect and find data. This helps in lead optimization, because factors drawn from previous experiments can inform decisions on the probability of success and on whether further research is worth pursuing.
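To make the idea concrete, here is a minimal sketch in Python of how records from separate silos might be merged into one graph keyed by a shared molecule identifier. The silo names, record fields, and identifiers such as `MOL-001` are all hypothetical, and an in-memory dictionary stands in for a real graph database.

```python
from collections import defaultdict

# Hypothetical records from three separate source systems (illustrative only).
IN_SILICO = [{"molecule": "MOL-001", "docking_score": -9.2}]
WET_LAB = [{"molecule": "MOL-001", "assay": "ASSAY-17", "result": "active"}]
CLINICAL = [{"molecule": "MOL-001", "trial": "TRIAL-3", "adverse_event": "nausea"}]


def build_graph(in_silico, wet_lab, clinical):
    """Return an adjacency map: node -> list of (edge_label, neighbor)."""
    graph = defaultdict(list)

    def link(src, label, dst):
        # Store both directions so lookups work from either end.
        graph[src].append((label, dst))
        graph[dst].append((f"inverse_{label}", src))

    for rec in in_silico:
        link(rec["molecule"], "has_docking_score", rec["docking_score"])
    for rec in wet_lab:
        link(rec["molecule"], "tested_in", rec["assay"])
        link(rec["assay"], "had_result", rec["result"])
    for rec in clinical:
        link(rec["molecule"], "studied_in", rec["trial"])
        link(rec["trial"], "reported", rec["adverse_event"])
    return graph


def facts_about(graph, node):
    """List everything directly connected to a node, ignoring inverse edges."""
    return [(label, dst) for label, dst in graph[node]
            if not label.startswith("inverse_")]


graph = build_graph(IN_SILICO, WET_LAB, CLINICAL)
molecule_facts = facts_about(graph, "MOL-001")
print(molecule_facts)
```

Once the records share one structure, a single lookup on the molecule returns its in-silico, wet lab, and clinical links together, rather than requiring a separate query against each source system.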

A solution design architecture leveraging multiple AWS services can effectively connect data across various silos and source systems. The consolidated data can then be applied to different downstream tasks.

Architecture of knowledge graph construction and downstream tasks

The architecture shown above addresses two primary needs:

  1. the user experience on the front-end, and
  2. the data centralization setup on the back-end.

In the process of developing a new drug product, access to research from different domains of past drug design and development is crucial. Data from these domains is scattered across different source systems, so a back-end pipeline is required to bring it together.

The back-end provides the infrastructure to ingest siloed data into Amazon Neptune, using AWS Glue jobs to run the necessary scheduled Extract, Transform, and Load (ETL) processes. AWS Glue first converts the data into a format that fits the knowledge graph and stages it in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Lambda then pushes the data from Amazon S3 into Amazon Neptune. The status of the ETL jobs is reported through Amazon Simple Notification Service (Amazon SNS) notifications and monitored using Amazon CloudWatch. This type of pipeline can help connect disparate data, which would otherwise take days to access individually, into an ever-growing knowledge graph.
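As an illustration of the staging step, the sketch below renders transformed rows into CSV files with the column headers that the Amazon Neptune bulk loader expects for Gremlin data (`~id`, `~label` for vertices and `~id`, `~from`, `~to`, `~label` for edges), and builds the JSON body for a loader request. The row contents, labels, bucket name, and role ARN are hypothetical, and the code stops short of any network call: in a pipeline like the one described, uploading the files to Amazon S3 and POSTing the request to the Neptune loader endpoint would be the job of the Lambda function.

```python
import csv
import io
import json

# Hypothetical rows as they might look after an AWS Glue transform step.
MOLECULES = [{"id": "MOL-001", "name": "compound-a"}]
ASSAYS = [{"id": "ASSAY-17", "molecule_id": "MOL-001"}]


def vertices_csv(molecules, assays):
    """Render vertices with the Gremlin bulk-load headers ~id and ~label."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", "name:String"])
    for m in molecules:
        writer.writerow([m["id"], "molecule", m["name"]])
    for a in assays:
        writer.writerow([a["id"], "assay", ""])  # no name property for assays
    return buf.getvalue()


def edges_csv(assays):
    """Render edges with the headers ~id, ~from, ~to, ~label."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for a in assays:
        edge_id = f"{a['molecule_id']}-{a['id']}"
        writer.writerow([edge_id, a["molecule_id"], a["id"], "tested_in"])
    return buf.getvalue()


def loader_request(s3_uri, role_arn, region):
    """Build the JSON body for a POST to the Neptune bulk loader endpoint."""
    return json.dumps({
        "source": s3_uri,
        "format": "csv",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
    })


vertex_data = vertices_csv(MOLECULES, ASSAYS)
edge_data = edges_csv(ASSAYS)
request_body = loader_request(
    "s3://example-bucket/graph/",  # hypothetical bucket
    "arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",  # hypothetical role
    "us-east-1",
)
print(vertex_data)
print(edge_data)
```

Keeping vertices and edges in separate files, each with stable identifiers, means repeated ETL runs can add new molecules and relationships without restructuring what is already in the graph.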

Furthermore, a predictive model can optionally be developed in Amazon SageMaker, using libraries like Deep Graph Library (DGL), to extend the knowledge graph for different downstream tasks. These include, but are not limited to, node classification for toxicity classification and edge inference for drug similarity recommendation. The outputs of such a model can further enrich the knowledge graph.
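To show what edge inference for drug similarity means in practice, the sketch below uses a deliberately simple baseline: it scores each pair of drugs by the Jaccard overlap of their graph neighbors and proposes a `similar_to` edge when the score passes a threshold. This is a stand-in for illustration only, not a graph neural network and not the DGL API. The drug names, targets, and threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical drug -> set of neighboring target nodes in the graph.
DRUG_TARGETS = {
    "drug-a": {"T1", "T2", "T3"},
    "drug-b": {"T2", "T3", "T4"},
    "drug-c": {"T5"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_similar_edges(drug_targets, threshold=0.4):
    """Propose (drug, drug, score) edges for pairs at or above the threshold."""
    suggestions = []
    for d1, d2 in combinations(sorted(drug_targets), 2):
        score = jaccard(drug_targets[d1], drug_targets[d2])
        if score >= threshold:
            suggestions.append((d1, d2, score))
    # Strongest candidates first.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


similar = suggest_similar_edges(DRUG_TARGETS)
print(similar)
```

A learned model would replace the `jaccard` score with a prediction from node embeddings, but the surrounding loop is the same: score candidate pairs, keep the confident ones, and write them back to the graph as new edges.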

Another downstream use of the knowledge graph is a search app, which can be deployed using AWS Amplify. The deployed app gives users access to the centralized knowledge graph so that they can search the constructed graph's data. As a result, end users can use what they find to make better decisions on the next steps towards drug development. For example, a search could lead to adjusting an experiment's parameter to increase drug product yield based on a past experiment on a similar drug.
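The kind of search described here can be sketched as a two-hop traversal: from a drug to its similar drugs, then to their past experiments, filtered and ranked by yield. The dictionaries below are hypothetical stand-ins for graph edges, and the field names (`temperature_c`, `yield_pct`) are assumptions made for illustration. A deployed app would run an equivalent query against the graph database instead.

```python
# Hypothetical "similar_to" edges and experiment records attached to drugs.
SIMILAR_TO = {"drug-new": ["drug-a", "drug-b"]}
EXPERIMENTS = {
    "drug-a": [{"id": "EXP-1", "temperature_c": 37, "yield_pct": 62}],
    "drug-b": [{"id": "EXP-2", "temperature_c": 32, "yield_pct": 78}],
}


def search_experiments(drug, similar_to, experiments, min_yield=0):
    """Find past experiments on similar drugs, highest yield first."""
    hits = []
    for neighbor in similar_to.get(drug, []):          # hop 1: similar drugs
        for exp in experiments.get(neighbor, []):      # hop 2: their experiments
            if exp["yield_pct"] >= min_yield:
                hits.append({"similar_drug": neighbor, **exp})
    return sorted(hits, key=lambda h: h["yield_pct"], reverse=True)


results = search_experiments("drug-new", SIMILAR_TO, EXPERIMENTS, min_yield=70)
print(results)
```

In this toy data, the only experiment above the yield cutoff ran at a lower temperature on a similar drug, which is the sort of lead a scientist might follow when choosing parameters for the next run.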


Siloed information, spread across an organization, is a challenge, but one that can be solved by implementing a knowledge graph. The graph can serve different end users, whether they are comparing the characteristics of different historic drugs or using the historic data to build predictions for a new drug design. Instituting a knowledge graph can help provide cost savings while enabling a faster time-to-market through shorter drug discovery cycle times and improved visibility into molecule dossier project tracking. Additionally, insights and timely reporting across the organization help accelerate drug discovery and reduce attrition of drug candidates, which can be achieved through knowledge graph constructs that incorporate dossier-level data inputs.

It can be challenging for biopharma and life sciences customers to understand where to start, which is why AWS offers workshops specifically designed to support and facilitate the development of knowledge graph programs. Reach out to your account executive or contact the AWS sales team to understand how you can get started with AWS to initiate or accelerate drug discovery using a knowledge graph.
