Artificial Intelligence in Drug Discovery and Hit Finding

A novel machine learning technology for in silico small molecule screening at scale

White paper - ZeptoNet, a novel AI technology for virtual screening of small molecules

Understanding how to facilitate and promote drug innovation has always been crucial for health care developments. Drug discovery remains one of the most important stages of the drug development process, because it is during this stage that chemical compounds with the potential to become candidate medications are identified. A crucial step in drug discovery is therefore finding chemical compounds that interact in a meaningful way with biological targets that cause or influence a disease. Drug discovery from small compounds is, however, a slow and uncertain process. Despite the deep expertise, funding and time invested in drug discovery campaigns, the failure rate remains high and setbacks are still frequent. This uncertainty and complexity explain why it can take a long time to find the right cure for new diseases. Hence, advances that facilitate this process are crucial in the field of drug discovery.

To facilitate the drug discovery process, Kantify has developed ZeptoNet, a transformative Artificial Intelligence (AI) technology. This technology can be used to improve and accelerate the early stages of the drug discovery process through virtual high-throughput screening (vHTS). It therefore addresses the core problem faced by the drug discovery process, namely the long time that it takes for new drugs to be discovered. Virtual HTS uses computer algorithms to predict whether a compound will be active for a given bioassay, leading to a faster drug discovery process and a lower clinical failure rate. Thanks to a series of innovations, ZeptoNet is opening new opportunities for drug discovery, which are discussed in this paper.

Virtual High-Throughput Screening

The process of drug discovery from small chemical compounds can be divided into 7 main steps:

1. Target Validation, where the biological source of a disease is discovered and confirmed.

2. Assay Development, where the target is tested to assess the activity of a drug/biochemical.

3. High-Throughput Screening (HTS), where a vast number of chemical or biological tests are quickly undertaken to identify compounds that are biologically relevant.

4. Hit to Lead, where small molecule hits from HTS are assessed to identify lead compounds.

5. Lead Optimization, where drug candidates which are strong and safe are brought into preclinical trials.

6. Preclinical Trials, where extensive testing is performed and drug safety data are collected. Laboratory animals are typically involved at this stage.

7. Clinical Trials, where the observations and experiments are done on human participants.

During in vitro high-throughput screening, batches of small compounds are tested to measure bioactivity against a target, such as a protein. Bioactivity can involve the activation or deactivation of a target. When a compound is found active, it can be retained as a primary hit. This means that the molecule has shown the desired type of activity in a screening assay (the compound is active) and will be used in later drug discovery stages.

Unlike HTS, computer-aided HTS or vHTS works in silico. This means that vHTS is performed on a computer or through computer simulation. There are several categories of vHTS, two of which are machine learning and docking. Docking is a structure-based in silico method that models the interaction of two or more molecules. It is used to predict both the binding affinity between ligand and protein and the structure of the protein-ligand complex. Although docking has been successful, it has certain limitations that need to be taken into consideration. Docking software tends to depend heavily on the characteristics of the binding site and ligand, making it difficult to determine the best search algorithm and scoring function for a given study. In addition, docking demands a large amount of computation, which can make it rather slow.

Machine learning has emerged as a new and very promising technology for the challenging process of finding early stage drug candidates. Machine learning is a broad field of Artificial Intelligence, whose main function in vHTS relates to supervised learning. Supervised learning tasks aim at identifying sets of rules by analyzing representative examples (e.g. learning to recognize images of cats and dogs based on annotated samples of images used as training examples). Lavecchia (2015) offers an overview of (supervised) machine learning algorithms in drug discovery. In the context of vHTS, supervised learning models are shown chemical compounds labeled as hits or misses based on previous HTS trials. This allows for the detection of new hits among compounds that have not been screened yet. The Merck Molecular Activity Challenge hosted by Kaggle in 2012 can be seen as the first large-scale event that sparked interest in the use of machine learning in virtual HTS. Ever since, the idea of using machine learning in vHTS has been considered very attractive in the scientific community.
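As a rough illustration of the supervised setting described above, the sketch below labels an unscreened compound by majority vote of its most similar previously screened compounds. It is a deliberately simple nearest-neighbour stand-in, not the method used by ZeptoNet. The fingerprints, the `predict_active` helper and the value of `k` are all assumptions made for the example.

```python
from typing import FrozenSet, List, Tuple

# A binary fingerprint, stored as the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_active(
    query: Fingerprint,
    training_set: List[Tuple[Fingerprint, bool]],
    k: int = 3,
) -> bool:
    """Majority vote of the k screened compounds most similar to the query."""
    ranked = sorted(
        training_set,
        key=lambda item: tanimoto(query, item[0]),
        reverse=True,
    )
    votes = [is_hit for _, is_hit in ranked[:k]]
    return sum(votes) * 2 > len(votes)


# Hypothetical outcome of a previous HTS trial: (fingerprint, was it a hit?)
screened = [
    (frozenset({1, 2, 3, 4}), True),
    (frozenset({1, 2, 3, 5}), True),
    (frozenset({2, 3, 4, 6}), True),
    (frozenset({10, 11, 12}), False),
    (frozenset({10, 12, 13}), False),
    (frozenset({11, 13, 14}), False),
]

# A compound that has not been screened yet (illustrative fingerprint).
unscreened = frozenset({1, 2, 4, 6})
print(predict_active(unscreened, screened))  # True
```

In practice the fingerprints would come from a cheminformatics toolkit and the classifier would be a trained model, but the data flow is the same: labels from past trials, predictions for compounds not yet screened.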

Limitations and Challenges

While being promising, the computational methods mentioned so far suffer from fundamental problems:

1. Slow: Computational methods of drug discovery tend to be computationally very expensive. Docking algorithms, for example, require significant time to run for each candidate, making extensive searching difficult.

2. Lossy: The algorithms typically include only specific and limited data to make predictions. Traditional machine learning algorithms, such as Support Vector Machines or Random Forests, can only include data in tabular format. They cannot easily accept 3D representations of drug candidates without lossy preprocessing steps.

3. From scratch: Most algorithms are target-specific. This means that additional information about unrelated proteins that could share similar binding processes cannot be used to improve the algorithms. In practice, these algorithms are trained from scratch for each specific target.

4. Data hungry: Machine-learning vHTS algorithms suffer from a “chicken and egg” problem: in order to train an algorithm, a large set of data is needed (usually from existing HTS runs). If this data already exists, there is usually no need for a vHTS algorithm, because researchers already know which candidates are likely to bind to the target protein. Conversely, the algorithms do not work well on limited datasets.

A disruptive, novel drug discovery approach

Kantify has developed and trained a deep neural network, ZeptoNet, suited for predicting whether or not a drug candidate shows a certain type of bioactivity (e.g. inhibition or activation) on a protein. ZeptoNet differs from previous technologies in three main ways. First, it has a universal approach meaning that it can collect data from unrelated HTS trials in order to understand fundamental biophysical mechanisms. Second, ZeptoNet is extensible and able to include a wide range of features from both proteins and other compounds. Third, it is an efficient technology that accelerates the drug discovery process while being inexpensive compared to previous systems.

Key technical innovations

Over the past decade, deep learning, a branch of machine learning loosely inspired by the way the brain processes data to form patterns, has emerged as the state-of-the-art approach in computer vision and natural language processing tasks. The promise of deep learning for accelerating drug discovery is high, and excitement in both the scientific and pharmaceutical communities has kept growing over time.

Computational methods based on deep learning already exist, but Kantify’s innovations open up the following new possibilities.

1. Use of transfer learning: Transfer learning is a mechanism used in deep learning where knowledge from a generic problem is transferred to a domain-specific problem. For instance, Convolutional Neural Networks trained on large general-purpose image datasets, such as ImageNet, are known to demonstrate great classification performance in more constrained settings (e.g. discriminating images of cats and dogs). Zeiler and Fergus (2014) have shown that this kind of mechanism works because Deep Neural Networks can learn fundamental aspects of human vision such as edges, straight lines or circles. In the same way, ZeptoNet has been trained on a set of 100+ bioassays using transfer learning. ZeptoNet is therefore engineered as a universal model, which may be optimized towards new and independent HTS trials. As such, instead of having to be trained from scratch - which is a common process in hit discovery efforts - ZeptoNet learns underlying biophysical processes that lead to bioactivity based on a variety of bioassays.

In this regard, Kantify’s approach is radically different from usual vHTS methods, including so-called “multitask” machine learning approaches. Because of this innovation, ZeptoNet can improve the drug discovery process by requiring less data to find drug candidates while finding them more reliably than traditional algorithms.

2. Use of novel featurization of both proteins and compounds: Featurization is the process of turning information into a format that is understandable for machine learning algorithms. Most machine learning algorithms have traditionally been limited to using data in tabular format. ZeptoNet includes a novel featurization pipeline that allows both tabular (e.g. compound properties and fingerprints) and non-tabular data to be included (e.g. the graph structure of the compound or target protein). This data can be used as input for the machine learning algorithm, resulting in increased predictive performance.

3. Use of active learning: Active learning is a mechanism where a machine learning algorithm helps choose which data points should be labeled next, and is retrained as the new labels arrive. Within machine-learning-based virtual HTS, this technique can be used to iteratively run a limited HTS trial, use the information from that trial to train the machine learning algorithm, and use the output of the algorithm to select the best candidates for the next HTS trial. ZeptoNet has been specifically tailored to significantly accelerate active learning. This results in fewer trials and fewer screened compounds to find hits in a compound library.
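To make the featurization idea in item 2 more concrete, here is a minimal sketch of a record that carries tabular inputs (descriptors, fingerprint) alongside a non-tabular graph structure for the same compound. The `CompoundFeatures` class and its field names are hypothetical, and the descriptor and fingerprint values are illustrative. ZeptoNet's actual featurization pipeline is not described in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CompoundFeatures:
    """Hypothetical container mixing tabular and graph inputs for one compound."""

    descriptors: Dict[str, float]  # tabular: named physico-chemical properties
    fingerprint: List[int]         # tabular: fixed-length bit vector
    atoms: List[str]               # graph nodes: one element symbol per atom
    bonds: List[Tuple[int, int]]   # graph edges: pairs of atom indices

    def adjacency(self) -> List[List[int]]:
        """Neighbour list per atom, a common input form for graph models."""
        neighbours: List[List[int]] = [[] for _ in self.atoms]
        for i, j in self.bonds:
            neighbours[i].append(j)
            neighbours[j].append(i)
        return neighbours


# Ethanol (CH3-CH2-OH), heavy atoms only. The fingerprint bits are made up.
ethanol = CompoundFeatures(
    descriptors={"molecular_weight": 46.07, "h_bond_donors": 1.0},
    fingerprint=[1, 0, 0, 1, 0, 1, 0, 0],
    atoms=["C", "C", "O"],
    bonds=[(0, 1), (1, 2)],
)

print(ethanol.adjacency())  # [[1], [0, 2], [1]]
```

A purely tabular model would only see `descriptors` and `fingerprint`; keeping `atoms` and `bonds` preserves connectivity that would otherwise be lost in preprocessing.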

Based on its unique set of features and the application of transfer learning, ZeptoNet achieves results beyond the state of the art in predictive performance, data requirement reduction, computational efficiency and hit finding acceleration rate.

ZeptoNet solves the challenges listed above in the following way:

Fast: ZeptoNet accelerates the finding of hits while having a superior performance.

Rich: ZeptoNet’s architecture makes it possible to integrate a variety of data inputs. This leads to very accurate predictions, not only of the likelihood of a hit, but also of various physical and chemical properties.

Pretrained: ZeptoNet does not have to be trained from scratch as it is pretrained on several bioassays and already has an understanding of biophysical properties.

Data lean: As ZeptoNet is pretrained, it can start making accurate predictions even when there are very few known hits.
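The "Pretrained" and "Data lean" points can be illustrated with a toy experiment: a model first trained on plentiful data from other assays, then fine-tuned on a handful of examples from a new assay, is compared with the same model trained from scratch on those few examples. This is only a sketch of the general transfer learning idea using a tiny logistic regression. The data, the two-feature representation and the step counts are assumptions, and nothing here reflects ZeptoNet's architecture.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, 1 = active / 0 = inactive)


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Logistic model; the last weight is the bias term."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Mean binary cross-entropy of the model on a data set."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(weights: Sequence[float], data: Sequence[Example],
          steps: int, lr: float = 0.5) -> List[float]:
    """Full-batch gradient descent, starting from the given weights."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err  # gradient of the bias
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


# Hypothetical data: many labeled compounds from earlier assays ...
source_assays = [
    ([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.8, 0.1], 1),
    ([0.1, 0.9], 0), ([0.2, 1.0], 0), ([0.0, 0.8], 0),
]
# ... and only two labeled compounds from a new assay.
new_assay = [([0.9, 0.3], 1), ([0.2, 0.7], 0)]

pretrained = train([0.0, 0.0, 0.0], source_assays, steps=200)
fine_tuned = train(pretrained, new_assay, steps=20)
from_scratch = train([0.0, 0.0, 0.0], new_assay, steps=20)

print("fine-tuned loss:  ", log_loss(fine_tuned, new_assay))
print("from-scratch loss:", log_loss(from_scratch, new_assay))
```

With the same small training budget on the new assay, the fine-tuned model starts from weights that already separate actives from inactives, so it reaches a lower loss than the model trained from scratch. The effect depends on the new assay sharing structure with the earlier ones, which is built into this toy data.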

In Silico Performance

ZeptoNet’s performance is compared with the state of the art in three use cases:

  1. Intrinsic performance of ZeptoNet’s pretrained model: inference on known targets but new libraries of compounds

  2. Contribution of ZeptoNet’s model to performance when training on new bioassays

  3. Contribution of ZeptoNet’s model to accelerating active learning hit discovery on new bioassays.

Intrinsic ZeptoNet performance

State-of-the-art results, used as a basis of comparison with ZeptoNet’s performance, are achieved through multi-layer neural networks trained individually on each bioassay library. This baseline model uses fingerprints and chemical descriptors as compound features. The results are evaluated using the area under the Precision-Recall curve, normalised by the hit rate of each compound library. The graph below depicts a random selection of 52 of the bioassays used in training. Bioassays are ranked and compared on a logarithmic scale (the higher the better). In most cases ZeptoNet outperformed the state of the art by a wide margin.
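The metric above can be sketched as follows. Two assumptions are made that the paper does not spell out: the area under the Precision-Recall curve is estimated with average precision, and "normalised by the hit rate" is read as dividing by the fraction of hits in the library, which is roughly the value a random ranking would achieve. The function names are illustrative.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise estimate of the area under the precision-recall curve."""
    n_hits = sum(labels)
    if n_hits == 0:
        raise ValueError("at least one hit is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_so_far = 0
    total = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            hits_so_far += 1
            total += hits_so_far / rank  # precision at this recall step
    return total / n_hits


def normalised_pr_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision divided by the library hit rate (assumed normalisation)."""
    hit_rate = sum(labels) / len(labels)
    return average_precision(scores, labels) / hit_rate


# Illustrative library of 10 compounds with 2 hits (hit rate 0.2).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(average_precision(scores, labels))  # (1/1 + 2/3) / 2, about 0.833
print(normalised_pr_auc(scores, labels))  # about 4.17
```

Under this reading, a value near 1 means the model ranks no better than chance for that library, which makes bioassays with very different hit rates comparable on one scale.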

Leveraging ZeptoNet’s universal model for new targets

ZeptoNet’s pretrained universal model helps accelerate training and increases the performance of a model trained to predict the bioactivities of a new bioassay. The models are trained on 90% of the available data and evaluated on the remaining 10%, using the Enrichment Factor at various selections of top-ranking compounds.
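For readers unfamiliar with the Enrichment Factor, the sketch below uses its usual definition: the hit rate among the top-ranked fraction of compounds divided by the hit rate of the whole evaluated set. The function name, the rounding of the selection size and the example data are assumptions made for illustration.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float], labels: Sequence[int],
                      fraction: float) -> float:
    """Hit rate in the top `fraction` of the ranking over the overall hit rate."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("at least one hit is required")
    n_total = len(labels)
    n_selected = max(1, round(fraction * n_total))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_selected = sum(label for _, label in ranked[:n_selected])
    return (hits_selected / n_selected) / (total_hits / n_total)


# Illustrative held-out set: 20 compounds, 4 hits, scores strictly decreasing.
scores = [1.0 - i / 20 for i in range(20)]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 10

for fraction in (0.10, 0.50, 1.00):
    print(fraction, enrichment_factor(scores, labels, fraction))
```

Here the top 10% contains 2 hits out of 2 compounds against an overall hit rate of 0.2, giving an enrichment of 5. At 100% the factor is always 1, since the selection is the whole set.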

Leveraging ZeptoNet’s universal model for active learning

ZeptoNet’s pretrained universal model helps accelerate hit discovery in an active learning approach, allowing higher enrichment in early iterations than baseline models used in the same fashion. The active learning approach is evaluated by simulating the iterative training in silico. For each of the tested new targets, an initial random selection of 10% of the library is screened; each subsequent iteration then adds a 15% exploitation share drawn from the top-scoring compounds and a 5% exploration share drawn from the remainder of the library. Results are shown after the first and second iterations, using the Enrichment Factor on the total percentage of screened compounds at each iteration. They demonstrate both the capacity of an active learning approach to accelerate hit discovery and the contribution of ZeptoNet’s model to increased performance and a higher hit discovery rate compared with the state-of-the-art baseline.
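The simulated protocol above can be sketched as a short loop. Several details are assumptions: the 15% and 5% shares are taken as fractions of the full library, compounds are reduced to a single numeric feature, and the scoring model is a simple stand-in (closeness to the nearest known hit) rather than ZeptoNet or the baseline network. The `assay` function plays the role of the wet-lab measurement.

```python
import random
from typing import Callable, Dict, List, Set


def active_learning_screen(
    library: List[float],
    assay: Callable[[float], bool],
    iterations: int = 2,
    initial_rate: float = 0.10,
    exploit_rate: float = 0.15,
    explore_rate: float = 0.05,
    seed: int = 0,
) -> Dict[int, bool]:
    """Return {compound index: measured activity} for every screened compound."""
    rng = random.Random(seed)
    n = len(library)
    unscreened: Set[int] = set(range(n))
    results: Dict[int, bool] = {}

    def screen(batch: List[int]) -> None:
        for i in batch:
            results[i] = assay(library[i])
            unscreened.discard(i)

    # Initial trial: a random 10% of the library.
    screen(rng.sample(sorted(unscreened), round(initial_rate * n)))

    for _ in range(iterations):
        known_hits = [library[i] for i, active in results.items() if active]

        def score(i: int) -> float:
            # Stand-in model: compounds close to a known hit score higher.
            if not known_hits:
                return 0.0
            return -min(abs(library[i] - h) for h in known_hits)

        ranked = sorted(unscreened, key=lambda i: (-score(i), i))
        exploit = ranked[: round(exploit_rate * n)]      # top-scoring compounds
        rest = ranked[len(exploit):]
        explore = rng.sample(rest, min(len(rest), round(explore_rate * n)))
        screen(exploit + explore)

    return results


# Hypothetical library of 200 compounds; hits fall in a narrow feature window.
library_rng = random.Random(42)
library = [library_rng.random() for _ in range(200)]


def assay(x: float) -> bool:
    return 0.40 <= x <= 0.45


results = active_learning_screen(library, assay)
total_hits = sum(assay(x) for x in library)
print(f"screened {len(results)} of {len(library)} compounds, "
      f"found {sum(results.values())} of {total_hits} hits")
```

After the initial trial and two iterations, 10% + 2 x 20% = 50% of the library has been screened. Comparing the share of hits found with the share of compounds screened gives the enrichment reported at each iteration.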

Benefits

ZeptoNet offers the following key advantages for pharmaceutical and biotech companies:

1. Reducing the time and cost of drug discovery: by efficiently learning cross-domain, ZeptoNet requires relatively limited set-up and data to start generating results. This results in significant time and cost savings.

2. Reducing clinical failure rate: ZeptoNet can be used to predict unexpected side effects that might arise during advanced clinical trials.

3. Getting better results in terms of likely hits: by leveraging different data types and learning cross-domain, ZeptoNet can generate better predictions of likely hits.

4. Reducing the number of compounds that need to be physically screened: the physical screening of compounds can be an expensive, slow and wasteful process. Some bioassays in particular make use of nuclear material or rare metals such as gold, platinum, etc. In most cases, only an insignificant part of the materials can be recycled due to the risk of biological contamination.

5. Opening access to research on rare diseases: Due to the high cost of searching for drugs, many rare diseases go untreated. By significantly reducing part of the cost of the drug discovery process, ZeptoNet can help discover drugs for rare diseases.

6. Searching larger compound libraries: by being computationally inexpensive, ZeptoNet can be used to efficiently search very large compound libraries to discover novel hits.

Partnerships

Kantify is currently developing partnerships with pharmaceutical and biotech companies and academic teams willing to accelerate their drug discovery process. For more information, don’t hesitate to contact our CEO, Ségolène Martin (segolene@kantify.com) or our CTO, Nik Subramanian (nik@kantify.com).