Home Insights Data Science & AI Finding new COVID-19 treatments with graph convolutional networks (GCN)

Finding new COVID-19 treatments with graph convolutional networks (GCN)

Danilo Djekic

February 16, 2021 • 15 min read

Table of Contents

Problem statement
Data
Conclusion

It takes on average 10 years and $2 billion to develop a single FDA approved drug. This drug development pipeline is long and rigorous for a good reason, but it is also too slow to react to sudden global threats such as the Covid-19 outbreak.

An alternative to discovering a new drug from scratch is drug repurposing – finding drugs which have already passed clinical trials for treatments of other diseases, and which can also be effective in treating the new one.

This way, the drug’s safety risk is already known and well-studied, and all that’s left is to test its ability to treat a new disease.

But how do we know which known drugs can be repurposed?

In this article, we present a method for modeling and predicting interactions between drugs and different targets inside a human body. Finally, we use this model to find new potential drug candidates for the treatment of SARS-CoV-2 which have already passed clinical trials.

Covid-19 vaccines are already being distributed, but global vaccine distribution cannot be done in a day. And with all the difficulties of vaccine distribution, a sad reality is that hundreds of thousands of people worldwide will still suffer from SARS-CoV-2 virus.

Problem statement

We are not a chemists, nor a biologists, and you don’t need to be one either to understand this article. But a short domain introduction is needed to explain the problem at hand.

After a new disease is discovered, scientists first investigate the mechanisms in which this disease operates. Is it a bacteria? Is it a virus? What kind? How does it operate inside a human body? And finally, what causes the severe symptoms or perhaps death?

Once the disease is understood well enough, the next step towards finding a treatment is to identify a target.

Target – A molecule in the body, usually a protein, that is somehow associated with a particular disease and that could be affected by a drug to produce a desired therapeutic effect.

After identifying an appropriate target, the next step is to find compounds which can bind with this target inside the body. This is where things get especially challenging (as if anything before was easy) because scientists need to check which compounds (out of hundreds of millions of possible candidates) could bind with the desired target.

Compound (in this article) – A molecule which can be synthesized or collected from nature easily and cheaply enough to be mass produced and used as a drug.

Even though scientific intuition and previous research play a role here, this step is largely done through brute force. Simply put, they use robots to mix hundreds of thousands of different compounds at different concentrations with the target, and measure how well they bind. This brute force technique is called high-throughput screening.

High–throughput screening (HTS) – A drug discovery process that allows automated testing of large numbers of chemical and/or biological compounds for a specific biological target.

This is arguably the most expensive part of a drug discovery process, and is the one where machine learning can really step in.

The final measurement of how well a compound binds a target is called a dissociation constant, and is usually denoted Kd.

Kd (Dissociation constant) – A real numbered measurement used to represent the binding affinity between a drug and a target. (the lower the value – the greater the binding)

You can read more on how this value is calculated here.

An ideal drug candidate would be a compound which has a high binding affinity (low Kd) with the desired target, but low binding affinity (high Kd) with all other known biological targets. This minimizes the risk that the drug will interact with other targets and cause unwanted side effects.

Our task is to build a model which can predict the Kd value for compound-target pairs, and then make predictions on targets which are somehow associated with SARS-CoV-2.

Data

Drug-target interaction dataset is available at DrugTargetCommons. It contains drug and target pairs (marked with unique identifiers) and their corresponding Kd value.

To avoid predicting potentially large numbers, a more commonly used measurement is the natural logarithm of the Kd value (kdlog above).

Using compound_id as key, we can join our dataset with ChEMBL Database. This would give us access to the chemical representation for each of our compounds and information on how far in clinical studies this compound has reached.

Max Phase simply refers to the phase of clinical trials this compound has reached. It ranges from 0-4 (0- still in research phase; 4- approved for use in humans by FDA/EMEA). We will need this in order to filter only already approved drugs later on.

Chemical structure of the compound is denoted by a SMILES representation – a string representation of a compound’s molecular structure.

One might be tempted to use NLP techniques here and encode the smiles representation as they would a textual feature. I encourage you to try different approaches, however, there is an overwhelming consensus that smiles representation alone doesn’t capture the necessary information about molecular functional attributes.

So instead of tokenizing SMILES, it would be better to make use of the RDKit library.

RDKIt is an open-source chemoinformatics library in Python, which features many useful tools and functions for analyzing chemical data.

In our case, we can use SMILES to build the RDKit library’s Mol() object, and then use the FragFPGenerator() to break down our compounds into functional fragments- parts of the molecule associated with some chemical function.

After extracting 1700 distinct functional fragments, we can count how many times each functional fragment is present in each compound. This way each compound now has a 1700-dimensional feature vector, where each feature is an integer denoting the number of times a given functional fragment repeats itself in a molecule.

So let’s summarize what we have so far:

1. We have Kd values for distinct compound-target pairs

Number of compounds: 11555

Number of targets: 1286

Number of known pairs: 24582

2. We have a 1700-dimensional feature vector for each compound.

Methodology

An idea of taking the convolution operation, commonly used for image processing, and extending it to work on graphs has first been proposed by Defferrard et al.( NIPS 2016) and Kipf & Welling (ICLR 2017). More recently, Van den Berg & Kipf (2017) proposed the extension of this methodology for matrix completion problems. My method is inspired by the same general ideas, however it differs in many important regards. It would be helpful to start and describe the graph convolution concept from the ground up first. (NOTE: By “graph convolution” we will be referring to the message-passing variant of the methodology described by Kipf & Welling (ICLR 2017), rather than the spectral version of the graph convolution described by Defferrard et al.( NIPS 2016.) )

Let’s imagine a simple graph with 4 nodes, where each node has an associated real-numbered feature X:

Now let’s imagine a process in which each node in a graph is updated by the average of the features of the surrounding nodes. For example, a node in the top right corner of the image would be updated by the feature values from its neighboring nodes, as shown below:

The resulting number would be the average of the entire local neighborhood of the node. We can repeat this process for all nodes in the graph, thus finishing one message passing step.

This process can be repeated as many times as we like. Each time we do it, the information from all nodes in a graph gets slowly propagated towards all the other nodes.

In the resulting graphs, each node now encodes it’s local neighborhood in some way. In other words, if the node has a feature which somehow describes the node, after message-passing the resulting feature now describes the local neighborhood of the node.

This is where the analogy with standard CNNs becomes more obvious. In image processing, convolution filters allow us to extract information not only from individual pixels, but their local pixel neighborhood as well.

Much like in CNN, where convolutional layers are followed by few dense layers, the idea is that if we can somehow accumulate the information about the neighborhood of the nodes and feed this information to a dense neural network. Hope is that this neural network will be able to model patterns in the graph substructure.

Let’s demonstrate how this could be used for the problem of edge detection:

We have a graph, and our task is to predict whether two nodes are linked or not. This problem is commonly faced by social networks, for instance, where it’s useful to predict which friend to recommend to a user.

It would make sense to think that whether a user would like to be friends with another user depends not only on the descriptive features of the users, but also on what kind of friends these users have, i.e, their local graph neighborhood.

So in general, the probability of nodes A and B being linked is some function F of A’s features, B’s features, and their updated values after message passing:

So, following the logic of regular CNN, we aggregate all of this information by assigning trainable weights to each convolution output and feed the resulting vector as input to a simple dense neural network.

With this intuition in mind, let’s frame our drug -target interaction problem as an edge regression problem on a bipartite graph:

We can represent our drug-target interaction dataset as a bipartite graph where the length of an edge between a target and a compound corresponds to their Kd value (not properly represented on the image). On top of that, instead of having one-dimensional node features like in our toy graph before, each compound node has a 1700-dimensional feature vector, while our target nodes have no features at all.

Right of the bat, you may notice two differences between the toy example we examined before and our problem setup:

Only some nodes have features (compound nodes)
We are not just doing edge detection, but edge regression. Edge detection was predicting whether there is an edge between nodes or not. Edge regression is predicting the exact length of the edge.

Let’s visualize how graph convolutions would look like in this setup.

Since we only have features for one partition of nodes, the first message passing must go from compound nodes to target nodes.

After the first convolution step, each target node contains a vector XI which represents the average compound feature from all compounds connected to that target.

Let’s make another convolution step:

Now every compound node represents the averageof the averages of compounds connected to thetarget- and this can go on…

So far, we have been talking about arithmetic mean as the way to collect the information from neighboring nodes in a graph. But instead of a simple average, it would make sense to give more or less weight to messages from different nodes, depending on how strongly they are connected. Through experimentation, we determined that giving more weight to nodes that are connected with higher Kd values works best.

So the model pipeline for predicting the binding affinity between a compound c and a target t would look like this:

Run the graph convolution step n times
Aggregate all n graph representations for c and t
Feed these representations to a dense neural network and predict Kd

The convolution steps in matrix form look like this:

If A is the weighted adjacency matrix of our graph,

*A-adjacency matrix [n-compounds x n-targets]*

and X is the compound feature matrix,

*X-compound feature matrix [n-compounds x 1700]*

Convolution zero would simply be the features of each compound:

$$ \begin{aligned} \large Conv0 = X \end{aligned} $$

[n_compounds x n_features]

First convolution would be the average of all compound features in the neighborhood of each target, weighted by the Kd value between them:

$$ \begin{aligned} \large Conv1 = softmax _{dim = 0} (A)^ᵀX \end{aligned} $$

[n_targets x n_features]

And second convolution would be the average of the averages of compound features per target per feature, weighted by the Kd value between them:

$$ \begin{aligned} \large Conv2 = softmax_{dim = 1} (A)Conv1 \end{aligned} $$

[n_compounds x n_features]

As you can see by reading the dimensions of our convolutions, a compound node is only represented in 0th and 2nd convolutions, while a target node is only represented in the 1st.

This makes intuitive sense since. As we said, the 0th convolution is just compound features per compound; the first convolution is the average compound features per target; and the second convolution is the average average compound feature per target per compound. So a specific compound c can only be indexed in the 0th and 2nd convolutions, and a specific target t can be indexed in the 1st.

At last, the full convolutional layer for a compound c and a target t can be represented as shown:

$$
\begin{aligned}
\large tanh(Conv_{0[c]} W_0 + Conv_{1[t]} W_1 + Conv_{2[c]} W_2)
\end{aligned}
$$

What we end up with is a hidden state vector whose length depends on the shape of the parameter weights W. The resulting vector is then used as input to a fully connected linear layer after passing through dropout and batch normalization.

Model training

The validation split is done by randomly masking 3 interactions for all compounds which have more than 10 interactions in total.

Since an ideal drug candidate would be a compound which has a low Kd with the desired target, but high Kd with all other known biological targets, we are mostly interested in relative Kd values. For this reason, the chosen model evaluation metric will be coefficient of determination (R2 ):

$$
\begin{aligned}
\large R^2 = 1 – \frac {SS_{RES}}{SS_{TOT}} = 1 – \frac {\sum_i(y_i – \hat y_i)^2}{\sum_i(y_i – \bar y_i)^2}
\end{aligned}
$$

Which gives us the proportion of the variance of the true Kd value that our model can explain.

We train the model for 60 epochs with early stopping.

The final coefficient of determination on our validation set is 0.642. It helps to visualize the goodness of fit by comparing our model predictions to the random predictions of the same range.

Predictions on COVID-19 targets

Weisberg et. al. (2020) listed over 30 different kinase inhibitors which could have antiviral or pulmonary benefits in treatment of SARS-CoV-2.

Kinases are types of proteins, very commonly used as drug targets. In fact, all targets in our interaction dataset are kinases. Kinase inhibitors are drugs which bind and inhibit the activity of a kinase.

Source: Weisberg et. al. (2020)

The article also lists over 16 kinases whose inhibition is associated with either antiviral or pulmonary benefits, and we will use these targets to make predictions and find other potential drug candidates.

What we are interested in is finding compounds which would have a low Kd with a given target relative to it’s Kd with all other targets in our dataset. In other words, for each drug in our dataset which has an FDA approval (max phase = 4), we will make predictions on all targets in our dataset, standardize these predictions, and extract top five compounds with the lowest standardized score.

Kinase name	Inhibitor-associated activity	Top 5 compounds with lowest standardized scores
AAK1	Anti-inflammatory and cytokine suppression	Sumatriptan, Tolterodine, Ethinyl estradiol, Estradiol, Zolmitriptan
‘TYK2’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Docetaxel’, ‘Mometasone furoate’, ‘Digoxin’, ‘Indinavir’, ‘Estrone’
‘ABL2’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Bosentan’, ‘Terazosin’, ‘Sulfamethoxazole’, ‘Mirtazapine’, ‘Doxazosin’
‘AXL’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Sumatriptan’, ‘Alitretinoin’, ‘Terbutaline’, ‘Theophylline’, ‘Salmeterol’
‘YES’/’YES1’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Hydrocortisone hemisuccinate’, ‘Ethinyl estradiol’, ‘Alfuzosin’, ‘Hydrocortisone’, ‘Docetaxel’
‘CSK’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Estradiol benzoate’, ‘Hydrocortisone’, ‘Estradiol’, ‘Ethinyl estradiol’, ‘Deoxycholic acid’
‘EGFR’	Antifibrotic	‘Bosentan’, ‘Sulfamethoxazole’, ‘Acebutolol’, ‘Mycophenolic acid’, ‘Tropisetron’
‘FYN’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Bosentan’, ‘Ketanserin’, ‘Ethinyl estradiol’, ‘Hydrocortisone hemisuccinate’, ‘Docetaxel’
‘GAK’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Lovastatin’, ‘Dorzolamide’, ‘Dexamethasone’, ‘Mometasone furoate’, ‘Brinzolamide’
‘JAK1’	Anti-inflammatory and cytokine suppression	‘Dexamethasone’, ‘Maraviroc’, ‘Mometasone furoate’, ‘Hydrocortisone hemisuccinate’, ‘Tropisetron’
‘JAK2’	Anti-inflammatory and cytokine suppression	‘Estrone’, ‘Tolfenamic acid’, ‘Sumatriptan’, ‘Ethinyl estradiol’, ‘Indinavir’
‘JAK3’	Anti-inflammatory and cytokine suppression	‘Estrone’, ‘Indinavir’, ‘Ethinyl estradiol’, ‘Docetaxel’, ‘Nelfinavir’
‘KIT’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Sumatriptan’, ‘Indomethacin’, Loratadine’, ‘Yohimbine’, ‘Dopamine’
‘LYN’	Anti-inflammatory, cytokine suppression, antifibrotic	‘Doxorubicin’, ‘Bergapten’, ‘Mycophenolic acid’, ‘Alfuzosin’, ‘Terbutaline’
‘RET’	Anti-inflammatory and cytokine suppression	‘Ethinyl estradiol’, ‘Estrone’, ‘Estradiol benzoate’, ‘Fenoldopam’, ‘Mianserin’
‘SRC’	Antifibrotic	‘Salmeterol’, ‘Paroxetine’, ‘Nalmefene’, ‘Etoposide’, ‘Podofilox’s

We can see that some of the drugs show up more than once, these would potentially be the most promising treatments since they could serve as multi-target inhibitors.

Ethinyl estradiol

Docetaxel

Sumatriptan

Estrone

Indinavir

Bosentan

Mometasone furoate

Hydrocortisone hemisuccinate

A very encouraging fact is that a lot of the findings are already proposed or corroborated in the literature. This is encouraging because it validates the results we obtained. However the most interesting findings are those for which there is no existing reference in relation with SARS-CoV-2. One of these appears to be Docetaxel which is a known chemotherapy medication used to treat a number of different types of cancer. The use of docetaxel in cancer treatments makes this candidate particularly interesting for cancer patients who contracted SARS-CoV-2, since they represent a group with high morbidity risk. While there is no direct reference linking Docetaxel to SARS-CoV-2, Derosa et.al. (Nature Cancer 2020) mention cancer therapies which also act as JAK1/2 inhibitors, as potentially useful in treating SARS-CoV-2 which is exactly what our model linked Docetaxel to.

Conclusion

The purpose of this project, aside from screening for potential SARS-CoV-2 treatments, is to test a new method of in silico High-Throughput screening (or prescreening we should say). It was refreshing to see that our model came up with predictions previously suggested by medical literature, and even more exciting to see some new, previously unseen, drug suggestions.

We believe this alone proves that methods such as the one used above, have the potential to improve the speed and efficiency of finding new drug leads, thus substantially reducing the cost of drug research and development.

The variant of a graph neural network we used is inductive in nature, meaning it can be used to make predictions on targets and compounds unseen during training without repeating the training process . Other methods used for similar matrix completion problems, such as matrix factorization methods, need to be retrained every time a new drug or target is introduced to a dataset.

In conclusion, in silico drug screening using machine learning is bound to become an industry standard in the future. Methods, like the one described above, have the potential to directly speed up the drug development pipeline, but also lower its cost and make the resulting treatments more affordable.

References:

1. Jarmoskaite, I., AlSadhan, I., Vaidyanathan, P. and Herschlag, D., 2020. How to measure and evaluate binding affinities. eLife, 9.

2. DrugTargetCommons 2021. Home. [online] Available at: <https://drugtargetcommons.fimm.fi/> [Accessed 8 February 2021].

3. Ebi.ac.uk. 2021. ChEMBL Database. [online] Available at: <https://www.ebi.ac.uk/chembl/> [Accessed 8 February 2021].

4. Rdkit.org. 2021. Python API Reference — The RDKit 2020.09.1 documentation. [online] Available at: <https://www.rdkit.org/docs/api-docs.html> [Accessed 8 February 2021].

5. Defferrard M., Bresson X., Vandergheynst P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering arXiv:1606.09375

6. Kipf T., Welling M. 2017 Semi-Supervised Classification with Graph Convolutional Networks arXiv:1609.02907

7. Berg R., Kipf T., Welling M. 2017. Graph Convolutional Matrix Completion arXiv:1706.02263

8. Weisberg, E., Parent, A., Yang, P., Sattler, M., Liu, Q., Liu, Q., Wang, J., Meng, C., Buhrlage, S., Gray, N. and Griffin, J., 2020. Repurposing of Kinase Inhibitors for Treatment of COVID-19. Pharmaceutical Research, 37(9).

9. Algara A. 2021. Estrogen Therapy in Non-severe COVID-19 Patients [online] Available at: <https://clinicaltrials.gov/ct2/show/NCT04539626/> [Accessed 8 February 2021]

10. Seeland, U., Coluzzi, F., Simmaco, M., Mura, C., Bourne, P., Heiland, M., Preissner, R. and Preissner, S., 2020. Evidence for treatment with estradiol for women with SARS-CoV-2 infection. BMC Medicine, 18(1).

11. Arca, K. and Starling, A., 2020. Treatment-Refractory Headache in the Setting of COVID-19 Pneumonia: Migraine or Meningoencephalitis? Case Report. SN Comprehensive Clinical Medicine, 2(8), pp.1200-1203.

12. Li, Y., Jerkic, M., Slutsky, A. and Zhang, H., 2020. Molecular mechanisms of sex bias differences in COVID-19 mortality. Critical Care, 24(1).

13. Sang, P., Tian, S., Meng, Z. and Yang, L., 2020. Anti-HIV drug repurposing against SARS-CoV-2. RSC Advances, 10(27), pp.15775-15783.

14. Hall, D. and Ji, H., 2020. A search for medications to treat COVID-19 via in silico molecular docking models of the SARS-CoV-2 spike glycoprotein and 3CL protease. Travel Medicine and Infectious Disease, 35, p.101646.

15. Chang, Y., Tung, Y., Lee, K., Chen, T., Hsiao, Y., Chang, H., Hsieh, T., Su, C., Wang, S., Yu, J., Shih, S., Lin, Y., Lin, Y., Tu, Y., Hsu, C., Juan, H., Tung, C. and Chen, C., 2020. Potential Therapeutic Agents for COVID-19 Based on the Analysis of Protease and RNA Polymerase Docking.

16. Halder, U., 2020. Predicted antiviral drugs Darunavir, Indinavir and Rimantadine can potentially bind to neutralize COVID-19 conserved proteins.

17. Badagliacca, R., Sciomer, S. and Petrosillo, N., 2020. Endothelin receptor antagonists for pulmonary arterial hypertension and COVID-19: Friend or foe?. The Journal of Heart and Lung Transplantation, 39(7), pp.729-730.

18. Docherty, A., Harrison, E., Green, C., Hardwick, H., Pius, R., Norman, L., Holden, K., Read, J., Dondelinger, F., Carson, G., Merson, L., Lee, J., Plotkin, D., Sigfrid, L., Halpin, S., Jackson, C., Gamble, C., Horby, P., Nguyen-Van-Tam, J., Ho, A., Russell, C., Dunning, J., Openshaw, P., Baillie, J. and Semple, M., 2020. Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study. BMJ, p.m1985.

19. Scandinavian Critical Care Trials Group 2021. Hydrocortisone for COVID-19 and Severe Hypoxia (COVID STEROID) [online] Available at: <https://clinicaltrials.gov/ct2/show/NCT04348305/> [Accessed 8 February 2021]

20. Derosa, L., Melenotte, C., Griscelli, F. et al. The immuno-oncological challenge of COVID-19. Nat Cancer 1, 946–964 (2020). https://doi.org/10.1038/s43018-020-00122-3

Finding new COVID-19 treatments with graph convolutional networks (GCN)

Problem statement