Machine Learning for DNA Encoded Libraries
Our February Discovery Series discussion group centred on DNA Encoded Libraries (DEL). Using DEL enables the simultaneous screening of thousands of potential drugs against a target. DEL can help identify chemicals that affect the target (hits) and identify chemicals that have therapeutic potential, flagging them for future research.
Oxford Global’s discussion groups bring together a select group of 15-20 key industry leaders for approximately one hour for in-depth knowledge sharing and conversation. Taking the lead and presenting on machine learning applications for DEL was Joe Franklin (VP Chemistry, Anagenex). Joe is an experienced pioneer in DNA Encoded Library chemistry, having worked on, built, and led DEL teams for over 15 years.
The standard process in DEL screening is to make or acquire libraries, then screen those libraries against a protein of interest. At the end of that process, you elute your molecules and move to PCR and sequencing. The next stage is data analysis and deciding what to do with the information gathered. From that analysis, a set of compounds can be selected and tested. If the tests are successful, this leads to the ‘hit to lead’ stage of drug development.
DEL Discovery: Library Construction and Selection
Several methods can be used to build a DNA-encoded library. The most common approach uses DNA as a barcode. Attaching a short piece of DNA to an organic functional group allows ‘building blocks’ to be easily identified. Joe explains that “the barcode is only a recipe; it’s not actually a compound. You need to have quality protein. If you don’t have quality protein, for example, if it’s aggregated or misfolded, you’re going to find compounds that bind to it but are not going to give you the function you want.”
If there is a very specific desired mode of action, it’s vital to understand how conditions can impact the protein. Joe makes the point that you need to “make sure your buffers are suitable. If you need a cofactor, make sure everything in your selection is considered against the protein. You may need blocking reagents or other approaches to prevent nonspecific DNA binding. And then, mode of action. There are ways to make selections for binding, and there are ways to make selections for function. If you’re building a DEL programme, you want to think about which one of those is more important.
“At Anagenex, we put a million copies of every DEL molecule into our selection, and we do multiple selection conditions for every protein. But every selection condition gets a million molecules of every library. So, our goal is to get a really clear picture of what’s going on and ensure our signal is way above our background.”
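The point about keeping signal well above background can be illustrated with a simple enrichment calculation. The sketch below is not Anagenex's analysis pipeline. It assumes hypothetical sequencing read counts for a target selection and a no-target control, converts each to a frequency, and reports the ratio. The compound names, counts, and pseudocount are all invented for illustration.

```python
def enrichment(target_counts, control_counts, pseudocount=1.0):
    """Ratio of each compound's read frequency in the target selection
    to its frequency in the no-target control (hypothetical data)."""
    compounds = set(target_counts) | set(control_counts)
    n = len(compounds)
    # The pseudocount avoids division by zero for compounds unseen in one sample.
    target_total = sum(target_counts.values()) + pseudocount * n
    control_total = sum(control_counts.values()) + pseudocount * n
    scores = {}
    for name in compounds:
        target_freq = (target_counts.get(name, 0) + pseudocount) / target_total
        control_freq = (control_counts.get(name, 0) + pseudocount) / control_total
        scores[name] = target_freq / control_freq
    return scores


# Invented read counts: cpd_A is strongly enriched against the target.
target = {"cpd_A": 480, "cpd_B": 12, "cpd_C": 8}
control = {"cpd_A": 10, "cpd_B": 11, "cpd_C": 9}

scores = enrichment(target, control)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], round(scores[ranked[0]], 2))
```

A ratio above 1 means a compound is over-represented in the target selection relative to the control. Real analyses typically add replicate handling and statistical testing on top of a ratio like this.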
DEL & Machine Learning
Once you have the selection data and hits that you want to interrogate, the question is, how do you introduce machine learning to DEL?
Joe explains that “there are many ways to use machine learning, but there’s two that I’m really keen on. The first is ensuring the best way to spend building block budgets. Some of us have large building block budgets, others have small building block budgets, and how to spend that is often a challenge.
“Suppose you took a simple case, like amines, where tens of thousands are commercially available. You could cluster these amines, pick a minimal set of amines that covers maximal diversity from the commercial catalogue, validate it in your reactions, and then use machine learning on the output of that validation to predict what other amines to buy. So, for example, if you buy 100 building blocks, and 50 of them pass, then use that information for machine learning to then predict what other amines to buy with an 80% success rate, you’re spending money much better.”
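As a rough illustration of picking a small, diverse validation panel, the sketch below uses greedy MaxMin selection over Tanimoto distance. This is one common diversity-picking technique and is swapped in here as an assumption; the talk does not say which clustering or selection method is used. The amine names and substructure flags are hypothetical, and a real workflow would use proper chemical fingerprints. The final lines reproduce the arithmetic from the quoted example of 100 purchases at a 50% versus an 80% pass rate.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_pick(features, k, seed_name):
    """Greedy MaxMin: repeatedly add the candidate whose nearest
    already-picked neighbour is farthest away."""
    picked = [seed_name]
    while len(picked) < min(k, len(features)):
        candidates = [n for n in sorted(features) if n not in picked]
        best = max(
            candidates,
            key=lambda n: min(1 - tanimoto(features[n], features[p]) for p in picked),
        )
        picked.append(best)
    return picked


# Hypothetical amines described by invented substructure flags.
amines = {
    "amine_1": {"primary", "aliphatic"},
    "amine_2": {"primary", "aliphatic", "cyclic"},
    "amine_3": {"secondary", "aromatic"},
    "amine_4": {"secondary", "aromatic", "halogen"},
    "amine_5": {"primary", "heteroaryl", "halogen"},
}
panel = maxmin_pick(amines, k=3, seed_name="amine_1")
print(panel)


def expected_passes(n_bought, pass_percent):
    """Number of purchased building blocks expected to pass validation."""
    return n_bought * pass_percent // 100


baseline = expected_passes(100, 50)  # unguided purchase from the quoted example
guided = expected_passes(100, 80)    # model-guided purchase from the quoted example
print(baseline, guided)
```

The validation results for the picked panel (pass or fail in each reaction) would then become the training labels for a model that ranks the rest of the catalogue.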
The second application is hit finding. A machine learning model is trained on data from a DEL screen and can then predict which molecules, from the billions available in commercial catalogues, might engage the target of interest. The main benefit is access to chemical diversity that does not exist in your DEL. This approach is now widely used, but Anagenex differentiates itself by introducing a second stage of machine learning.
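The idea of training on screening data and then ranking a commercial catalogue can be sketched with a very simple model. The example below uses a k-nearest-neighbour scorer over Tanimoto similarity as a stand-in. This is an assumption made for illustration, not the model described in the talk, and every compound, feature, and label is invented.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def knn_score(query, training, k=2):
    """Mean label (1 = enriched, 0 = not) of the k most similar training compounds."""
    neighbours = sorted(
        training, key=lambda row: tanimoto(query, row[0]), reverse=True
    )[:k]
    return sum(label for _, label in neighbours) / len(neighbours)


# Hypothetical screening results: (feature set, enriched against the target?)
screen_data = [
    ({"amide", "pyridine", "fluoro"}, 1),
    ({"amide", "pyridine"}, 1),
    ({"ester", "phenyl"}, 0),
    ({"ester", "alkyl"}, 0),
]

# Hypothetical catalogue compounds, including a feature ("chloro")
# that never appears in the screened library.
catalogue = {
    "cat_001": {"amide", "pyridine", "chloro"},
    "cat_002": {"ester", "phenyl", "alkyl"},
    "cat_003": {"amide", "phenyl"},
}

ranked = sorted(
    catalogue,
    key=lambda name: knn_score(catalogue[name], screen_data),
    reverse=True,
)
print(ranked)
```

The top-ranked catalogue compound here carries a substituent absent from the training set, which mirrors the stated benefit of reaching chemistry that the library itself does not contain.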
ML1 and ML2
Joe explains that “after we predict from commercially available compound sets, we also use that model to predict if we have a certain library or a chemical space that the model thinks the target will react well to. We use the machine learning model to predict what that library would be; we then make that library and select it.”
Joe continues by describing how Anagenex decides on its second set of compounds: “We create a new model based on the data from this evolved library, and we use these two data streams to train a new model. We predict from these large compound sets, then test that data. We call the first set of compounds we buy from the first exercise ML1 and the second set of compounds ML2. This process takes about four months; the slowest part is getting these compounds. It usually takes close to a month to get each of the two compound sets. But that is still fast for getting high-quality molecules that are potent to target. I want to stress that we don’t do this for all of our programmes. But when we have done it, and it’s been multiple times, it works really well. And so, what I want to impress upon you is that at each stage of the process, the molecules get better.”
The industry-wide hit rate for DEL programmes that use machine learning to predict hits from commercially available compound libraries is around 20%. Using this second stage of machine learning, Anagenex’s hit rate went up to 58%, with increased diversity of chemical matter.
Final Thoughts and Conclusion
At Oxford Global, we could not have been more pleased with the turnout for our DNA encoded library discussion group. Joe’s presentation was illuminating and created an excellent basis for further discussion. For more on this subject, you can read about Dr Iolanda Micco’s (Associate Director of Chemistry & Alliances at Vipergen) work on using DEL screening with living cells here.