There has been substantial work in the computational protein space since 2021. We are starting from first principles to build towards our vision.
We might independently arrive at conclusions already drawn by other teams; we consider that strong positive reinforcement rather than a negative data point.
We need to figure out a few key components:
-
🎯One representation learning method to rule them all 💡:
Current SOTA methods still can't produce inclusive protein representations: ones that incorporate every facet of a protein (sequence, structure, temporal dynamics, post-translational modifications, single-point mutations, bound ligands/substrates, protein complexes, etc.) into one representation learning scheme.
This limits us to extrapolating knowledge from one domain, or a small set of domains, at a time.
We are investigating whether the Joint Embedding Architecture (popularized by Yann LeCun) is the way forward.
Our first experiment is simply to integrate sequence and structure, since both modalities are quite standardized at this point.
Our work: https://github.com/atom-51/jespr.
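To make the idea concrete, here is a minimal PyTorch sketch of what a joint sequence-structure embedder could look like. The encoder modules, dimensions, and names are hypothetical placeholders rather than the actual components in the jespr repo: in practice the sequence encoder might be a protein language model and the structure encoder a graph network over backbone coordinates.

```python
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Minimal joint-embedding sketch: two modality encoders, one shared space.

    `seq_encoder` and `struct_encoder` are stand-ins, e.g. a protein
    language model for sequence and a GNN over backbone coordinates for
    structure, each emitting one fixed-size vector per protein.
    """

    def __init__(self, seq_encoder, struct_encoder,
                 seq_dim=1280, struct_dim=512, joint_dim=512):
        super().__init__()
        self.seq_encoder = seq_encoder
        self.struct_encoder = struct_encoder
        # Linear projections map each modality into the shared latent space.
        self.seq_proj = nn.Linear(seq_dim, joint_dim)
        self.struct_proj = nn.Linear(struct_dim, joint_dim)

    def forward(self, seq_batch, struct_batch):
        z_seq = self.seq_proj(self.seq_encoder(seq_batch))
        z_struct = self.struct_proj(self.struct_encoder(struct_batch))
        # L2-normalize so agreement between modalities is a simple dot product.
        return F.normalize(z_seq, dim=-1), F.normalize(z_struct, dim=-1)
```

A contrastive objective (next point) can then pull the two views of the same protein together in that shared space.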
-
One pre-training objective to bring them all 🤝:
We need to find a pre-training objective for protein learning that is as powerful as next-word prediction is for language.
Our current bet is on a contrastive loss between protein representations in different modalities.
This also ties into our first point on how to learn more inclusive representations.
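As one illustration of that bet (not a claim about our final objective), the standard symmetric InfoNCE loss used in CLIP-style training could look like this, assuming the L2-normalized per-modality embeddings from the sketch above:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(z_seq, z_struct, temperature=0.07):
    """CLIP-style contrastive loss between two modality embeddings.

    z_seq, z_struct: (batch, dim) L2-normalized tensors, where row i of
    each tensor comes from the same protein, so the positive pairs sit on
    the diagonal of the similarity matrix.
    """
    logits = z_seq @ z_struct.t() / temperature  # (batch, batch) cosine sims
    targets = torch.arange(z_seq.size(0), device=z_seq.device)
    # Each sequence must pick out its own structure, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```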
-
And in the light🔥, one protein-landscape mapping to bind them⚔️:
Target discovery is absolutely crucial.
We can now sequence nearly any protein and predict its 3D structure fairly reliably.
By adding a few key pieces, such as spatial transcriptomics and proteomics data to infer cellular localization, together with interaction prediction, we should be able to start accurately mapping the Gene Ontology landscape.
We're still exploring how best to approach this problem.
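Purely as an assumption-laden sketch (not a settled approach), one toy formulation to start from is multi-label GO term prediction on top of learned protein embeddings. Since each protein carries many GO annotations at once, independent sigmoid outputs fit better than a softmax over terms; the dimensions and names below are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class GOTermHead(nn.Module):
    """Hypothetical multi-label head: GO-term logits from a protein embedding.

    `num_go_terms` would come from whatever GO subset is being mapped;
    the value here is a placeholder.
    """

    def __init__(self, embed_dim=512, num_go_terms=10_000):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_go_terms)

    def forward(self, z):
        return self.classifier(z)  # raw logits, one per GO term

def go_loss(logits, go_labels):
    # Each protein can carry many annotations simultaneously, so treat every
    # GO term as an independent binary decision rather than a softmax class.
    return F.binary_cross_entropy_with_logits(logits, go_labels.float())
```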