Back to Blog January 25, 2025

Building Protein Heaven: A Research-Grade Molecular Folding Platform

Building an open, AI-accelerated molecular folding simulator designed to push beyond current distributed computing efforts and advance biological research.

Building Protein Heaven: A Research-Grade Molecular Folding Platform

Molecular folding is one of the most computationally demanding problems in biology. Understanding how proteins, DNA, and RNA fold into their functional three-dimensional structures has implications for drug discovery, disease treatment, and synthetic biology. Projects like Folding@home have made incredible progress by harnessing distributed computing power from volunteers worldwide. But I believe we can do better.

Protein Heaven is my attempt to build a next-generation molecular folding platform that combines AI-driven simulation with distributed computing, designed from the ground up to be research-grade and community-powered.

The Vision

Current distributed folding projects rely heavily on traditional molecular dynamics, which simulates physics at the atomic level. This is accurate but computationally brutal. Recent advances in AI (like DeepMind’s AlphaFold and Meta’s ESMFold) have shown that neural networks can predict protein structures directly, but they treat folding as a static prediction problem rather than simulating the dynamic process itself.

What if we combined both approaches? Use AI not just to predict final structures, but to accelerate the simulation itself. This is the direction projects like neural network potentials and equivariant message passing networks are heading. I want to take it further.

The goal is ambitious: build a platform that can simulate molecular folding faster and more accurately than existing solutions, starting with proteins but expanding to DNA and RNA. And critically, make it open and distributed so researchers and volunteers worldwide can contribute computational power to advance science.

Why Build This?

Folding@home is incredible, but there’s room for innovation. Their architecture was designed over two decades ago. Modern AI techniques, GPU computing paradigms, and distributed systems have evolved dramatically. A ground-up redesign can leverage these advances.

AI can do more than predict. Projects like NeuralMD and TorchMD-Net are showing that neural networks can learn force fields that are orders of magnitude faster than quantum mechanical calculations while maintaining high accuracy. I’m building on these foundations.

The research community needs better tools. Many labs lack the computational resources to run large-scale simulations. A community-powered platform could democratize access to research-grade molecular dynamics.

Beyond proteins. DNA and RNA folding are equally important for understanding gene regulation, RNA therapeutics, and nucleic acid nanotechnology. Most current tools focus narrowly on proteins. Protein Heaven is designed to be molecule-agnostic.

The Architecture

AI-Accelerated Force Fields

Traditional force fields like AMBER and CHARMM use hand-tuned parameters derived from experiments and quantum calculations. They’re accurate but slow. I’m training neural network potentials that learn directly from high-accuracy quantum mechanical data:

class NeuralForceField(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = EquivariantGraphNetwork(
            hidden_dim=256,
            num_layers=6,
            cutoff=5.0  # Angstroms
        )
        self.energy_head = nn.Linear(256, 1)
        self.force_head = ForceOutput()  # Computes forces as energy gradients

    def forward(self, positions, atom_types, batch):
        # Equivariant encoding preserves rotational symmetry
        node_features, edge_features = self.encoder(positions, atom_types)

        # Energy prediction
        energy = self.energy_head(node_features.sum(dim=0))

        # Forces as negative gradient of energy
        forces = -torch.autograd.grad(energy, positions, create_graph=True)[0]

        return energy, forces

Early benchmarks show 100x speedup over DFT calculations with comparable accuracy for the systems I’ve tested.

Adaptive Sampling with Learned Guidance

The hardest part of molecular simulation isn’t computing forces; it’s sampling the vast conformational space efficiently. Most simulations waste compute exploring irrelevant configurations. I’m training a guidance network that predicts which regions of conformational space are most likely to contain important transitions:

class FoldingGuide(nn.Module):
    """Predicts promising directions for conformational exploration"""
    def __init__(self):
        super().__init__()
        self.structure_encoder = SE3Transformer(num_layers=4)
        self.sequence_encoder = ProteinBERT()
        self.transition_predictor = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 3 * MAX_RESIDUES)  # Displacement vectors
        )

    def forward(self, coords, sequence, current_energy):
        struct_features = self.structure_encoder(coords)
        seq_features = self.sequence_encoder(sequence)
        combined = torch.cat([struct_features, seq_features, current_energy], dim=-1)
        suggested_moves = self.transition_predictor(combined)
        return suggested_moves.reshape(-1, 3)

Distributed Computing Framework

For the distributed component, I’m building a work distribution system that can split simulations across thousands of contributors:

class DistributedSimulation:
    def __init__(self, molecule, num_replicas=1000):
        self.molecule = molecule
        self.replicas = self.initialize_replicas(num_replicas)
        self.coordinator = ReplicaExchangeCoordinator()

    def distribute_work(self, available_nodes):
        """Assign simulation chunks to contributor nodes"""
        work_units = []
        for replica in self.replicas:
            work_unit = WorkUnit(
                replica_id=replica.id,
                current_state=replica.state,
                steps_requested=10000,
                checkpoint_interval=1000
            )
            work_units.append(work_unit)

        return self.coordinator.assign(work_units, available_nodes)

    def aggregate_results(self, completed_units):
        """Collect results and perform replica exchange"""
        for unit in completed_units:
            self.replicas[unit.replica_id].update(unit.final_state)

        # Attempt exchanges between replicas at different temperatures
        self.coordinator.attempt_exchanges(self.replicas)

The system is designed to be fault-tolerant (nodes can drop out without losing progress), verifiable (results are validated before being accepted), and efficient (minimal data transfer between coordinator and contributors).

Current Progress

I’ve completed the core simulation engine and AI force field training pipeline. Current capabilities:

Proteins up to 500 residues with full atomic detail
DNA/RNA support in beta testing
GPU acceleration achieving 10M timesteps/day on consumer hardware
Neural force fields trained on 50,000 DFT calculations

Validation against experimental structures (from PDB and recent CASP targets) shows competitive accuracy with state-of-the-art methods, with significant speed advantages for longer simulations.

The Road Ahead

Phase 1 (Current): Core engine development, AI model training, validation against experimental data.

Phase 2: Launch distributed computing client. Volunteers can download the client and contribute spare GPU/CPU cycles to ongoing research simulations.

Phase 3: Open research platform. Researchers can submit molecules for simulation, access results, and contribute improvements to the codebase and models.

Phase 4: Expand to larger systems (protein complexes, membrane proteins, chromatin) and longer timescales (millisecond dynamics).

The protein folding problem isn’t solved. AlphaFold predicts structures brilliantly, but understanding the folding process (how misfolding causes disease, how to design proteins that fold reliably, how nucleic acids regulate gene expression through structural changes) requires simulation. Protein Heaven is my contribution to that frontier.

Get Involved

Protein Heaven will be fully open source. I’m looking for:

Contributors interested in molecular simulation, ML for science, or distributed systems
Researchers who want to test the platform on their systems of interest
Compute donors who want to contribute to scientific research

If you’re interested in pushing the boundaries of what’s possible in molecular simulation, let’s connect. This is too big to build alone.

Building research infrastructure for the scientific community. Reach out if you want to be part of this.