January 20, 2025

GPU Workload Distribution Patterns

From crypto mining to AI training: strategies for efficiently distributing ML workloads across a homelab GPU cluster.

I have a confession: my GPU cluster started as a crypto mining operation.

Back when Ethereum mining was profitable, I built a sizeable homelab filled with NVIDIA RTX 30-series GPUs: 3080s and 3090s stacked in custom rigs running 24/7. It was loud, it was hot, and it printed money. Then the Ethereum Merge happened, proof-of-stake replaced proof-of-work, and suddenly I had a room full of expensive hardware with nothing to mine.

I switched some holdings to staking, but the GPUs? They needed a new purpose. That’s when I pivoted to AI training, and I haven’t looked back since.

From Mining to Machine Learning

The transition wasn’t as smooth as I expected. Mining is embarrassingly parallel: every GPU does the same thing independently. AI training is different. You need to coordinate GPUs, synchronize gradients, manage memory carefully, and handle the complexity of distributed systems.

My homelab now runs PyTorch, TensorFlow, and Axolotl for fine-tuning large language models. I’ve experimented with dozens of open-source models: Dolphin by Eric Hartford, gpt-oss-120b, Llama variants, Mistral, and many others. The 3090s have 24GB of VRAM each, which is enough to fine-tune 7B parameter models comfortably and run inference on much larger ones with quantization.
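
As a rough check on those VRAM figures, the weights-only footprint of a model is parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate under that assumption. The helper names `weight_memory_gib` and `min_gpus_for_weights` are made up for illustration, and the estimate deliberately ignores activations, optimizer state, and KV cache, all of which add to the real requirement.

```python
import math

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory in GiB (hypothetical helper, ignores everything else)."""
    return num_params * bits_per_param / 8 / 2**30

def min_gpus_for_weights(num_params: float, bits_per_param: int,
                         vram_gib: float = 24.0) -> int:
    """Lower bound on the number of cards needed just to hold the weights."""
    return math.ceil(weight_memory_gib(num_params, bits_per_param) / vram_gib)

if __name__ == "__main__":
    for name, params, bits in [("7B @ 16-bit", 7e9, 16), ("70B @ 4-bit", 70e9, 4)]:
        gib = weight_memory_gib(params, bits)
        print(f"{name}: {gib:.1f} GiB, at least {min_gpus_for_weights(params, bits)} card(s)")
```

By this estimate a 7B model in 16-bit precision needs about 13 GiB for weights alone, which fits on one 24GB card, while a 70B model quantized to 4 bits needs roughly 33 GiB and has to span at least two cards.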

Why I Train My Own Models

The most interesting project I’ve worked on: training an AI on my own code.

I fed years of my personal projects, coding style, comments, and commit messages into a fine-tuning pipeline. The goal was to create a model that understands how I think about code, my naming conventions, my architectural preferences, my typical patterns. The result is a coding assistant that feels like it actually knows me.

I’ve also been experimenting with AI agents that can play video games, learning from gameplay footage and reward signals. It’s a fascinating application of reinforcement learning, and having local GPU power means I can iterate quickly without cloud costs eating into experiments.

The Heterogeneity Problem

Even in my homelab, not every GPU is identical. I have a mix of 3080s and 3090s, different memory capacities, and varying thermal characteristics (some cards throttle more than others under sustained load). Your distribution strategy needs to handle this heterogeneity gracefully.

In production environments, this gets even messier:

Mixed generations (V100s alongside A100s)
Different memory capacities (16GB, 32GB, 80GB)
Varying interconnects (PCIe vs NVLink)
Spot/preemptible instances that disappear without warning

Pattern 1: Data Parallelism

The simplest and most common pattern. Each GPU holds a complete copy of the model and processes different batches of data. Gradients are averaged across GPUs during the backward pass, so every copy applies the same update.

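import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    # Expects MASTER_ADDR and MASTER_PORT in the environment (torchrun sets them)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size, model, dataset, epochs=1):
    setup(rank, world_size)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])
    # Optimizer choice and learning rate are placeholders
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Each rank sees a disjoint shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Reshuffle differently each epoch
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(batch.to(rank))  # Assumes the model returns a scalar loss
            loss.backward()  # Gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()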

When to use: Models that fit in single GPU memory. Most computer vision and NLP fine-tuning tasks. This is my go-to for fine-tuning 7B models with Axolotl.

Watch out for: Communication overhead with large models. The all-reduce moves a full set of gradients every step, so its cost scales with parameter count, not with batch size or dataset size.
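
To put a number on that overhead, here is a small estimate that assumes a ring all-reduce, in which each GPU sends roughly 2(N-1)/N times the gradient size per step. Whether the backend actually picks a ring algorithm depends on its configuration, so treat this as an approximation. The function name is hypothetical.

```python
def ring_allreduce_bytes_per_gpu(num_params: float, bytes_per_param: int,
                                 world_size: int) -> float:
    """Approximate bytes each GPU sends per step under a ring all-reduce.

    Assumes every parameter has a gradient that must be synchronized.
    """
    if world_size < 2:
        return 0.0  # A single GPU has nothing to synchronize
    return 2 * (world_size - 1) / world_size * num_params * bytes_per_param

if __name__ == "__main__":
    sent = ring_allreduce_bytes_per_gpu(7e9, 2, 4)
    print(f"{sent / 1e9:.1f} GB sent per GPU per step")
```

For 7B parameters with 16-bit gradients on four GPUs this comes to about 21 GB per GPU per step, regardless of batch size. If only a small set of adapter weights is trainable, the same formula applies to just those parameters and the figure shrinks accordingly.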

Pattern 2: Model Parallelism

When your model doesn’t fit on a single GPU, you split the model itself across devices. Different layers live on different GPUs.

import torch.nn as nn

class PipelinedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half on GPU 0
        self.encoder = nn.Sequential(...).to('cuda:0')
        # Second half on GPU 1
        self.decoder = nn.Sequential(...).to('cuda:1')

    def forward(self, x):
        x = self.encoder(x.to('cuda:0'))
        # Copy activations across devices before the second half
        x = self.decoder(x.to('cuda:1'))
        return x  # Output lives on cuda:1

When to use: Large models that exceed single GPU memory. I use this when running inference on 70B+ parameter models across multiple 3090s.

Watch out for: Pipeline bubbles. While GPU 1 processes batch N, GPU 0 sits idle unless you implement micro-batching.

Pattern 3: Pipeline Parallelism with Micro-batching

The solution to pipeline bubbles: split each batch into micro-batches and keep all GPUs busy.

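import torch

# GPipe-style pipeline parallelism (forward pass only)
def forward_with_microbatches(model_chunks, batch, num_microbatches):
    # Assumes each chunk moves its input onto its own device
    microbatches = batch.chunk(num_microbatches)
    outputs = []

    # CUDA kernels are queued asynchronously, so an earlier stage can start
    # the next micro-batch while a later stage is still busy with this one
    for mb in microbatches:
        x = mb
        for chunk in model_chunks:
            x = chunk(x)
        outputs.append(x)

    return torch.cat(outputs)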

Libraries like DeepSpeed and FairScale implement this efficiently with automatic gradient checkpointing and memory optimization.
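
How much micro-batching helps can be estimated ahead of time. For a GPipe-style schedule with p stages and m micro-batches, and assuming every stage takes the same time per micro-batch, the idle "bubble" fraction is (p - 1) / (m + p - 1). The helper below is a minimal sketch of that formula. The function name is hypothetical, and real pipelines with uneven stages will do worse.

```python
from fractions import Fraction

def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle fraction of a GPipe-style schedule with equal-cost stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)

if __name__ == "__main__":
    for m in (1, 4, 8, 32):
        frac = bubble_fraction(2, m)
        print(f"2 stages, {m:>2} micro-batches: {float(frac):.1%} idle")
```

With two stages and no micro-batching, half of the GPU time is idle. Eight micro-batches cut that to one ninth, and the returns diminish from there, which is why the micro-batch count is usually chosen as a small multiple of the stage count.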

Pattern 4: Tensor Parallelism

For massive matrix operations, split the tensors themselves across GPUs. Split column-wise across four GPUs, a 4096x4096 weight matrix becomes four 4096x1024 shards.

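import torch
import torch.distributed as dist
import torch.nn as nn

# Simplified tensor parallel linear layer (column-sharded, forward only)
class TensorParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, world_size, rank):
        super().__init__()
        self.world_size = world_size
        self.rank = rank
        self.shard_size = out_features // world_size
        # Each rank holds one column shard of the full weight matrix
        self.weight = nn.Parameter(
            torch.randn(in_features, self.shard_size)
        )

    def forward(self, x):
        local_output = x @ self.weight
        # All-gather to combine shards. dist.all_gather does not propagate
        # gradients, so training needs an autograd-aware collective.
        gathered = [torch.zeros_like(local_output) for _ in range(self.world_size)]
        dist.all_gather(gathered, local_output)
        return torch.cat(gathered, dim=-1)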

When to use: Individual layers that are too large for single GPU memory. Attention layers in very large transformers.
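
The reason column sharding works is that each output column depends on only one column of the weight matrix, so concatenating the per-shard products reproduces the full product. The toy below checks this on plain Python lists with no GPUs involved. All function names are illustrative, and the shards are processed in a loop where a real setup would run them on separate devices.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_shards(w, world_size):
    """Split w into world_size equal column blocks."""
    cols = len(w[0])
    if cols % world_size != 0:
        raise ValueError("column count must divide evenly across shards")
    size = cols // world_size
    return [[row[r * size:(r + 1) * size] for row in w] for r in range(world_size)]

def tensor_parallel_matmul(x, w, world_size):
    """Compute x @ w shard by shard, then concatenate along the last axis."""
    partials = [matmul(x, shard) for shard in column_shards(w, world_size)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(tensor_parallel_matmul(x, w, world_size=2) == matmul(x, w))
```

The concatenation step here plays the role of the all-gather in the distributed version.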

Handling Heterogeneous Hardware

Real clusters have mixed hardware. Here’s how I handle it in my homelab:

Dynamic Load Balancing

Assign work proportional to GPU capability:

def get_gpu_weights(gpu_info):
    """Weight GPUs by relative performance; the returned weights sum to 1."""
    weights = []
    for gpu in gpu_info:
        if 'RTX 3090' in gpu.name:
            weights.append(1.0)
        elif 'RTX 3080' in gpu.name:
            weights.append(0.75)  # Less VRAM, slightly lower throughput
        else:
            weights.append(0.5)
    total = sum(weights)
    return [w / total for w in weights]

# Distribute the global batch proportionally (base_batch is the global batch size).
# int() truncates, so the sizes can sum to slightly less than base_batch.
batch_sizes = [int(base_batch * w) for w in get_gpu_weights(gpus)]
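
One detail worth handling is rounding: truncating each share can silently drop a few samples from the global batch. A possible fix, sketched below under the assumption that an exact global batch size matters for your run, is largest-remainder rounding. The function name `split_batch` is hypothetical.

```python
def split_batch(global_batch: int, weights: list[float]) -> list[int]:
    """Split global_batch proportionally to weights so the parts sum exactly."""
    total = sum(weights)
    exact = [global_batch * w / total for w in weights]
    sizes = [int(e) for e in exact]  # Round every share down first
    leftover = global_batch - sum(sizes)
    # Hand the remaining samples to the GPUs with the largest fractional parts
    order = sorted(range(len(weights)), key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # Two 3090s (weight 1.0) and two 3080s (weight 0.75), global batch of 64
    print(split_batch(64, [1.0, 1.0, 0.75, 0.75]))
```

With plain truncation the same inputs would give 18, 18, 13, and 13, which adds up to 62 rather than 64.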

Memory-Aware Scheduling

Don’t send a batch that won’t fit:

# estimate_memory and split_and_reschedule are helpers not shown here
def schedule_batch(batch, available_gpus):
    batch_memory = estimate_memory(batch)

    # Try the GPU with the most free memory first
    for gpu in sorted(available_gpus, key=lambda g: g.free_memory, reverse=True):
        if gpu.free_memory > batch_memory * 1.2:  # 20% headroom
            return gpu

    # No GPU has enough memory, split the batch
    return split_and_reschedule(batch, available_gpus)

Thermal Management

Mining taught me that sustained GPU loads generate serious heat. I monitor temperatures and throttle workloads before the hardware throttles itself:

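def check_thermal_headroom(gpu):
    """Return a workload scale factor from the GPU temperature in °C."""
    if gpu.temperature > 80:
        return 0.5  # Reduce workload
    elif gpu.temperature > 70:
        return 0.8
    return 1.0

# Adjust batch sizes based on thermal state, keeping them whole and at least 1
for i, gpu in enumerate(gpus):
    batch_sizes[i] = max(1, int(batch_sizes[i] * check_thermal_headroom(gpu)))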
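
The temperatures have to come from somewhere. One option, sketched here as an assumption and not necessarily how this cluster does it, is to poll `nvidia-smi` with its CSV query flags and parse the output. The function names are hypothetical, and the thresholds are copied from the snippet above.

```python
import subprocess

def parse_temperatures(csv_text: str) -> list[int]:
    """Parse one integer temperature (°C) per non-empty line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def read_gpu_temperatures() -> list[int]:
    """Query all GPUs via nvidia-smi; requires the NVIDIA driver tools."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)

def thermal_scale(temp_c: int) -> float:
    """Same thresholds as the thermal headroom check: 80 °C and 70 °C."""
    if temp_c > 80:
        return 0.5
    if temp_c > 70:
        return 0.8
    return 1.0

def throttled_batch_sizes(batch_sizes: list[int], temps: list[int]) -> list[int]:
    """Scale each GPU's batch size by its thermal factor, never below 1."""
    return [max(1, int(b * thermal_scale(t))) for b, t in zip(batch_sizes, temps)]
```

A training loop could call `read_gpu_temperatures()` every few hundred steps and pass the result to `throttled_batch_sizes`. Polling every step would add avoidable overhead.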

Monitoring and Profiling

You can’t optimize what you can’t measure. Essential metrics:

GPU utilization - Should be >90% during compute
Memory utilization - Track peak usage to right-size batches
Communication time - all-reduce and all-gather overhead
Temperature - Sustained training generates heat

import torch
from torch.profiler import ProfilerActivity

# Using PyTorch profiler
with torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs')
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()  # Tell the profiler a step boundary was reached

Real-World Results

Fine-tuning a 7B parameter model on my homelab cluster (4x RTX 3090):

Naive data parallelism: 52% GPU utilization, 6 hours for full fine-tune
With gradient accumulation: 81% utilization, 3.5 hours
With optimized batch distribution: 93% utilization, 2.8 hours
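
Gradient accumulation, the second item in these results, works because averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch, while only one micro-batch has to sit in memory at a time. The toy below checks that equivalence for a one-parameter linear model with a mean-squared-error loss. It is a pure-Python illustration with made-up function names, not the training loop used for these numbers.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     accum_steps: int) -> float:
    """Accumulate scaled micro-batch gradients before a single optimizer step."""
    n = len(xs)
    if n % accum_steps != 0:
        raise ValueError("batch must divide evenly into micro-batches")
    size = n // accum_steps
    grad = 0.0
    for step in range(accum_steps):
        lo = step * size
        # Dividing by accum_steps keeps the result an average, not a sum
        grad += mse_grad(w, xs[lo:lo + size], ys[lo:lo + size]) / accum_steps
    return grad

if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, accum_steps=2))
```

In a data-parallel setup the gradient synchronization can also be skipped on the intermediate micro-batches, which is one plausible source of a utilization gain like the one reported above.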

The hardware that used to mine Ethereum now trains models that help me write better code. I’d call that a successful pivot.

Key Takeaways

Repurpose what you have. Mining rigs make surprisingly good ML clusters.
Start simple. Data parallelism with DDP handles most use cases.
Profile before optimizing. Find the actual bottleneck.
Manage thermals. Sustained AI training runs hotter than mining.
Train on your own data. A model fine-tuned on your code is incredibly useful.

Running a homelab GPU cluster? Training your own models? I’d love to hear what you’re building. Reach out anytime.