Anurag Gupta

Building a workflow execution service from scratch in Go

2026-04-28

Conquest is a workflow execution service. You POST a workflow (Nextflow, WDL, or Snakemake), it runs on Kubernetes, and you get back results and logs through GA4GH-standard APIs. I've been building it since March, and it's the most technically interesting project I've worked on in a while.

Why build it

I work on a platform where researchers submit scientific pipelines. The execution layer we had was tightly coupled to Nextflow and to our internal infrastructure conventions. I wanted something language-agnostic that exposed standard APIs (WES for workflow submission, TES for individual task execution), ran as a single binary, and could be deployed independently. That last constraint turned out to be the most important architectural driver, because you can't hide complexity behind a service mesh when there is no mesh.

One binary, three roles

The binary takes a --role flag: api, controller, or all. In production you'd run multiple API replicas behind a load balancer and one or two controller instances. For local development, --role all starts everything in one process.
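
A minimal sketch of how a --role flag can gate what a single binary starts. This is not Conquest's actual startup code: the components struct, resolveRole, and the default of "all" are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// components records which parts of the service a role starts (hypothetical type).
type components struct {
	API        bool
	Controller bool
}

// resolveRole maps a --role value to the components it enables.
func resolveRole(role string) (components, error) {
	switch role {
	case "api":
		return components{API: true}, nil
	case "controller":
		return components{Controller: true}, nil
	case "all":
		return components{API: true, Controller: true}, nil
	default:
		return components{}, fmt.Errorf("unknown role %q (want api, controller, or all)", role)
	}
}

func main() {
	// Defaulting to "all" is an assumption, chosen here for local development.
	role := flag.String("role", "all", "which components to run: api, controller, or all")
	flag.Parse()

	c, err := resolveRole(*role)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("api=%v controller=%v\n", c.API, c.Controller)
}
```

Keeping the role-to-component mapping in one pure function means an unknown role fails at startup rather than silently starting nothing.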

The API server is a standard Go HTTP server using chi for routing. It implements the GA4GH WES v1.1 spec: POST /runs (submit), GET /runs/{id} (detailed status), GET /runs/{id}/status (lite status), GET /runs (list), POST /runs/{id}/cancel (cancel). Request validation uses a generated OpenAPI validator so the spec is the source of truth for what's accepted.

The controller is a goroutine that polls Postgres for runs in actionable states (QUEUED, RUNNING, CANCELING) on a configurable interval (default 5 seconds). For each run, it calls the appropriate adapter to advance state. This is a pull-based reconciliation model rather than an event-driven one. Pull is simpler to reason about: if the controller crashes and restarts, it picks up where it left off without needing to replay events. The tradeoff is latency (up to one poll interval between submission and execution start), which is acceptable for workflows that run for minutes to hours.
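
The polling model described above can be sketched roughly like this. It is an illustration under stated assumptions, not the real controller: Run, Store, memStore, and reconcileLoop are hypothetical names, and the in-memory store stands in for the Postgres query.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Run is a hypothetical stand-in for the service's run model.
type Run struct {
	ID    string
	State string
}

// Store is a hypothetical interface; a real implementation would query Postgres
// for runs in actionable states.
type Store interface {
	ActionableRuns(ctx context.Context) ([]Run, error)
}

// reconcileLoop calls reconcile for every actionable run once per interval until
// ctx is cancelled. It keeps no state between ticks, so a restarted controller
// simply resumes from whatever the store returns.
func reconcileLoop(ctx context.Context, s Store, interval time.Duration,
	reconcile func(context.Context, Run) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runs, err := s.ActionableRuns(ctx)
			if err != nil {
				fmt.Println("list runs:", err)
				continue // try again on the next tick
			}
			for _, r := range runs {
				if err := reconcile(ctx, r); err != nil {
					fmt.Printf("reconcile %s: %v\n", r.ID, err)
				}
			}
		}
	}
}

// memStore is an in-memory Store used only for this demonstration.
type memStore struct{ runs []Run }

func (m memStore) ActionableRuns(context.Context) ([]Run, error) { return m.runs, nil }

// runDemo runs the loop for a short time and returns how many reconcile calls happened.
func runDemo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	s := memStore{runs: []Run{{ID: "run-1", State: "QUEUED"}}}
	calls := 0
	reconcileLoop(ctx, s, 10*time.Millisecond, func(_ context.Context, r Run) error {
		calls++
		return nil
	})
	return calls
}

func main() {
	fmt.Println("reconcile calls:", runDemo())
}
```

The demo uses a 10 ms interval so it finishes quickly; the default described in the post is 5 seconds.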

The adapter interface

Each workflow language needs different handling. The adapter interface is small:

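type Adapter interface {
    // Prepare stages input files and writes engine-specific config.
    Prepare(ctx context.Context, run *model.Run, workDir string) error
    // BuildJob returns the Kubernetes Job (batchv1 = k8s.io/api/batch/v1) that runs the engine.
    BuildJob(run *model.Run, workDir string) (*batchv1.Job, error)
    // ParseOutputs normalizes the engine's output metadata after completion.
    ParseOutputs(run *model.Run, workDir string) ([]model.Output, error)
    // ChildPods lists the pods (corev1 = k8s.io/api/core/v1) the engine spawned for this run.
    ChildPods(ctx context.Context, run *model.Run) ([]corev1.Pod, error)
}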

Prepare stages input files and writes language-specific config. For Nextflow, it writes nextflow.config with the executor set to k8s and the work directory pointed at the S3 bucket. For WDL, it writes the inputs JSON with file paths resolved to S3 URIs. For Snakemake, it writes a profile YAML with the Kubernetes executor plugin config.

BuildJob returns a Kubernetes Job spec. The Job runs the workflow engine binary (nextflow run, miniwdl run, snakemake --executor kubernetes) in a container that has the engine installed plus access to the cluster's service account for spawning child pods. The pod template's restartPolicy is Never because workflow engines have their own retry logic and we don't want Kubernetes retrying a half-completed pipeline. Never only rules out in-place container restarts; the Job's backoffLimit also has to be 0, otherwise the Job controller creates a replacement pod after a failure.

ChildPods is where it gets interesting. Nextflow with the k8s executor doesn't run the entire pipeline in one pod. It spawns a pod per process invocation, so a "running" Nextflow job might have 50 child pods across the cluster. The controller needs to track them to report accurate status. ChildPods lists pods with a label selector that each engine sets on its child pods (Nextflow uses nextflow.io/runName, for example). The controller aggregates child pod statuses to decide whether the parent run is still RUNNING, has reached COMPLETE, or has failed with EXECUTOR_ERROR.
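
To make the aggregation step concrete, here is a deliberately naive sketch using WES state names. The rule itself is an assumption (any failed child fails the run, any unfinished child keeps it running), and PodPhase is a local stand-in for the Kubernetes pod phase so the example has no dependencies. A real controller would also have to account for engine-level retries and for the parent Job's own status.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings (local stand-in type).
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// aggregate derives a run state from child pod phases.
// Assumed rule: one failed child fails the run; any child that has not
// succeeded keeps the run RUNNING; only all-succeeded yields COMPLETE.
func aggregate(phases []PodPhase) string {
	if len(phases) == 0 {
		return "RUNNING" // the engine has not spawned any children yet
	}
	active := false
	for _, p := range phases {
		switch p {
		case PodFailed:
			return "EXECUTOR_ERROR"
		case PodSucceeded:
			// finished, nothing to record
		default:
			active = true // Pending, Running, or an unrecognized phase
		}
	}
	if active {
		return "RUNNING"
	}
	return "COMPLETE"
}

func main() {
	fmt.Println(aggregate([]PodPhase{PodSucceeded, PodRunning}))   // RUNNING
	fmt.Println(aggregate([]PodPhase{PodSucceeded, PodSucceeded})) // COMPLETE
	fmt.Println(aggregate([]PodPhase{PodSucceeded, PodFailed}))    // EXECUTOR_ERROR
}
```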

ParseOutputs reads the workflow engine's output metadata after completion. Nextflow writes a .nextflow.log and publishes outputs to a configured directory. WDL writes an outputs.json. Snakemake's outputs are just the files matching the target rule's output pattern. The adapter normalizes these into a common []model.Output slice.

State machine in Postgres

Every run has a row in a runs table. The state transitions are: QUEUED -> INITIALIZING -> RUNNING -> COMPLETE | EXECUTOR_ERROR | SYSTEM_ERROR | CANCELED, with cancellation passing through CANCELING before it reaches CANCELED. Transitions are enforced in Go code (a validTransitions map), and every state change is a single UPDATE with a WHERE clause that includes the current state. If two controller replicas race on the same transition, only one UPDATE matches a row and the other affects zero rows.
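
A sketch of the two pieces just described: a transition table and a compare-and-swap UPDATE. The map encodes only the linear chain as written and leaves out cancellation and early-failure edges, which the real map presumably has. The column names (id, state), the SQL text, and the transition helper are assumptions; the helper uses database/sql and is not exercised by the demo because that would need a live database.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

type State string

const (
	Queued        State = "QUEUED"
	Initializing  State = "INITIALIZING"
	Running       State = "RUNNING"
	Complete      State = "COMPLETE"
	ExecutorError State = "EXECUTOR_ERROR"
	SystemError   State = "SYSTEM_ERROR"
	Canceled      State = "CANCELED"
)

// validTransitions lists the allowed next states for each state (partial, illustrative).
var validTransitions = map[State][]State{
	Queued:       {Initializing},
	Initializing: {Running},
	Running:      {Complete, ExecutorError, SystemError, Canceled},
}

// canTransition reports whether from -> to is in the table.
func canTransition(from, to State) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

// casUpdate only matches the row if it is still in the expected state.
// Table and column names are assumed.
const casUpdate = `UPDATE runs SET state = $2 WHERE id = $1 AND state = $3`

var errLostRace = errors.New("run was not in the expected state")

// transition applies the state change, or returns errLostRace if another
// writer changed the row first.
func transition(ctx context.Context, db *sql.DB, id string, from, to State) error {
	if !canTransition(from, to) {
		return fmt.Errorf("invalid transition %s -> %s", from, to)
	}
	res, err := db.ExecContext(ctx, casUpdate, id, string(to), string(from))
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return errLostRace
	}
	return nil
}

func main() {
	fmt.Println(canTransition(Queued, Initializing)) // true
	fmt.Println(canTransition(Queued, Complete))     // false
}
```

The zero-rows-affected check is what turns a lost race into an explicit error instead of a silent overwrite.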

I considered CRDs (Custom Resource Definitions) for state, which is what many Kubernetes operators do. Postgres won because the query patterns are relational and predictable: "give me all runs in state X older than Y created by user Z." That's a WHERE clause with an index, not a custom informer with a cache. The tradeoff is that the controller needs a database connection, but it already needs one for the WES API, so there's no new dependency.

Logs are stored in a JSONB column that accumulates entries: controller log lines, child pod events, engine stdout/stderr chunks. The WES spec's GET /runs/{id} returns a run_log field, and this column maps directly to it.

File staging

This was the unexpected time sink. Workflows reference input files, and those files need to be in the right place before execution starts. "The right place" is an S3-compatible bucket most of the time, but the files arrive as:

  1. HTTP/HTTPS URLs (download and stage to S3)
  2. s3:// URIs (already in the right place, maybe, or in a different bucket that needs cross-bucket copy)
  3. Inline content in the workflow attachment (base64-encoded in the WES submission, needs to be written to S3)
  4. Relative paths referencing other files in the submission bundle

The staging layer resolves all of these into a consistent S3 prefix: s3://{bucket}/runs/{run_id}/inputs/. Downloads use a worker pool (configurable concurrency, default 4) because genomics input files can be tens of gigabytes. Each download is checksummed (MD5, because S3 reports the MD5 digest as the ETag for single-part, non-KMS uploads; multipart uploads get a composite ETag that is not a plain MD5 of the object) and retried on failure with exponential backoff.
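
The worker pool, retry, and checksum steps can be sketched as follows. This is a simplified illustration, not the real staging layer: fetchFunc, result, fetchWithRetry, and stageAll are hypothetical names, the retry count and base delay are arbitrary, and the sketch buffers data in memory where a real stager would stream to S3.

```go
package main

import (
	"context"
	"crypto/md5"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// fetchFunc is a hypothetical downloader returning the bytes of one input.
type fetchFunc func(ctx context.Context, src string) ([]byte, error)

// result is the outcome of staging one input.
type result struct {
	Src, MD5 string
	Err      error
}

// fetchWithRetry calls fetch up to attempts times, doubling the delay each time.
func fetchWithRetry(ctx context.Context, fetch fetchFunc, src string, attempts int, base time.Duration) ([]byte, error) {
	var lastErr error
	delay := base
	for i := 0; i < attempts; i++ {
		data, err := fetch(ctx, src)
		if err == nil {
			return data, nil
		}
		lastErr = err
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return nil, fmt.Errorf("%s: giving up after %d attempts: %w", src, attempts, lastErr)
}

// stageAll fetches every source using a fixed number of worker goroutines.
func stageAll(ctx context.Context, fetch fetchFunc, srcs []string, workers int) []result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	results := make([]result, len(srcs)) // each worker writes a distinct index
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				r := result{Src: srcs[i]}
				data, err := fetchWithRetry(ctx, fetch, srcs[i], 3, 10*time.Millisecond)
				if err != nil {
					r.Err = err
				} else {
					sum := md5.Sum(data)
					r.MD5 = hex.EncodeToString(sum[:])
				}
				results[i] = r
			}
		}()
	}
	for i := range srcs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

// demo stages two fake inputs; each one fails once before succeeding.
func demo() []result {
	var mu sync.Mutex
	seen := map[string]bool{}
	flaky := func(_ context.Context, src string) ([]byte, error) {
		mu.Lock()
		defer mu.Unlock()
		if !seen[src] {
			seen[src] = true
			return nil, errors.New("transient failure")
		}
		return []byte("hello"), nil
	}
	return stageAll(context.Background(), flaky, []string{"https://example.org/a", "https://example.org/b"}, 4)
}

func main() {
	for _, r := range demo() {
		fmt.Println(r.Src, r.MD5, r.Err)
	}
}
```

Results are written by index rather than sent over a channel, so the output order matches the input order regardless of which worker finishes first.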

The cleanup layer runs on terminal state transitions (COMPLETE, EXECUTOR_ERROR, CANCELED). It deletes the run's S3 prefix for intermediate files but preserves the outputs prefix. Cleanup is async (fires a goroutine) so it doesn't block the state transition response.
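
A rough sketch of the cleanup selection and the fire-and-forget goroutine. The key layout (runs/{run_id}/outputs/ next to the inputs prefix), the choice to delete everything outside outputs, and the names keysToDelete and cleanupAsync are assumptions for illustration; the delete callback stands in for an S3 client call.

```go
package main

import (
	"fmt"
	"strings"
)

// keysToDelete returns the object keys under a run's prefix that cleanup
// should remove, skipping everything under the (assumed) outputs prefix.
func keysToDelete(runID string, keys []string) []string {
	runPrefix := "runs/" + runID + "/"
	keep := runPrefix + "outputs/"
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, runPrefix) && !strings.HasPrefix(k, keep) {
			out = append(out, k)
		}
	}
	return out
}

// cleanupAsync deletes the selected keys in a goroutine so the caller is not
// blocked. The returned channel is closed when cleanup finishes.
func cleanupAsync(runID string, keys []string, del func(key string) error) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for _, k := range keysToDelete(runID, keys) {
			if err := del(k); err != nil {
				fmt.Println("cleanup:", k, err) // log and keep going
			}
		}
	}()
	return done
}

func main() {
	keys := []string{
		"runs/r1/inputs/sample.fastq",
		"runs/r1/work/ab/cdef/.command.sh",
		"runs/r1/outputs/result.vcf",
		"runs/r2/inputs/other.fastq",
	}
	done := cleanupAsync("r1", keys, func(k string) error {
		fmt.Println("delete", k)
		return nil
	})
	<-done // the state transition would not wait; the demo does so it can print
}
```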

What's left

Multi-tenant isolation (right now every run uses the same Kubernetes namespace and service account), quotas (CPU/memory limits per user), and a proper TES API implementation (currently the TES layer is internal-only). It's MIT-licensed and the README is intentionally detailed about architecture, because I'd like people to actually deploy it rather than just star it.