Product Manager - Networking

3 hours ago

Fluidstack

San Francisco, US · Full-time · $175,000 – $275,000

About this role

Fluidstack exists to make humanity more free by building civilization-scale infrastructure for AI. We are hiring a Product Manager to own the tools and systems our team uses to design, deploy, operate, and remediate the networks that run our GPU clusters. Speed and scale are our key differentiators.

You will own the product roadmap for all internal networking tooling: design automation, provisioning, observability, and incident remediation workflows. You will drive strategy for digital twin tooling and BOM generators. You will define observability stack requirements for network telemetry ingestion.

You will work alongside network engineers and site operations to map the full lifecycle of a network event from detection through remediation. You will partner with infrastructure and software engineering teams to integrate networking tooling into the broader cluster lifecycle. The networking team trusts your judgment because you have earned it technically.

This is not a role for someone who hands requirements to engineers and waits. You will be the person with the clearest opinion on what needs to be built and why the current state is broken. Come be a part of building civilization-scale infrastructure for AI.

Requirements

Working mental model of how a 400G spine-leaf fabric is cabled and what gRPC-based telemetry looks like at 10,000 devices.
Hands-on experience with network gear, streaming telemetry, or large-scale fabric automation.
Fluency in the technology underlying network design, automation, and configuration generation.
Understanding of why config generation is harder than it sounds, including correctness guarantees and rollback support.
Experience with InfiniBand and RoCEv2 congestion patterns, all-reduce bottlenecks, and east-west bandwidth profiling.
Ability to own a product roadmap and drive technical strategy with clear opinions on architecture and priorities.
Proven track record of earning technical trust from engineers through hands-on knowledge and judgment.

Responsibilities

Own the product roadmap for all internal networking tooling: design automation, provisioning, observability, performance analysis, and incident remediation workflows across frontend, backend, OOB, and BMS networks.
Drive the strategy and requirements for digital twin tooling that models physical fabric topology, enabling engineers to validate designs, simulate failures, and test config changes before touching production.
Define and ship BOM generators that produce accurate, version-controlled bills of materials for frontend Ethernet, backend Ethernet, InfiniBand, and OOB networks, with each BOM tied directly to cluster topology specs.
Own the configuration generation pipeline: translate high-level cluster designs into device-ready configs across switches, routers, and OOB management infrastructure, with correctness guarantees and rollback support.
Define the observability stack requirements for ingesting network telemetry (gNMI, SNMP, streaming) into dashboards and alerting systems that give operators sub-minute visibility into fabric health and performance degradation.
Define performance profiling tooling that surfaces InfiniBand and RoCEv2 congestion, all-reduce bottlenecks, and east-west bandwidth saturation at the GPU job level.
Work with network engineers and site operations to map the full lifecycle of a network event from detection through remediation, then build the tooling that compresses mean time to resolution.
Partner with infrastructure and software engineering teams to integrate networking tooling into the broader cluster lifecycle.

Benefits

Impact: work on building civilization-scale infrastructure for AI that expands human freedom.
Ownership: own the product roadmap for critical networking tooling across the entire fabric stack.
Technical challenge: deep dive into large-scale fabric automation, digital twins, and observability at 10,000+ devices.
Autonomy: be the person with the clearest opinion on what needs to be built and why the current state is broken.