The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the costs of running distributed AI jobs remain grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
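To put the mean-time-to-failure figures above in daily terms, the short Python sketch below converts each quoted MTTF into an average number of failures per 24 hours. It assumes a constant failure rate, which is a simplification made here rather than a claim from the research; the function name is illustrative only.

```python
HOURS_PER_DAY = 24.0

# MTTF figures quoted in the text (hours), keyed by cluster size in GPUs.
MTTF_HOURS = {1_024: 7.9, 16_384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in 24 hours, assuming a constant failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


for gpus, mttf in MTTF_HOURS.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: MTTF {mttf} h -> about {rate:.1f} failures/day")
```

Under that assumption, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster roughly thirteen.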

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
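The cost of a rollback can be approximated with a simple model, sketched below in Python. It assumes failures land uniformly at random within a checkpoint interval, so on average half an interval of work is discarded, plus a fixed overhead for intervention, reprovisioning, and restart. The checkpoint interval and restart overhead used in the example are hypothetical inputs, not figures from this announcement.

```python
def expected_loss_per_failure_min(checkpoint_interval_min: float,
                                  restart_overhead_min: float) -> float:
    """Expected minutes lost per failure under checkpoint-restart.

    Assumes the failure time is uniform within a checkpoint interval,
    so half an interval of completed work is discarded on average.
    """
    if checkpoint_interval_min <= 0 or restart_overhead_min < 0:
        raise ValueError("interval must be positive and overhead non-negative")
    return checkpoint_interval_min / 2 + restart_overhead_min


def lost_minutes_per_day(mttf_hours: float,
                         checkpoint_interval_min: float,
                         restart_overhead_min: float) -> float:
    """Expected minutes lost per day, given a mean time to failure in hours."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    failures_per_day = 24.0 / mttf_hours
    per_failure = expected_loss_per_failure_min(checkpoint_interval_min,
                                                restart_overhead_min)
    return failures_per_day * per_failure


# Hypothetical example: 30-minute checkpoints, 15-minute restart overhead,
# and the 7.9-hour MTTF quoted for a 1,024-GPU cluster.
print(f"{lost_minutes_per_day(7.9, 30.0, 15.0):.0f} minutes lost per day")
```

The model makes the trade-off visible: shorter checkpoint intervals reduce the work discarded per failure, but they do nothing about the fixed restart overhead.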

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
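The three scenarios above can be read as a small taxonomy of triggers. The Python sketch below encodes that taxonomy using the example events named in the list. It is purely illustrative: the event labels, the `MigrationKind` enum, and the `classify` function are hypothetical names invented here, and none of this represents the TorchPass API or its internal logic.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, taken from the examples in the list above.
_EVENT_TO_KIND = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_error": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "workload_rebalancing": MigrationKind.PLANNED,
}


def classify(event: str) -> MigrationKind:
    """Map an event label to the migration scenario it would fall under."""
    kind = _EVENT_TO_KIND.get(event)
    if kind is None:
        raise ValueError(f"unknown event: {event!r}")
    return kind


print(classify("ecc_memory_error").value)
```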

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
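The two figures in that claim are consistent with each other, as the brief Python check below shows: a 95% reduction applied to three hours leaves nine minutes. The sketch also expresses each figure as a share of a 24-hour day. The function names are illustrative only.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR


def remaining_lost_minutes(baseline_hours_per_day: float, reduction: float) -> float:
    """Minutes still lost per day after applying a fractional reduction."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_hours_per_day * MINUTES_PER_HOUR * (1.0 - reduction)


def lost_fraction_of_day(lost_minutes: float) -> float:
    """Share of a 24-hour day represented by the lost minutes."""
    return lost_minutes / MINUTES_PER_DAY


before_min = 3.0 * MINUTES_PER_HOUR                # about three hours per day
after_min = remaining_lost_minutes(3.0, 0.95)      # after a 95% reduction

print(f"before: {before_min:.0f} min/day ({lost_fraction_of_day(before_min):.1%} of the day)")
print(f"after:  {after_min:.0f} min/day ({lost_fraction_of_day(after_min):.2%} of the day)")
```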

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
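For readers unfamiliar with the metric used in that benchmark, Model FLOPs Utilization is commonly defined as the model FLOPs a job actually achieves per second divided by the theoretical peak of the hardware it runs on. The Python sketch below shows the calculation under that common definition. Every number in the example is a hypothetical placeholder rather than a measurement from the test described above, and the function and parameter names are illustrative.

```python
def mfu(tokens_per_second: float,
        model_flops_per_token: float,
        num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over peak hardware FLOPs/s."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = tokens_per_second * model_flops_per_token
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical placeholder values, chosen only to illustrate the formula.
example = mfu(
    tokens_per_second=1_000.0,
    model_flops_per_token=1e9,
    num_gpus=10,
    peak_flops_per_gpu=1e12,
)
print(f"MFU = {example:.0%}")
```

Time spent stalled on a failure or a restart contributes no tokens, so it lowers the numerator while the denominator stays fixed, which is why interruptions depress MFU.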

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

