Artifacts lost when lost communication with the server #170280
Replies: 2 comments
Hi @david-thrower, that's a frustrating issue. Losing artifacts and logs because the runner lost communication with the server is a serious problem. Here are a few ideas and things to try. What's happening (based on your description)
You hit the nail on the head with your diagnosis: this is almost certainly caused by the runner running out of memory (OOM). When you run heavy workloads (like the hyperparameter optimization in your screenshot) and RAM is exhausted, the host OS's OOM killer steps in and forcefully terminates processes to keep the machine alive. Unfortunately, the process it kills is often the GitHub Actions runner agent itself. Here is a breakdown of why you are seeing those specific symptoms:
Recommendations for your ML workload:
Short term: You likely need to switch to GitHub's larger hosted runners, which have more RAM, to prevent the OOM crash.
Architectural improvement: Instead of relying on runner stdout logs, consider integrating an external experiment tracker (like Weights & Biases, MLflow, or Comet). These libraries stream your trial metrics, hyperparameters, and checkpoints directly to an external server during the run. If the GitHub runner gets OOM-killed on trial 8, the data for trials 1-7 is already safe on your tracking server.
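To make the "trials 1-7 are already safe" idea concrete, here is a minimal sketch of recording each trial as soon as it finishes, rather than collecting everything at the end of the job. All names in it (`make_jsonl_sink`, `run_trials`, `load_completed`) are hypothetical and not taken from any tracker's API. The sink shown writes to a local JSONL file only so the example is self-contained. A file on the runner would not survive a lost runner, so in practice the sink would be replaced by a call into whichever tracking client you choose.

```python
import json
import os
from typing import Callable, Dict, Iterable, List


def make_jsonl_sink(path: str) -> Callable[[Dict], None]:
    """Return a sink that appends one JSON line per trial and forces it to disk.

    Hypothetical stand-in for an external tracker client: swap this out for
    a function that sends the record off the runner.
    """
    def sink(record: Dict) -> None:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # do not leave the record in OS buffers
    return sink


def run_trials(
    trials: Iterable[Dict],
    objective: Callable[[Dict], float],
    sink: Callable[[Dict], None],
) -> List[Dict]:
    """Evaluate each trial and hand its result to the sink before starting the next."""
    results = []
    for number, params in enumerate(trials, start=1):
        score = objective(params)
        record = {"trial": number, "params": params, "score": score}
        sink(record)  # persisted immediately, so a later crash cannot lose it
        results.append(record)
    return results


def load_completed(path: str) -> List[Dict]:
    """Read back every trial that was recorded before the process stopped."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

The design choice is simply that the write happens inside the loop: if the process dies during a later trial, everything the sink already accepted can still be read back with `load_completed` (or, with a real tracker, viewed on the tracking server).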
Why are you starting this discussion?
Bug
What GitHub Actions topic or product is this about?
Metrics & Insights
Discussion Details
Artifacts are lost when the error "The hosted runner lost communication with the server" occurs with a job.
https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/17053636900/job/48346863287 failed, probably because RAM was exhausted.
["Gear button"] > ["view raw logs"] returns the error: "Failed to generate URL to download logs."
Additionally: Inconsistencies in the run status as presented by the UI
timeout-minutes: 420.