Ajit Kumar

Why Your Deep Learning Job Dies After SSH Logout — A Practical Guide to Persistent Linux Sessions

You SSH into a remote server.
You start a long-running deep learning training job:

python train.py

Everything works. GPU utilization is high. Logs are printing.

Then your laptop sleeps.
Or WiFi drops.
Or you close the terminal.

You reconnect to the server…

Your training job is gone.

But on another server, your training continues even after you disconnect.

Why?

This is not random behavior. It is deterministic Unix session management.

Let’s break it down precisely.


The Root Cause: Controlling Terminals and SIGHUP

When you SSH into a Linux server:

  1. SSH creates a login session.
  2. A shell process (e.g., bash) is started.
  3. That shell is attached to a controlling terminal (TTY).
  4. Any command you run becomes a child process of that shell.

Process tree example:

sshd
 └── bash
      └── python train.py

When your SSH session ends:

  • The terminal disappears.
  • The shell receives SIGHUP (the hangup signal).
  • The shell forwards SIGHUP to its child processes.
  • Your Python training process exits.

This is default POSIX behavior.

The operating system is doing exactly what it is designed to do.
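
You can watch this happen with a tiny experiment (a sketch; the script name and log path are arbitrary). Run it in the foreground over SSH, close the terminal window or kill the ssh client, reconnect, and check the log:

#!/bin/bash
# sighup_demo.sh: record when this process receives SIGHUP
trap 'echo "SIGHUP received at $(date)" >> ~/sighup.log; exit 1' HUP

# Stay in the foreground so the process remains attached to the terminal
while true; do sleep 60; done

The log entry appears at the moment the terminal goes away, which is exactly when an unprotected training job would have died.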


Why It Works on Another Server

If training survives logout on another machine, one of these is true:

  • It was started inside tmux
  • It was started inside screen
  • It was launched using nohup
  • It was disowned
  • It was started as a systemd service
  • The SSH daemon is configured differently (rare but possible)

The most common reason: tmux is being used.


Real-World Scenario (Deep Learning Context)

Imagine this:

  • You launch a 3-day transformer training run.
  • You’re using expensive GPU hardware.
  • Model checkpoints save every 2 hours.
  • Your internet drops after 4 hours.

Without proper session handling:

  • Training dies immediately.
  • GPU memory is released.
  • You lose progress since the last checkpoint.
  • Compute time is wasted.
  • Experiment reproducibility is impacted.

In production ML workflows, this is unacceptable.


Proper Solutions (Ranked by Maturity)

1️⃣ tmux — The Research Standard

tmux is a terminal multiplexer that keeps sessions alive on the server. Your training process runs inside the tmux session, not directly under your SSH connection.

Install:

sudo apt install tmux

Start session:

tmux new -s training

Run training:

python train.py

Detach safely:

Ctrl + B, then D

Logout freely. Training continues.

Reconnect later:

tmux attach -t training

Why this works:

  • The tmux server runs in its own session, detached from your SSH login, and provides the terminal your process is attached to.
  • An SSH disconnect only detaches the tmux client; the server keeps running.
  • Your process remains alive.

This is the recommended workflow for ML engineers and researchers.
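
A few other session-management commands worth knowing (all standard tmux):

tmux ls                        # list sessions running on this server
tmux attach -t training        # reattach to a named session
tmux kill-session -t training  # remove the session once the run is done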


2️⃣ nohup — Minimal Fix

nohup python train.py > train.log 2>&1 &

What it does:

  • Ignores SIGHUP
  • Redirects output
  • Runs in background

Good for:

  • Simple batch jobs
  • Fire-and-forget scripts

Downside:

  • No way to reattach interactively; you can only follow the log file
  • Process management is manual (find the PID, send signals yourself)
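
Managing a nohup job afterwards looks roughly like this (the log name matches the command above; the PID is whatever pgrep reports):

tail -f train.log        # follow training output
pgrep -af "train.py"     # find the PID of the detached job
kill <PID>               # stop it when needed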

3️⃣ disown — Shell-Level Detachment

python train.py &
disown

This removes the job from the shell's job table, so the shell will not forward SIGHUP to it when the session ends.

Less robust than tmux but works.
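
In practice, redirect output before backgrounding and disowning, so the process is not left writing to a terminal that no longer exists:

python train.py > train.log 2>&1 &
disown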


4️⃣ systemd — Production-Grade

For structured environments:

Create a service file:

/etc/systemd/system/ml-training.service
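
A minimal sketch of what that unit might contain (the user, paths, and restart policy below are assumptions; adapt them to your environment):

# /etc/systemd/system/ml-training.service
[Unit]
Description=ML training job
After=network.target

[Service]
Type=simple
User=mluser
WorkingDirectory=/home/mluser/project
ExecStart=/usr/bin/python3 train.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target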

Then:

sudo systemctl daemon-reload
sudo systemctl start ml-training

Benefits:

  • Auto restart on failure
  • Resource control
  • Logging via journalctl
  • Reproducible startup
  • Can run at boot

This is appropriate for:

  • Production ML pipelines
  • Enterprise GPU clusters
  • Persistent inference services

How to Diagnose Your Current Setup

Before disconnecting, check:

ps -o pid,ppid,cmd -p <python_pid>

If the parent process (PPID) is your shell → it will die when SSH disconnects.

Check if you are inside tmux:

echo $TMUX

If empty → you are not using tmux.
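
To see the full ancestry in one command (pstree ships in the psmisc package and may need installing):

pstree -s -p <python_pid>    # print the parent chain, e.g. sshd -> bash -> python

If the chain runs through your interactive bash under sshd, the job dies on logout; if it runs through tmux or systemd, it survives.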


Server Admin Configuration Considerations

From the system administration side, several settings affect whether user processes survive a logout.

1️⃣ SSH Daemon Settings

File:

/etc/ssh/sshd_config

Relevant options:

  • ClientAliveInterval
  • ClientAliveCountMax
  • TCPKeepAlive

These control idle timeout behavior — but they do NOT prevent SIGHUP when a session closes.

They only affect when the connection is dropped.
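
A typical keepalive block looks like this (values are illustrative); it keeps idle connections from being dropped, nothing more:

# /etc/ssh/sshd_config
# Ping the client every 60 seconds; disconnect after 3 missed replies
ClientAliveInterval 60
ClientAliveCountMax 3
TCPKeepAlive yes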


2️⃣ systemd User Session Settings

Modern Linux distributions use systemd-logind.

File:

/etc/systemd/logind.conf
Enter fullscreen mode Exit fullscreen mode

Key option:

KillUserProcesses=yes

If enabled:

  • All user processes are killed when the session ends.
  • Even tmux sessions may die.

To allow persistent processes:

KillUserProcesses=no

After change:

sudo systemctl restart systemd-logind

This setting is critical on shared GPU servers.
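
To check what a given server is doing (the upstream default is KillUserProcesses=no, but distributions and admins can override it):

grep -i '^KillUserProcesses' /etc/systemd/logind.conf
# No output means the line is commented out and the built-in default applies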


3️⃣ Resource Policies

Admins may enforce:

  • cgroups limits
  • SLURM job scheduler policies
  • Idle job cleanup scripts

On managed clusters, processes outside scheduler control may be automatically terminated.

In HPC environments, users should launch jobs via:

  • sbatch
  • srun
  • qsub
  • Kubernetes job manifests

Not direct SSH shells.
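
With SLURM, for example, the job goes into a batch script and is submitted with sbatch. A minimal sketch (the GPU count, time limit, and filenames are assumptions; use your cluster's conventions):

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:1
#SBATCH --time=72:00:00
#SBATCH --output=train_%j.log

python train.py

Once submitted, the scheduler (not your SSH session) owns the process, and logging out has no effect on the job.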


When Admin Intervention Is Required

If:

  • tmux sessions still die on logout
  • background processes terminate unexpectedly
  • training stops even with nohup

Then the admin should verify:

  1. KillUserProcesses setting
  2. PAM session configuration
  3. Custom logout hooks
  4. Cluster job scheduler enforcement
  5. cgroup cleanup policies

On well-configured ML servers, tmux should work without special privileges.


Best Practice for ML Engineers

For experimentation:

  • Always use tmux
  • Enable periodic checkpointing
  • Log to file
  • Monitor GPU usage via nvidia-smi
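
For the monitoring point, a simple option is to keep this running in a second tmux window (the interval is arbitrary):

watch -n 60 nvidia-smi    # refresh GPU utilization and memory readings every 60 seconds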

For production:

  • Use systemd or a scheduler
  • Configure restart policies
  • Use structured logging
  • Implement health checks

Final Takeaway

Your training stops because:

It is attached to a shell that receives SIGHUP when SSH disconnects.

This is expected behavior.

The solution is not “keep SSH open.”

The solution is:

  • Decouple your process from the SSH session.
  • Use tmux, nohup, or systemd appropriately.
  • Ensure systemd-logind is not killing user processes.

Once configured correctly, your deep learning jobs should survive network instability, laptop sleep, and terminal closure — exactly as they should in a robust ML workflow.

