Ajit Kumar

Why Your Deep Learning Job Dies After SSH Logout — A Practical Guide to Persistent Linux Sessions

You SSH into a remote server.
You start a long-running deep learning training job:

python train.py

Everything works. GPU utilization is high. Logs are printing.

Then your laptop sleeps.
Or WiFi drops.
Or you close the terminal.

You reconnect to the server…

Your training job is gone.

But on another server, your training continues even after you disconnect.

Why?

This is not random behavior. It is deterministic Unix session management.

Let’s break it down precisely.


The Root Cause: Controlling Terminals and SIGHUP

When you SSH into a Linux server:

  1. SSH creates a login session.
  2. A shell process (e.g., bash) is started.
  3. That shell is attached to a controlling terminal (TTY).
  4. Any command you run becomes a child process of that shell.

Process tree example:

sshd
 └── bash
      └── python train.py

When your SSH session ends:

  • The terminal disappears.
  • The shell receives SIGHUP (the hangup signal).
  • The shell forwards SIGHUP to its child processes.
  • Your Python training process exits.

This is default POSIX behavior.

The operating system is doing exactly what it is designed to do.
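
You can watch this happen with a tiny experiment (a sketch; the script name and log path are arbitrary). Run it in the foreground over SSH, close the terminal window or kill the ssh client, reconnect, and check the log:

#!/bin/bash
# sighup_demo.sh: record when this process receives SIGHUP
trap 'echo "SIGHUP received at $(date)" >> ~/sighup.log; exit 1' HUP

# Stay in the foreground so the process remains attached to the terminal
while true; do sleep 60; done

The log entry appears at the moment the terminal goes away, which is exactly when an unprotected training job would have died.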


Why It Works on Another Server

If training survives logout on another machine, one of these is true:

  • It was started inside tmux
  • It was started inside screen
  • It was launched using nohup
  • It was disowned
  • It was started as a systemd service
  • The SSH daemon is configured differently (rare but possible)

The most common reason: tmux is being used.


Real-World Scenario (Deep Learning Context)

Imagine this:

  • You launch a 3-day transformer training run.
  • You’re using expensive GPU hardware.
  • Model checkpoints save every 2 hours.
  • Your internet drops after 4 hours.

Without proper session handling:

  • Training dies immediately.
  • GPU memory is released.
  • You lose progress since the last checkpoint.
  • Compute time is wasted.
  • Experiment reproducibility is impacted.

In production ML workflows, this is unacceptable.


Proper Solutions (Ranked by Maturity)

1️⃣ tmux — The Research Standard

tmux is a terminal multiplexer that keeps sessions alive on the server. Your training process runs inside the tmux session, not directly under your SSH connection.

Install:

sudo apt install tmux

Start session:

tmux new -s training

Run training:

python train.py

Detach safely:

Ctrl + B, then D

Logout freely. Training continues.

Reconnect later:

tmux attach -t training

Why this works:

  • The tmux server runs in its own session, detached from your SSH login, and provides the terminal your process is attached to.
  • An SSH disconnect only detaches the tmux client; the server keeps running.
  • Your process remains alive.

This is the recommended workflow for ML engineers and researchers.
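
A few other session-management commands worth knowing (all standard tmux):

tmux ls                        # list sessions running on this server
tmux attach -t training        # reattach to a named session
tmux kill-session -t training  # remove the session once the run is done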


2️⃣ nohup — Minimal Fix

nohup python train.py > train.log 2>&1 &

What it does:

  • Ignores SIGHUP
  • Redirects output
  • Runs in background

Good for:

  • Simple batch jobs
  • Fire-and-forget scripts

Downside:

  • No way to reattach interactively; you can only follow the log file
  • Process management is manual (find the PID, send signals yourself)
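
Managing a nohup job afterwards looks roughly like this (the log name matches the command above; the PID is whatever pgrep reports):

tail -f train.log        # follow training output
pgrep -af "train.py"     # find the PID of the detached job
kill <PID>               # stop it when needed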

3️⃣ disown — Shell-Level Detachment

python train.py &
disown

This removes the job from the shell's job table, so the shell will not forward SIGHUP to it when the session ends.

Less robust than tmux but works.
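
In practice, redirect output before backgrounding and disowning, so the process is not left writing to a terminal that no longer exists:

python train.py > train.log 2>&1 &
disown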


4️⃣ systemd — Production-Grade

For structured environments:

Create a service file:

/etc/systemd/system/ml-training.service
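
A minimal sketch of what that unit might contain (the user, paths, and restart policy below are assumptions; adapt them to your environment):

# /etc/systemd/system/ml-training.service
[Unit]
Description=ML training job
After=network.target

[Service]
Type=simple
User=mluser
WorkingDirectory=/home/mluser/project
ExecStart=/usr/bin/python3 train.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target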

Then:

sudo systemctl daemon-reload
sudo systemctl start ml-training

Benefits:

  • Auto restart on failure
  • Resource control
  • Logging via journalctl
  • Reproducible startup
  • Can run at boot

This is appropriate for:

  • Production ML pipelines
  • Enterprise GPU clusters
  • Persistent inference services

How to Diagnose Your Current Setup

Before disconnecting, check:

ps -o pid,ppid,cmd -p <python_pid>

If the parent process (PPID) is your shell → it will die when SSH disconnects.

Check if you are inside tmux:

echo $TMUX

If empty → you are not using tmux.
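
To see the full ancestry in one command (pstree ships in the psmisc package and may need installing):

pstree -s -p <python_pid>    # print the parent chain, e.g. sshd -> bash -> python

If the chain runs through your interactive bash under sshd, the job dies on logout; if it runs through tmux or systemd, it survives.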


Server Admin Configuration Considerations

From the system administration side, several settings affect whether user processes survive a logout.

1️⃣ SSH Daemon Settings

File:

/etc/ssh/sshd_config

Relevant options:

  • ClientAliveInterval
  • ClientAliveCountMax
  • TCPKeepAlive

These control idle timeout behavior — but they do NOT prevent SIGHUP when a session closes.

They only affect when the connection is dropped.
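
A typical keepalive block looks like this (values are illustrative); it keeps idle connections from being dropped, nothing more:

# /etc/ssh/sshd_config
# Ping the client every 60 seconds; disconnect after 3 missed replies
ClientAliveInterval 60
ClientAliveCountMax 3
TCPKeepAlive yes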


2️⃣ systemd User Session Settings

Modern Linux distributions use systemd-logind.

File:

/etc/systemd/logind.conf
Enter fullscreen mode Exit fullscreen mode

Key option:

KillUserProcesses=yes

If enabled:

  • All user processes are killed when the session ends.
  • Even tmux sessions may die.

To allow persistent processes:

KillUserProcesses=no

After change:

sudo systemctl restart systemd-logind

This setting is critical on shared GPU servers.
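
To check what a given server is doing (the upstream default is KillUserProcesses=no, but distributions and admins can override it):

grep -i '^KillUserProcesses' /etc/systemd/logind.conf
# No output means the line is commented out and the built-in default applies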


3️⃣ Resource Policies

Admins may enforce:

  • cgroups limits
  • SLURM job scheduler policies
  • Idle job cleanup scripts

On managed clusters, processes outside scheduler control may be automatically terminated.

In HPC environments, users should launch jobs via:

  • sbatch
  • srun
  • qsub
  • Kubernetes job manifests

Not direct SSH shells.
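
With SLURM, for example, the job goes into a batch script and is submitted with sbatch. A minimal sketch (the GPU count, time limit, and filenames are assumptions; use your cluster's conventions):

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:1
#SBATCH --time=72:00:00
#SBATCH --output=train_%j.log

python train.py

Once submitted, the scheduler (not your SSH session) owns the process, and logging out has no effect on the job.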


When Admin Intervention Is Required

If:

  • tmux sessions still die on logout
  • background processes terminate unexpectedly
  • training stops even with nohup

Then the admin should verify:

  1. KillUserProcesses setting
  2. PAM session configuration
  3. Custom logout hooks
  4. Cluster job scheduler enforcement
  5. cgroup cleanup policies

On well-configured ML servers, tmux should work without special privileges.


Best Practice for ML Engineers

For experimentation:

  • Always use tmux
  • Enable periodic checkpointing
  • Log to file
  • Monitor GPU usage via nvidia-smi
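
For the monitoring point, a simple option is to keep this running in a second tmux window (the interval is arbitrary):

watch -n 60 nvidia-smi    # refresh GPU utilization and memory readings every 60 seconds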

For production:

  • Use systemd or a scheduler
  • Configure restart policies
  • Use structured logging
  • Implement health checks

Final Takeaway

Your training stops because:

It is attached to a shell that receives SIGHUP when SSH disconnects.

This is expected behavior.

The solution is not “keep SSH open.”

The solution is:

  • Decouple your process from the SSH session.
  • Use tmux, nohup, or systemd appropriately.
  • Ensure systemd-logind is not killing user processes.

Once configured correctly, your deep learning jobs should survive network instability, laptop sleep, and terminal closure — exactly as they should in a robust ML workflow.

