You SSH into a remote server.
You start a long-running deep learning training job:
python train.py
Everything works. GPU utilization is high. Logs are printing.
Then your laptop sleeps.
Or WiFi drops.
Or you close the terminal.
You reconnect to the server…
Your training job is gone.
But on another server, your training continues even after you disconnect.
Why?
This is not random behavior. It is deterministic Unix session management.
Let’s break it down precisely.
The Root Cause: Controlling Terminals and SIGHUP
When you SSH into a Linux server:
- SSH creates a login session.
- A shell process (e.g., bash) is started.
- That shell is attached to a controlling terminal (TTY).
- Any command you run becomes a child process of that shell.
Process tree example:
sshd
└── bash
└── python train.py
When your SSH session ends:
- The terminal disappears.
- The shell receives a SIGHUP (hangup signal).
- The shell forwards SIGHUP to its child processes.
- Your Python training process exits.
This is default POSIX behavior.
The operating system is doing exactly what it is designed to do.
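You can check this relationship before disconnecting. A quick sketch, assuming a standard procps ps; replace <python_pid> with your job's PID:
ps -o pid,pgid,sid,tty,cmd -p <python_pid>   # the job's process group, session, and controlling TTY
ps -o pid,pgid,tpgid,stat,cmd -t $(tty)      # everything attached to your current terminal
If the job's TTY column matches your terminal, the job is tied to this login session and will be signaled when it hangs up.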
Why It Works on Another Server
If training survives logout on another machine, one of these is true:
- It was started inside tmux
- It was started inside screen
- It was launched using nohup
- It was disowned
- It was started as a systemd service
- The SSH daemon is configured differently (rare but possible)
The most common reason: tmux is being used.
Real-World Scenario (Deep Learning Context)
Imagine this:
- You launch a 3-day transformer training run.
- You’re using expensive GPU hardware.
- Model checkpoints save every 2 hours.
- Your internet drops after 4 hours.
Without proper session handling:
- Training dies immediately.
- GPU memory is released.
- You lose progress since the last checkpoint.
- Compute time is wasted.
- Experiment reproducibility is impacted.
In production ML workflows, this is unacceptable.
Proper Solutions (Ranked by Maturity)
1️⃣ tmux — The Research Standard
tmux is a terminal multiplexer that keeps a persistent server running on the machine. Your training process runs inside a tmux session, not directly under your SSH shell.
Install:
sudo apt install tmux
Start session:
tmux new -s training
Run training:
python train.py
Detach safely:
Ctrl + B, then D
Logout freely. Training continues.
Reconnect later:
tmux attach -t training
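If you forget the session name, list all sessions on the server first:
tmux ls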
Why this works:
- The tmux server runs detached from your SSH session and owns the pseudo-terminal your process is attached to.
- SSH disconnection does not kill tmux.
- Your process remains alive.
This is the recommended workflow for ML engineers and researchers.
2️⃣ nohup — Minimal Fix
nohup python train.py > train.log 2>&1 &
What it does:
- Ignores SIGHUP
- Redirects output
- Runs in background
Good for:
- Simple batch jobs
- Fire-and-forget scripts
Downside:
- No interactive recovery
- Harder process management
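Since there is no session to reattach to, monitoring happens through the log file and the process table (paths as in the command above):
tail -f train.log              # follow training output live
pgrep -af "python train.py"    # confirm the process is still alive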
3️⃣ disown — Shell-Level Detachment
python train.py &
disown
This removes the job from the shell's job table, so the shell will not forward SIGHUP to it when the session hangs up.
Less robust than tmux, but it works.
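A slightly more robust variant (a sketch; the log path is arbitrary) redirects output first, because the original terminal will eventually disappear:
python train.py > train.log 2>&1 &
disown -h    # keep the job in the table but tell bash not to forward SIGHUP to it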
4️⃣ systemd — Production-Grade
For structured environments:
Create a service file:
/etc/systemd/system/ml-training.service
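A minimal unit file sketch (the user name, paths, and Python interpreter below are placeholders; adjust them to your environment):
[Unit]
Description=ML training job
After=network.target

[Service]
Type=simple
User=mluser
WorkingDirectory=/home/mluser/project
ExecStart=/usr/bin/python3 /home/mluser/project/train.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target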
Then:
sudo systemctl daemon-reload
sudo systemctl start ml-training
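To confirm it is running and follow its output (unit name as above):
sudo systemctl status ml-training
journalctl -u ml-training -f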
Benefits:
- Auto restart on failure
- Resource control
- Logging via journalctl
- Reproducible startup
- Can run at boot
This is appropriate for:
- Production ML pipelines
- Enterprise GPU clusters
- Persistent inference services
How to Diagnose Your Current Setup
Before disconnecting, check:
ps -o pid,ppid,cmd -p <python_pid>
If the parent process (PPID) is your interactive shell → the job will die when SSH disconnects.
Check if you are inside tmux:
echo $TMUX
If empty → you are not using tmux.
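You can also print the job's full ancestry (assuming pstree from the psmisc package is installed):
pstree -s -p <python_pid>
If sshd and your shell appear as ancestors and tmux or screen does not, the job dies with the session.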
Server Admin Configuration Considerations
From the system administration side, there are several configurations that impact user sessions.
1️⃣ SSH Daemon Settings
File:
/etc/ssh/sshd_config
Relevant options:
- ClientAliveInterval
- ClientAliveCountMax
- TCPKeepAlive
These control idle timeout behavior — but they do NOT prevent SIGHUP when a session closes.
They only affect when the connection is dropped.
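A typical keepalive configuration in that file (values are illustrative) tolerates roughly ten minutes of unanswered probes before the server drops the connection:
ClientAliveInterval 60
ClientAliveCountMax 10
TCPKeepAlive yes
Reload the SSH daemon afterwards; the service unit is named sshd or ssh depending on the distribution.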
2️⃣ systemd User Session Settings
Modern Linux distributions use systemd-logind.
File:
/etc/systemd/logind.conf
Key option:
KillUserProcesses
If it is set to yes:
- All user processes are killed when the session ends.
- Even tmux sessions may die.
To allow persistent processes:
KillUserProcesses=no
After change:
sudo systemctl restart systemd-logind
This setting is critical on shared GPU servers.
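To check how a given server is configured (an unset or commented-out line means the distribution default applies):
grep -i KillUserProcesses /etc/systemd/logind.conf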
3️⃣ Resource Policies
Admins may enforce:
- cgroups limits
- SLURM job scheduler policies
- Idle job cleanup scripts
On managed clusters, processes outside scheduler control may be automatically terminated.
In HPC environments, users should launch jobs via:
- sbatch
- srun
- qsub
- Kubernetes job manifests
Not direct SSH shells.
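For SLURM, for example, a minimal batch script sketch looks like this (job name, GPU count, time limit, and log path are cluster-specific placeholders):
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:1
#SBATCH --time=72:00:00
#SBATCH --output=train_%j.log

python train.py
Submit it with sbatch; the job then runs under the scheduler rather than under your SSH session.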
When Admin Intervention Is Required
If:
- tmux sessions still die on logout
- background processes terminate unexpectedly
- training stops even with nohup
Then the admin should verify:
- KillUserProcesses setting
- PAM session configuration
- Custom logout hooks
- Cluster job scheduler enforcement
- cgroup cleanup policies
On well-configured ML servers, tmux should work without special privileges.
Best Practice for ML Engineers
For experimentation:
- Always use tmux
- Enable periodic checkpointing
- Log to file
- Monitor GPU usage via nvidia-smi (see the example below)
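For the GPU monitoring point above, a simple periodic view is:
watch -n 5 nvidia-smi    # rerun nvidia-smi every 5 seconds to track utilization and memory
Run it in a second tmux window (Ctrl + B, then C) so it survives disconnects along with the training job.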
For production:
- Use systemd or a scheduler
- Configure restart policies
- Use structured logging
- Implement health checks
Final Takeaway
Your training stops because:
It is attached to a shell that receives SIGHUP when SSH disconnects.
This is expected behavior.
The solution is not “keep SSH open.”
The solution is:
- Decouple your process from the SSH session.
- Use tmux, nohup, or systemd appropriately.
- Ensure systemd-logind is not killing user processes.
Once configured correctly, your deep learning jobs should survive network instability, laptop sleep, and terminal closure — exactly as they should in a robust ML workflow.