A Practical Linux and Remote Computing Handbook for Scientists
- Introduction: what this handbook is for
- Foundations: core ideas before using remote Linux systems
- Remote access: connect securely to distant machines
- Files and data movement: move, compress, and organize data
- Shell essentials: understand what the terminal is doing
- Environment variables and PATH: understand where commands come from
- which / type / command -v: inspect which command actually runs
- PS1 prompt: show user, host, and directory clearly
- Prompt distinction: avoid mixing local and remote shells
- Terminal habits: small commands that save time every day
- grep: search logs, configs, and scripts quickly
- Pipes: combine small commands into useful workflows
- Redirection: save output and errors separately
- Exit codes and chaining: react to success or failure
- Background jobs: run commands without blocking the shell
- Working effectively on remote machines: stay productive once connected
- Environments, scripting, and reproducibility: make work easier to rerun and trust
- Cluster-specific topics: what changes on shared HPC systems
- Safety: reduce the risk of costly command-line mistakes
- Final thoughts: why these skills matter early
Introduction
Scientific work rarely happens on a single machine anymore. You may write code on your laptop, preprocess data on a workstation, launch long computations on a remote server, visualize results on another machine, and submit heavy jobs to a cluster. Even when the science itself is clear, the operational side can still be messy: where am I working right now? Which machine is running this code? Why does a file exist on one system but not another? Why does something work on campus but not from home? Why did my process stop when my connection dropped?
Many researchers are never taught these things systematically. They learn fragments from colleagues, old lab notes, shell history, or trial and error. The result is often functional but fragile. People get by, but they waste time and make avoidable mistakes.
This short handbook-style article is meant to fill that gap. It is not a Linux course and not a system administration guide. It is a practical handbook for scientists who work across laptops, local Linux machines, remote servers, institutional infrastructure, and sometimes clusters. Most of the ideas apply broadly.
Each section follows the same logic: why it matters, how it works, a practical example, and a common mistake.
Local machine vs remote machine
Why it matters
Many beginner mistakes stem from forgetting whether a command is running on the local computer or on another machine. That context determines:
- Where files are located
- Which software is installed
- Where outputs are written
- Which CPU and memory are being used
- What a destructive command will actually affect
A command like rm, mv, python,
or tar is not dangerous or safe by itself. It depends
entirely on where you run it.
How it works
Your local machine is the laptop or desktop in front of you. A remote machine is another computer you access through the network. Once you connect to it through SSH, commands typed in that shell are executed on the remote side.
Two commands are especially useful:
pwd
hostname
pwd shows the current directory.
hostname shows which machine you are on.
If you feel unsure, these commands restore context immediately.
Practical example
You start on your laptop:
hostname
which may return:
thinkpad
Then you connect:
ssh myserver
and now:
hostname
may return:
server01
From that point onward, commands are acting on server01,
not on your laptop.
Common mistake
Thinking that a terminal window belongs to one machine forever. It does not. The terminal is only the interface. The actual execution context is defined by the shell session you are in.
SSH
Why it matters
SSH is the standard tool for secure command-line access to a remote Linux machine. If you use a lab workstation, departmental server, personal server, or cluster login node, SSH is usually the main entry point.
Without SSH literacy, remote scientific computing remains uncomfortable.
How it works
SSH opens a secure shell session on another machine. The basic form is:
ssh username@server.univ.fr
This means: connect to server.univ.fr using the account
username.
Once connected, commands are executed remotely. SSH is also used
underneath many other tools such as scp,
rsync, sftp, and some remote IDE
integrations.
Practical example
ssh bgallois@labserver.univ.fr
After authentication, you receive a remote shell prompt. If you run
ls, pwd, or python, they now
refer to the remote system.
Common mistake
Treating SSH like a passive window into another machine. It is not passive. It gives you an active shell on that machine.
SSH config
Why it matters
Manually typing full SSH commands again and again is inconvenient and
error-prone. It gets worse when different machines require different
usernames, ports, or key files. SSH has a built-in solution for this:
~/.ssh/config.
How it works
You define named host entries in ~/.ssh/config. For
example:
Host myserver
HostName server.univ.fr
User bgallois
IdentityFile ~/.ssh/id_server
Then you can connect with:
ssh myserver
instead of typing the full address and username each time.
The advantage extends beyond just SSH. The same host nickname works with other SSH-based tools as well.
Practical example
With the configuration above, all of these become valid:
ssh myserver
scp file.txt myserver:/data/
rsync -avP results/ myserver:/data/results/
sftp myserver
Common mistake
Using only bash aliases for SSH shortcuts. Aliases can simplify a
single shell command, but ~/.ssh/config is the proper
mechanism because it works across the SSH ecosystem.
SSH keys
Why it matters
Password-based login works, but SSH keys are usually better. They are more convenient for frequent access and are standard practice in many technical environments.
How it works
SSH key authentication uses two files:
- a private key that stays on your own machine
- a public key that you place on the remote machine
A modern key pair can be created like this:
ssh-keygen -t ed25519 -C "your.email@lab.fr"
That usually creates:
~/.ssh/id_ed25519
~/.ssh/id_ed25519.pub
The .pub file is the public key. The non-.pub
file is the private key and must remain private.
Practical example
Generate a key:
ssh-keygen -t ed25519 -C "your.email@lab.fr"
Then copy the public key to the remote machine:
ssh-copy-id myserver
After that, SSH can automatically authenticate with the key.
Common mistake
Mixing up public and private keys. The public key is meant to be
copied to remote machines. The private key should stay private and
local. Another common issue is incorrect permissions on
~/.ssh, which can cause SSH to reject a key.
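The permission fix is worth knowing by heart. A minimal sketch, run here on a throwaway directory so nothing in your real ~/.ssh is touched (the key filenames are only examples):

```shell
#!/usr/bin/env bash
# Sketch: the permissions SSH expects, demonstrated on a scratch
# directory instead of the real ~/.ssh. All paths are examples.
demo=$(mktemp -d)

mkdir "$demo/.ssh"
touch "$demo/.ssh/id_ed25519"       # stand-in for a private key
touch "$demo/.ssh/id_ed25519.pub"   # stand-in for a public key

chmod 700 "$demo/.ssh"                  # only the owner may enter
chmod 600 "$demo/.ssh/id_ed25519"       # private key: owner read/write only
chmod 644 "$demo/.ssh/id_ed25519.pub"   # public key may be world-readable

stat -c '%a %n' "$demo/.ssh" "$demo/.ssh/id_ed25519"
```

On a real machine, the same chmod calls apply to ~/.ssh and the key files themselves; SSH is strict about these modes and may refuse a key otherwise.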
VPN
Why it matters
Sometimes SSH is configured correctly, but still fails from home. The reason may be that the remote machine is accessible only from within the institution's network. In that case, a VPN may be required before SSH can even reach the machine.
How it works
A VPN, or Virtual Private Network, provides your device with secure access to the institution's internal network. It does not replace SSH. It only makes internal resources reachable.
So in many environments, the workflow is:
- Connect to the VPN
- Then use SSH to access the machine
Practical example
At home, you start the university VPN client and authenticate. Once connected, this may work:
ssh myserver
Without the VPN, the server may not be reachable at all.
Common mistake
Confusing VPN and SSH. VPN gives access to a network. SSH gives access to a shell on one machine. They solve different problems.
File transfer with scp
Why it matters
Scientific work constantly involves moving files: code, scripts, logs,
intermediate outputs, figures, datasets, and results.
scp is a simple and reliable command for that.
How it works
scp copies files over SSH.
Upload a file:
scp localfile.txt myserver:/path/to/remote/
Download a file:
scp myserver:/path/to/remote/result.txt .
The final "." means the current local directory.
For directories, use -r:
scp -r myfolder myserver:/path/to/remote/
Practical example
Upload a script:
scp analysis.py myserver:/home/bgallois/project/
Download a result file:
scp myserver:/home/bgallois/project/output.csv .
Common mistake
Forgetting which side is local and which side is remote. In
scp, a path with host: is remote. A path
without it is local.
File synchronization with rsync
Why it matters
scp is fine for simple copies, but it is not ideal for
large folders, repeated transfers, or transfers that are interrupted.
In those cases, rsync is often the best tool.
How it works
rsync synchronizes files and directories efficiently,
usually over SSH. It can skip unchanged files and resume interrupted
transfers.
A common pattern is:
rsync -avP data/ myserver:/path/to/data/
Useful flags:
- -a: archive mode
- -v: verbose
- -P: progress plus partial transfer support
Practical example
Upload a results folder:
rsync -avP results/ myserver:/scratch/project/results/
Download outputs:
rsync -avP myserver:/scratch/project/outputs/ ./outputs/
Common mistake
Misunderstanding the trailing slash. results and
results/ do not mean the same thing in
rsync. One copies the directory itself. The other copies
its contents.
Compression and archiving
Why it matters
Researchers often need to bundle outputs, compress results, prepare transfers, or archive older work. Compression and archiving tools are basic but very useful.
How it works
A common Linux tool is tar. It creates archives and
compresses files.
Create a compressed archive:
tar -czf results.tar.gz results/
Extract it:
tar -xzf results.tar.gz
Compress a single file with gzip:
gzip file.txt
Decompress it:
gunzip file.txt.gz
Practical example
Before downloading a folder containing many small files from a server, you may archive it first:
tar -czf outputs.tar.gz outputs/
scp myserver:/path/outputs.tar.gz .
This is often cleaner than copying thousands of small files individually.
Common mistake
Confusing archiving and compression. tar groups files
together. Compression, such as gzip, reduces size. With
.tar.gz, both ideas are combined.
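The distinction can be checked directly: tar -t lists an archive's contents without extracting anything. A self-contained sketch using throwaway files in a temporary directory:

```shell
#!/usr/bin/env bash
# Sketch: archive a folder, inspect it, extract it elsewhere.
# All filenames are throwaway examples created in a temp directory.
cd "$(mktemp -d)"

mkdir results
echo "42" > results/a.txt
echo "43" > results/b.txt

tar -czf results.tar.gz results/   # -c create, -z gzip, -f filename
tar -tzf results.tar.gz            # -t lists contents without extracting

mkdir elsewhere
tar -xzf results.tar.gz -C elsewhere   # -C extracts into another directory
cat elsewhere/results/a.txt            # prints: 42
```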
Environment variables and PATH
Why it matters
A huge amount of shell confusion comes from environment variables,
especially PATH. Many beginners wonder why a command
works in one terminal but not another, or why the wrong version of a
program is being used.
How it works
An environment variable is a named value inherited by processes.
PATH is a special variable containing a list of
directories searched when you type a command.
To inspect it:
echo $PATH
If you type python, the shell looks through the
directories listed in PATH until it finds an executable
named python.
You can define your own environment variable temporarily:
export MYVAR=value
That variable exists in the current shell and in child processes spawned from it.
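This inheritance rule can be observed directly. A minimal sketch (the variable names are arbitrary examples):

```shell
#!/usr/bin/env bash
# Sketch: exported variables are inherited by child processes;
# plain shell variables are not. Variable names are arbitrary.

PLAIN_VAR="only in this shell"            # not exported
export EXPORTED_VAR="visible to children"

# Each bash -c below starts a separate child process.
bash -c 'echo "child sees EXPORTED_VAR: ${EXPORTED_VAR:-<unset>}"'
bash -c 'echo "child sees PLAIN_VAR:    ${PLAIN_VAR:-<unset>}"'
```

The first child prints the exported value; the second sees nothing, because the unexported variable never left the parent shell.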
Practical example
Check where the shell will search for executables:
echo $PATH
Set a variable:
export PROJECT_ROOT=$HOME/project
echo $PROJECT_ROOT
Append a custom directory to PATH:
export PATH="$HOME/bin:$PATH"
If you want that change to persist, place it in
~/.bashrc.
Common mistake
Editing PATH in one shell and assuming the change is
permanent. It is not, unless you add it to a startup file such as
~/.bashrc.
which, type, and command -v
Why it matters
When several versions of a program exist, or aliases and shell functions are involved, it is important to know which command will actually run.
How it works
Useful inspection commands include:
which python
command -v python
type python
command -v and type are generally more
reliable in shell contexts than which, especially when
aliases or functions are involved.
Practical example
Suppose python behaves unexpectedly. Check what it
actually refers to:
type python
command -v python
You may discover it is an alias, a virtual environment executable, or a system binary.
Common mistake
Assuming that typing a command name always refers to the same program on every machine or in every shell.
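A short demonstration of how a name can stop meaning what you expect. Here a shell function deliberately shadows ls inside one script; nothing on the system is changed:

```shell
#!/usr/bin/env bash
# Sketch: a shell function can shadow a real command, and "type"
# reveals what will actually run. The shadowing exists only inside
# this script.

ls() { echo "this is a function, not the real ls"; }

type -t ls        # prints: function
ls                # runs the function, not the binary
command ls / >/dev/null && echo "command ls bypassed the function"

unset -f ls       # remove the shadowing function again
type -t ls        # prints: file
```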
Permissions and ownership
Why it matters
Many Linux errors boil down to permissions. If you do not understand basic ownership and access bits, messages like "Permission denied" will feel arbitrary.
How it works
A basic listing with permissions is:
ls -l
You may see something like:
-rwxr-xr-- 1 bgallois lab 1234 Mar 29 10:30 script.sh
The permission bits are grouped for:
- owner
- group
- others
r means read, w means write, and
x means execute.
Useful commands include:
chmod +x script.sh
chmod 600 private.txt
chown user:group file
Practical example
Make a script executable:
chmod +x run_analysis.sh
Then run it directly:
./run_analysis.sh
Common mistake
Using chmod 777 as a panic solution. That is usually
excessive and often the wrong response. It is better to understand
what access is actually needed.
Symbolic links
Why it matters
Scientific workflows often involve large datasets or repeated directory structures. Symbolic links can avoid unnecessary duplication and make layouts easier to manage.
How it works
A symbolic link is a reference to another path. Create one with:
ln -s /real/path shortcut_name
Inspect where it points:
readlink -f shortcut_name
A symlink is not a copy of the data. It is a pointer.
Practical example
Create a shortcut to a large dataset stored elsewhere:
ln -s /data/shared/cryoem_dataset dataset
Now the path dataset points to the original location
without duplicating the files.
Common mistake
Thinking that a symlink is an independent copy. If the original target disappears, the link becomes broken.
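The pointer behavior, including how a link breaks, can be demonstrated with throwaway files:

```shell
#!/usr/bin/env bash
# Sketch: a symlink is a pointer, and it breaks when its target
# disappears. All filenames are throwaway examples.
cd "$(mktemp -d)"

echo "large dataset" > original.dat
ln -s original.dat dataset        # create the link
readlink dataset                  # prints: original.dat
cat dataset                       # reads through the link

rm original.dat                   # remove the target
[ -L dataset ] && echo "the link itself still exists"
[ -e dataset ] || echo "but it no longer points to anything"
```

Note the two tests: -L checks whether the path is a symlink, while -e follows the link and fails once the target is gone.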
Shell prompt customization with PS1
Why it matters
The shell prompt is not only cosmetic. It is context. A good prompt reduces mistakes by telling you who you are, where you are, and sometimes which machine or environment you are using.
How it works
In bash, the prompt is controlled by the PS1 variable. A
useful prompt is:
export PS1="\u@\h:\w$ "
This includes:
- \u: username
- \h: hostname
- \w: current directory
If placed in ~/.bashrc, it is applied automatically to
future shells.
Practical example
With that setting, your prompt might look like:
bgallois@thinkpad:~/work$
This is already much more informative than a bare $.
Common mistake
Treating PS1 customization as pure aesthetics. In real
workflows, it is a practical safety feature.
Making local and remote prompts visually different
Why it matters
When you juggle several terminal windows, local and remote sessions can start to look identical. That makes mistakes much more likely.
How it works
Use different prompts on local and remote machines. Even a simple label helps.
Practical example
On your local machine:
export PS1="[LOCAL] \u@\h:\w$ "
On a remote server:
export PS1="[REMOTE] \u@\h:\w$ "
Then you may see:
[LOCAL] bgallois@thinkpad:~/work$
[REMOTE] bgallois@server01:/scratch/project$
Common mistake
Leaving local and remote prompts identical and assuming memory alone will prevent confusion.
Useful terminal habits
Why it matters
Many productivity gains come from very small habits rather than advanced tools.
How it works
Use shell history search with Ctrl+r. Use tab completion
aggressively. Learn a few inspection commands you will reuse daily.
Helpful commands include:
ls -lh
du -sh myfolder
df -h
find . -name "*.log"
grep "ERROR" logfile.txt
tail -f logfile.txt
head file.txt
wc -l data.csv
less bigfile.txt
sort names.txt
uniq repeated.txt
Practical example
Check folder size:
du -sh results/
Watch a running log:
tail -f simulation.log
Count lines in a file:
wc -l particles.xmd
Common mistake
Retyping everything manually instead of using shell history, completion, and small helper commands.
Text editors on remote machines
Why it matters
Sooner or later, you SSH into a machine and need to edit a config file, script, or job submission file. If you cannot edit from the terminal, you are stuck.
How it works
For beginners, nano is often the easiest terminal editor
to start with. vim is powerful but has a steeper learning
curve.
Open a file with nano:
nano script.sh
Practical example
Edit your SSH config:
nano ~/.ssh/config
Or create a quick batch script:
nano job.sh
Common mistake
SSHing into a machine and then realizing you cannot modify a file because you know no terminal editor.
tmux or screen
Why it matters
Network connections are not perfectly stable. If you do important work in a plain SSH session, you will eventually lose it because of Wi-Fi issues, VPN reconnects, or laptop sleep.
How it works
tmux and screen create persistent terminal
sessions on the remote side. If the connection breaks, the session
continues to run.
With tmux:
tmux new -s work
tmux attach -t work
tmux ls
Practical example
You SSH into a remote workstation, start tmux, and run a
long preprocessing command. Your home network drops briefly. Without
tmux, the shell dies. With tmux, you
reconnect and reattach.
Common mistake
Running important long-lived work directly in a fragile SSH session and assuming the network will not fail.
Process management
Why it matters
On both local and remote machines, you need to know what is running, how much it uses, and how to stop it if necessary.
How it works
Useful commands include:
ps aux
top
htop
kill PID
kill -9 PID
ps aux lists processes.
top and htop provide live views.
kill PID asks a process to terminate.
kill -9 PID forcefully kills it and should be used with
care.
Practical example
Find a stuck Python process:
ps aux | grep python
Then stop it gracefully:
kill 12345
Only use kill -9 if a normal signal does not work.
Common mistake
Using kill -9 immediately for everything. It is forceful
and bypasses normal cleanup.
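The difference shows up in the exit status. A sketch using sleep as a stand-in for a stuck process:

```shell
#!/usr/bin/env bash
# Sketch: terminate a process with a normal TERM signal and observe
# its exit status. "sleep" stands in for a stuck process.

sleep 100 &          # a long-running stand-in process
pid=$!               # PID of the most recent background job

kill "$pid"          # polite request to terminate (SIGTERM)

status=0
wait "$pid" || status=$?     # collect the exit status
echo "exit status: $status"  # 128 + 15 (SIGTERM) = 143
```

A process killed by a signal conventionally exits with 128 plus the signal number, which is one way logs reveal that a job was killed rather than finished.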
Background jobs
Why it matters
The shell can run processes in the foreground or background. Understanding this is useful even outside clusters, especially for remote sessions.
How it works
A command followed by & starts in the background:
python script.py &
Useful commands:
jobs
fg
bg
jobs lists background jobs in the current shell.
fg brings one back to the foreground.
bg resumes a stopped job in the background.
Practical example
Start a long-running script, but keep using the same terminal:
python preprocess.py &
jobs
Later bring it back:
fg
Common mistake
Thinking a background job is safe from terminal closure. If the shell
exits, the job may still die unless you use something like
tmux, screen, or another persistence
mechanism.
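One such persistence mechanism is nohup, which makes a command ignore the hangup signal sent when a terminal closes. A sketch, with a trivial command standing in for a real long-running script:

```shell
#!/usr/bin/env bash
# Sketch: nohup detaches a command from the terminal's hangup
# signal, so a background job can outlive the session. A trivial
# command stands in for a real long job here.
cd "$(mktemp -d)"

nohup bash -c 'sleep 1; echo "finished"' > run.log 2>&1 &
pid=$!

# The typical real-world form of the same idea is:
#   nohup python preprocess.py > run.log 2>&1 &

wait "$pid"       # only possible here because we stayed in the same shell
cat run.log       # prints: finished
```

For interactive work, tmux or screen is usually more comfortable; nohup shines for fire-and-forget commands.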
Exit codes and command chaining
Why it matters
A shell command usually reports success or failure through an exit code. Understanding that helps when debugging and writing scripts.
How it works
After a command, inspect the exit code with:
echo $?
Conventionally, 0 indicates success and nonzero values
indicate failure.
Shell chaining operators are also very useful:
cmd1 && cmd2
cmd1 || echo "failed"
cmd2 runs after && only if
cmd1 succeeded.
Practical example
Run the second step only if the first step works:
mkdir results && cp output.txt results/
Common mistake
Ignoring the exit status and assuming a command worked because it produced some output.
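All of this can be seen in a few lines. The sketch below uses true and false, two real commands that simply succeed or fail:

```shell
#!/usr/bin/env bash
# Sketch: inspecting exit codes and reacting to them. "true" and
# "false" are real commands that simply succeed or fail.
cd "$(mktemp -d)"

true
echo "exit code of true: $?"         # prints 0

status=0
false || status=$?                   # capture the failure without stopping
echo "exit code of false: $status"   # prints 1

mkdir -p results && echo "mkdir succeeded, so this runs"
false || echo "this runs only because the previous command failed"
```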
Standard output, standard error, and redirection
Why it matters
Understanding output streams is essential for logs, scripts, and debugging. Many beginners do not know why some messages still appear on the screen even after output redirection.
How it works
Programs usually write normal output to standard output and error messages to standard error.
Redirect standard output:
python script.py > out.txt
Redirect standard error:
python script.py 2> err.txt
Redirect both to one file:
python script.py > all.txt 2>&1
Practical example
Run a script and save errors separately:
python analysis.py > analysis.out 2> analysis.err
Now, normal logs and error logs are distinct.
Common mistake
Redirecting only standard output and assuming all messages will be captured. Error messages may still go to the terminal unless standard error is redirected too.
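A self-contained sketch makes the two streams visible. The inline function below is an invented stand-in for a program that writes one line to each stream:

```shell
#!/usr/bin/env bash
# Sketch: stdout and stderr are separate streams. "write_both" is an
# invented stand-in for a program that writes to both.
cd "$(mktemp -d)"

write_both() {
  echo "normal output"          # goes to stdout
  echo "error message" >&2      # goes to stderr
}

write_both >  out.txt           # captures only stdout
write_both 2> err.txt           # captures only stderr; stdout hits the screen
write_both >  all.txt 2>&1      # captures both in one file

grep -c "" out.txt all.txt      # line counts: out.txt has 1, all.txt has 2
```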
Searching text with grep
Why it matters
A large part of scientific computing involves text files: logs, configuration files, scripts, metadata files, CSV-like outputs, scheduler logs, and software messages. Being able to search quickly inside text is one of the most useful command-line skills.
grep lets you search for lines containing a pattern. It
is simple, fast, and extremely useful for debugging and inspection.
How it works
The basic form is:
grep "pattern" file.txt
This prints lines from file.txt that contain the pattern.
Some very useful options are:
- -i for case-insensitive search
- -n to show line numbers
- -r to search recursively in directories
- -v to invert the match
- -E for extended regular expressions
Practical example
Search for errors in a log file:
grep -i "error" job.err
Search recursively for a parameter name in a project folder:
grep -rn "learning_rate" .
Exclude commented lines from a config file:
grep -v "^#" config.txt
Find lines matching one of several words:
grep -E "ERROR|WARNING|FAILED" logfile.txt
Common mistake
Treating grep as if it only works for exact literal
words. In practice, it is much more flexible, especially with options
like -i, -r, and -E.
Combining commands with the pipe |
Why it matters
A lot of command-line efficiency comes from chaining small tools
rather than searching for a single giant command that does everything.
The pipe operator | lets you send the output of one
command directly into another. This is extremely useful for filtering,
counting, sorting, and searching.
It is especially helpful when working with logs, process lists, CSV-like text files, and general command output.
How it works
The basic form is:
command1 | command2
This means: run command1, then pass its standard output
to command2.
You can chain several commands together:
command1 | command2 | command3
That creates a pipeline in which each command transforms or filters the output from the previous one.
Practical example
Search for a file in a directory listing:
ls -l | grep "report"
Search for running Python processes:
ps aux | grep python
Count how many error lines appear in a log:
grep "ERROR" logfile.txt | wc -l
Show only the first matching lines:
grep "ATOM" structure.pdb | head
Show only the last matching lines:
grep "ERROR" job.err | tail
Count unique values in the first column of a CSV-like file:
cut -d',' -f1 data.csv | sort | uniq -c
Common mistake
Using a pipe when the second command already accepts a filename directly. For example:
grep "x" file.txt
is usually better than:
cat file.txt | grep "x"
The second form works, but it is unnecessary here.
Another common mistake is forgetting that a pipe passes only standard output, not standard error.
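When error messages must go through the pipe as well, merge stderr into stdout first with 2>&1. A sketch, with an invented stand-in for a chatty program:

```shell
#!/usr/bin/env bash
# Sketch: a plain pipe forwards only standard output. To filter
# error messages too, merge stderr into stdout before the pipe.
# "noisy" is an invented stand-in for a chatty program.

noisy() {
    echo "DATA line"                    # stdout
    echo "ERROR something broke" >&2    # stderr
}

noisy 2>/dev/null | grep -c "ERROR" || true   # prints 0: stderr bypassed the pipe
noisy 2>&1 | grep -c "ERROR"                  # prints 1: stderr was merged first
```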
Logs and diagnostics
Why it matters
When something fails, logs are often the fastest route to the answer. Guessing is slower than reading what the system already reported.
How it works
Use tools like:
less job.out
less job.err
tail -n 50 job.err
grep -i error job.err
These help inspect large logs and find relevant messages.
Practical example
A script failed on a remote server. Instead of rerunning mindlessly:
tail -n 30 analysis.err
You may immediately see "file not found", "permission denied", or a Python traceback.
Common mistake
Treating logs as something only advanced users read. Logs are a core diagnostic tool for everyone.
Disk usage and file hygiene
Why it matters
Scientific workflows create lots of files. Poor file hygiene leads to full disks, confusion, duplicated data, and an irreproducible mess.
How it works
Useful commands include:
du -sh .
du -sh results/*
df -h
find . -type f | wc -l
Organize outputs clearly. Separate raw data, processed data, and temporary files.
Practical example
Before launching a workflow on a remote server, check available space:
df -h
Then inspect old output sizes:
du -sh old_results/
Common mistake
Ignoring storage until a workflow crashes due to a full filesystem.
VNC and remote graphical access
Why it matters
Not all scientific work is purely command-line based. Some workflows require a GUI, a visualization program, or a remote desktop.
How it works
VNC provides remote desktop access. Instead of only giving a shell, it gives access to a graphical desktop session on the remote machine.
SSH and VNC solve different problems:
- SSH gives command-line access
- VNC gives graphical desktop access
VNC may itself require VPN or SSH tunneling, depending on the infrastructure.
Practical example
A lab workstation hosts microscopy software or a visualization tool. You connect through VPN, then use a VNC client to access the desktop remotely.
Common mistake
Treating VNC as a general replacement for SSH. Usually, it is not. SSH remains the main tool for shell-based work.
Modules, conda, and environments
Why it matters
Many scientific problems that look mysterious are actually environment problems. Wrong Python or library version, missing executable, or conflicting dependencies: these are extremely common.
How it works
On shared systems, environment modules are often used:
module avail
module load python
module list
For Python projects, isolated environments are also common, for
example, with venv or conda.
A virtual environment with Python's built-in tools:
python -m venv env
source env/bin/activate
Conda environments follow a similar idea, though with different commands.
Practical example
On a shared server, you may first load a system Python module:
module load python
Then activate a project-specific environment:
source ~/project/env/bin/activate
Now the shell uses the software versions intended for that project.
Common mistake
Installing everything into one global environment and hoping all projects will remain compatible forever.
Remote server vs cluster
Why it matters
Not every remote Linux machine should be used in the same way. A personal server, lab workstation, and HPC cluster have different expectations.
How it works
A remote server or workstation may be a single machine where you SSH in and run your work directly. A cluster usually has login nodes, compute nodes, and a scheduler to manage shared resources.
Most habits in this handbook apply to both. The main difference is how heavy compute tasks should be launched.
Practical example
On a lab workstation, you may SSH in and run a Python script directly. On a cluster, you usually prepare a job submission script and let the scheduler run it on a compute node.
Common mistake
Assuming all remote infrastructure works like a personal machine.
Scheduler basics for cluster users
Why it matters
This is the most cluster-specific part. If you use a cluster, you need at least basic scheduler literacy.
How it works
A scheduler such as Slurm manages shared resources. You request time, CPUs, and memory, and then wait for the job to run on suitable compute nodes.
Useful basic commands:
sbatch myjob.sh
squeue -u $USER
scancel 123456
A minimal Slurm script might look like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
python myscript.py
Practical example
Save the script as myjob.sh and submit it:
sbatch myjob.sh
Then monitor the queue:
squeue -u $USER
Common mistake
Running heavy workloads directly on a login node instead of submitting them properly through the scheduler.
Understanding commands, options, and arguments
Why it matters
Many shell commands follow a similar pattern, but beginners often treat them as arbitrary strings to memorize. In reality, most commands are built from the same pieces: the command name itself, optional flags or options, and one or more arguments.
Understanding that structure makes the command line much easier to read, modify, and debug.
How it works
A typical command looks like this:
command [options] [arguments]
- The command is the program to run
- The options modify its behavior, often starting with - or --
- The arguments are the objects the command acts on, such as files, directories, patterns, or values
For example:
grep -i "error" logfile.txt
Here:
- grep is the command
- -i is an option, meaning case-insensitive search
- "error" is an argument: the search pattern
- logfile.txt is another argument: the file to search
Another example:
scp -r results/ myserver:/data/
Here:
- scp is the command
- -r is an option, meaning recursive copy
- results/ is an argument: the source
- myserver:/data/ is another argument: the destination
Some commands also support long options, for example:
ls --human-readable
though short forms such as -h are often more common.
Practical example
Consider:
tar -czf archive.tar.gz results/
This can be read as:
- tar → the command
- -c → create an archive
- -z → compress with gzip
- -f → use the following filename for the archive
- archive.tar.gz → the archive filename
- results/ → the folder to archive
Once you read commands this way, they stop looking random.
Common mistake
Treating everything after the command as an undifferentiated string. That makes it harder to understand what can be changed safely.
Another common mistake is forgetting that the order of arguments can matter. For example, in:
cp source.txt destination.txt
The first argument is the source, and the second is the destination. Reversing them changes the meaning.
Script arguments: $1, $2, and "$@"
Why it matters
Shell scripts often need input values such as a filename, a directory, or a parameter. Instead of hardcoding them, scripts can accept command-line arguments.
How it works
If you run:
bash myscript.sh input.txt output.txt
then inside the script:
- $1 is input.txt
- $2 is output.txt
- "$@" means all arguments passed to the script
Practical example
#!/bin/bash
echo "Input file: $1"
echo "Output file: $2"
Run it with:
bash myscript.sh data.csv results.txt
Common mistake
Using $1, $2, and so on without checking
whether the user actually provided enough arguments.
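The fix is to check $#, the number of arguments received, before touching $1 or $2. A sketch, written as a function so it can run in place; in a real script the same check goes at the top of the file:

```shell
#!/usr/bin/env bash
# Sketch: check the argument count before using $1 and $2.
# "process" is an invented name for this demonstration.

process() {
    if [ "$#" -lt 2 ]; then
        echo "usage: process input_file output_file" >&2
        return 1
    fi
    echo "input:  $1"
    echo "output: $2"
}

process data.csv results.txt                  # enough arguments: runs normally
process data.csv || echo "rejected: missing argument"
```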
Writing scripts instead of repeating commands manually
Why it matters
If you repeat the same sequence of commands several times, typing them manually is both slower and more error-prone than writing a small script.
How it works
Put repeated workflows into a shell script. This makes them explicit, reusable, and easier to reproduce later.
Practical example
Instead of repeatedly typing:
source ~/env/bin/activate
python preprocess.py
python analyze.py
put them into run_analysis.sh:
#!/bin/bash
source ~/env/bin/activate
python preprocess.py
python analyze.py
and run:
bash run_analysis.sh
Common mistake
Relying on memory or old shell history as the only documentation of a workflow.
Reproducibility and workflow logging
Why it matters
Scientific computing is not only about running commands successfully once. It is also about understanding and reproducing what was done later.
How it works
Keep scripts, record important parameters, note software versions, and store the project structure deliberately. Small habits go a long way:
- Keep run scripts under version control
- Write a short README.md in the project directories
- Record software versions when relevant
Useful commands include:
python --version
git rev-parse HEAD
Practical example
After a successful analysis, store the script used, the command line, the Git commit hash, and the environment version to avoid relying on memory.
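That habit can itself be a tiny script. A sketch that appends a provenance record after a run; the log filename and format are arbitrary choices, not a standard:

```shell
#!/usr/bin/env bash
# Sketch: append a minimal provenance record after a run. The log
# filename, format, and example command are arbitrary choices.
cd "$(mktemp -d)"

log=run_log.txt
{
    echo "date:    $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "host:    $(hostname)"
    echo "command: python analyze.py --alpha 0.1"   # the command you ran
    # Record the commit hash only if git exists and this is a repository:
    command -v git >/dev/null && git rev-parse HEAD 2>/dev/null \
        | sed 's/^/commit:  /'
    echo "---"
} >> "$log"

cat "$log"
```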
Common mistake
Treating reproducibility as something to think about only when writing the paper.
Git basics
Why it matters
Version control is one of the most valuable skills many young researchers lack. Even basic Git knowledge makes code, notes, and scripts safer and easier to evolve.
How it works
Git tracks changes to files over time. At a minimum, it is useful for:
- scripts
- source code
- configuration files
- small documentation files
It is usually not ideal for very large raw datasets.
Common basic commands include:
git init
git status
git add script.py
git commit -m "Add preprocessing script"
A .gitignore file helps exclude generated junk and large
temporary outputs.
Practical example
Initialize a Git repository in your analysis folder and commit scripts and configuration files as they evolve.
Common mistake
Either using no version control at all, or trying to version huge generated datasets directly without planning.
Safer destructive commands
Why it matters
A single destructive command in the wrong place can cause serious damage, especially on remote machines. Good habits reduce that risk.
How it works
Before deleting or moving important things, confirm context:
pwd
ls
Be careful with wildcards. Understand what variables expand to. Test commands on dummy files if unsure.
Sometimes interactive variants are useful:
rm -i file.txt
Practical example
Before deleting a folder on a remote machine:
pwd
ls
rm -r old_results
Those first two commands are a simple but effective safety check.
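Another defensive habit is to park files in a trash directory instead of deleting them outright. A sketch; safe_rm and TRASH_DIR are hypothetical names invented for this example, not standard tools:

```shell
#!/usr/bin/env bash
# Sketch: a "soft delete" that moves files into a trash directory
# instead of removing them. "safe_rm" and TRASH_DIR are hypothetical
# names; demonstrated on throwaway files in a temp directory.
cd "$(mktemp -d)"

safe_rm() {
    local trash="${TRASH_DIR:-$HOME/.trash}"
    mkdir -p "$trash"
    mv -- "$@" "$trash"/
}

TRASH_DIR=./trash                  # keep the demo self-contained
echo "precious results" > results.txt
safe_rm results.txt

ls trash/                          # prints: results.txt
[ ! -e results.txt ] && echo "gone from the working directory, but recoverable"
```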
Common mistake
Typing rm -rf quickly from habit, without confirming
location and target first.
PGP keys and signed Git commits
Why it matters
When you sign a Git commit or tag, you add cryptographic proof that your key really created it. This is useful when you want to verify authorship, especially in collaborative or public projects.
For many small personal research projects, unsigned commits are completely fine. But for shared codebases, public repositories, releases, or long-lived scientific software projects, signed commits and signed tags can be a useful trust and traceability mechanism.
It also helps distinguish between authentication and authorship. SSH keys help you access machines and services. PGP keys help you sign work.
How it works
PGP uses a public/private key pair, just like other public-key systems.
- the private key stays on your machine
- the public key can be shared
Git can use a PGP key to sign commits and tags. Platforms such as GitHub or GitLab can then show that a commit is verified if the matching public key has been added to your account.
Typical workflow:
- Generate a PGP key
- Tell Git which key to use
- Enable commit signing
- Optionally upload the public key to your Git hosting account
A common GnuPG command to generate a key is:
gpg --full-generate-key
List available secret keys:
gpg --list-secret-keys --keyid-format LONG
Configure Git to use one:
git config --global user.signingkey YOURKEYID
git config --global commit.gpgsign true
Then a normal commit will be signed automatically, or you can sign explicitly with:
git commit -S -m "Add preprocessing script"
You can also sign annotated tags, which is often especially useful for releases:
git tag -s v1.0.0 -m "Version 1.0.0"
Practical example
Generate a key:
gpg --full-generate-key
Find the key ID:
gpg --list-secret-keys --keyid-format LONG
Configure Git:
git config --global user.signingkey YOURKEYID
git config --global commit.gpgsign true
Make a signed commit:
git commit -S -m "Add analysis script"
Common mistake
Confusing SSH keys and PGP keys. They solve different problems. SSH keys are mainly for authentication to systems and services. PGP keys are used for signing and verification.
Another common mistake is enabling commit signing before the key is properly configured in Git or uploaded to the hosting platform, which can make verification appear to fail even though the commit was signed locally.
Final thoughts
Linux and remote computing are often presented either as trivial knowledge everyone should already know or as arcane knowledge reserved for specialists. In practice, the truth is simpler. There is a middle ground of practical skills that every scientist benefits from learning.
None of these skills are glamorous. But they are high-leverage. They save time, reduce mistakes, and make you more autonomous across laptops, remote servers, workstations, and clusters.
That is why they are worth learning early and teaching clearly.