A Practical Linux and Remote Computing Handbook for Scientists
- Introduction: what this handbook is for
- Foundations: core ideas before using remote Linux systems
- Remote access: connect securely to distant machines
- Files and data movement: move, compress, and organize data
- Shell essentials: understand what the terminal is doing
- Environment variables and PATH: understand where commands come from
- which / type / command -v: inspect which command actually runs
- PS1 prompt: show user, host, and directory clearly
- Prompt distinction: avoid mixing local and remote shells
- Terminal habits: small commands that save time every day
- grep: search logs, configs, and scripts quickly
- Pipes: combine small commands into useful workflows
- Redirection: save output and errors separately
- Exit codes and chaining: react to success or failure
- Background jobs: run commands without blocking the shell
- Working effectively on remote machines: stay productive once connected
- Environments, scripting, and reproducibility: make work easier to rerun and trust
- Cluster-specific topics: what changes on shared HPC systems
- Safety: reduce the risk of costly command-line mistakes
- Final thoughts: why these skills matter early
Introduction
Scientific work rarely happens on a single machine anymore. You may write code on your laptop, preprocess data on a workstation, launch long computations on a remote server, visualize results on another machine, and submit heavy jobs to a cluster. Even when the science itself is clear, the operational side can still be messy: where am I working right now? Which machine is running this code? Why does a file exist on one system but not another? Why does something work on campus but not from home? Why did my process stop when my connection dropped?
Many researchers are never taught these things systematically. They learn fragments from colleagues, old lab notes, shell history, or trial and error. The result is often functional but fragile. People get by, but they waste time and make avoidable mistakes.
This short handbook-style article is meant to fill that gap. It is not a Linux course and not a system administration guide. It is a practical handbook for scientists who work across laptops, local Linux machines, remote servers, institutional infrastructure, and sometimes clusters. Most of the ideas apply broadly.
Each section follows the same logic: why it matters, how it works, a practical example, and a common mistake.
Local machine vs remote machine
Why it matters
Many beginner mistakes stem from forgetting whether a command is running on the local computer or on another machine. That context determines:
- Where files are located
- Which software is installed
- Where outputs are written
- Which CPU and memory are being used
- What a destructive command will actually affect
A command like rm, mv, python,
or tar is not dangerous or safe by itself. It depends
entirely on where you run it.
How it works
Your local machine is the laptop or desktop in front of you. A remote machine is another computer you access through the network. Once you connect to it through SSH, commands typed in that shell are executed on the remote side.
Two commands are especially useful:
pwd
hostname
pwd shows the current directory.
hostname shows which machine you are on.
If you feel unsure, these commands restore context immediately.
Practical example
You start on your laptop:
hostname
which may return:
thinkpad
Then you connect:
ssh myserver
and now:
hostname
may return:
server01
From that point onward, commands are acting on server01,
not on your laptop.
Common mistake
Thinking that a terminal window belongs to one machine forever. It does not. The terminal is only the interface. The actual execution context is defined by the shell session you are in.
SSH
Why it matters
SSH is the standard tool for secure command-line access to a remote Linux machine. If you use a lab workstation, departmental server, personal server, or cluster login node, SSH is usually the main entry point.
Without SSH literacy, remote scientific computing remains uncomfortable.
How it works
SSH opens a secure shell session on another machine. The basic form is:
ssh username@server.univ.fr
This means: connect to server.univ.fr using the account
username.
Once connected, commands are executed remotely. SSH is also used
underneath many other tools such as scp,
rsync, sftp, and some remote IDE
integrations.
Practical example
ssh bgallois@labserver.univ.fr
After authentication, you receive a remote shell prompt. If you run
ls, pwd, or python, they now
refer to the remote system.
Common mistake
Treating SSH like a passive window into another machine. It is not passive. It gives you an active shell on that machine.
SSH config
Why it matters
Manually typing full SSH commands again and again is inconvenient and
error-prone. It gets worse when different machines require different
usernames, ports, or key files. SSH has a built-in solution for this:
~/.ssh/config.
How it works
You define named host entries in ~/.ssh/config. For
example:
Host myserver
HostName server.univ.fr
User bgallois
IdentityFile ~/.ssh/id_server
Then you can connect with:
ssh myserver
instead of typing the full address and username each time.
The advantage extends beyond just SSH. The same host nickname works with other SSH-based tools as well.
Practical example
With the configuration above, all of these become valid:
ssh myserver
scp file.txt myserver:/data/
rsync -avP results/ myserver:/data/results/
sftp myserver
Common mistake
Using only bash aliases for SSH shortcuts. Aliases can simplify a
single shell command, but ~/.ssh/config is the proper
mechanism because it works across the SSH ecosystem.
SSH keys
Why it matters
Password-based login works, but SSH keys are usually better. They are more convenient for frequent access and are standard practice in many technical environments.
How it works
SSH key authentication uses two files:
- a private key that stays on your own machine
- a public key that you place on the remote machine
A modern key pair can be created like this:
ssh-keygen -t ed25519 -C "your.email@lab.fr"
That usually creates:
~/.ssh/id_ed25519
~/.ssh/id_ed25519.pub
The .pub file is the public key. The non-.pub
file is the private key and must remain private.
Practical example
Generate a key:
ssh-keygen -t ed25519 -C "your.email@lab.fr"
Then copy the public key to the remote machine:
ssh-copy-id myserver
After that, SSH can automatically authenticate with the key.
Common mistake
Mixing up public and private keys. The public key is meant to be
copied to remote machines. The private key should stay private and
local. Another common issue is incorrect permissions on
~/.ssh, which can cause SSH to reject a key.
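The permission fix is worth knowing by heart. A minimal sketch, run here on a throwaway directory so nothing in your real ~/.ssh is touched (the key filenames are only examples):

```shell
#!/usr/bin/env bash
# Sketch: the permissions SSH expects, demonstrated on a scratch
# directory instead of the real ~/.ssh. All paths are examples.
demo=$(mktemp -d)

mkdir "$demo/.ssh"
touch "$demo/.ssh/id_ed25519"       # stand-in for a private key
touch "$demo/.ssh/id_ed25519.pub"   # stand-in for a public key

chmod 700 "$demo/.ssh"                  # only the owner may enter
chmod 600 "$demo/.ssh/id_ed25519"       # private key: owner read/write only
chmod 644 "$demo/.ssh/id_ed25519.pub"   # public key may be world-readable

stat -c '%a %n' "$demo/.ssh" "$demo/.ssh/id_ed25519"
```

On a real machine, the same chmod calls apply to ~/.ssh and the key files themselves; SSH is strict about these modes and may refuse a key otherwise.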
VPN
Why it matters
Sometimes SSH is configured correctly, but still fails from home. The reason may be that the remote machine is accessible only from within the institution's network. In that case, a VPN may be required before SSH can even reach the machine.
How it works
A VPN, or Virtual Private Network, provides your device with secure access to the institution's internal network. It does not replace SSH. It only makes internal resources reachable.
So in many environments, the workflow is:
- Connect to the VPN
- Then use SSH to access the machine
Practical example
At home, you start the university VPN client and authenticate. Once connected, this may work:
ssh myserver
Without the VPN, the server may not be reachable at all.
Common mistake
Confusing VPN and SSH. VPN gives access to a network. SSH gives access to a shell on one machine. They solve different problems.
File transfer with scp
Why it matters
Scientific work constantly involves moving files: code, scripts, logs,
intermediate outputs, figures, datasets, and results.
scp is a simple and reliable command for that.
How it works
scp copies files over SSH.
Upload a file:
scp localfile.txt myserver:/path/to/remote/
Download a file:
scp myserver:/path/to/remote/result.txt .
The final "." means the current local directory.
For directories, use -r:
scp -r myfolder myserver:/path/to/remote/
Practical example
Upload a script:
scp analysis.py myserver:/home/bgallois/project/
Download a result file:
scp myserver:/home/bgallois/project/output.csv .
Common mistake
Forgetting which side is local and which side is remote. In
scp, a path with host: is remote. A path
without it is local.
File synchronization with rsync
Why it matters
scp is fine for simple copies, but it is not ideal for
large folders, repeated transfers, or transfers that are interrupted.
In those cases, rsync is often the best tool.
How it works
rsync synchronizes files and directories efficiently,
usually over SSH. It can skip unchanged files and resume interrupted
transfers.
A common pattern is:
rsync -avP data/ myserver:/path/to/data/
Useful flags:
- -a: archive mode
- -v: verbose
- -P: progress plus partial transfer support
Practical example
Upload a results folder:
rsync -avP results/ myserver:/scratch/project/results/
Download outputs:
rsync -avP myserver:/scratch/project/outputs/ ./outputs/
Common mistake
Misunderstanding the trailing slash. results and
results/ do not mean the same thing in
rsync. One copies the directory itself. The other copies
its contents.
Compression and archiving
Why it matters
Researchers often need to bundle outputs, compress results, prepare transfers, or archive older work. Compression and archiving tools are basic but very useful.
How it works
A common Linux tool is tar. It creates archives and
compresses files.
Create a compressed archive:
tar -czf results.tar.gz results/
Extract it:
tar -xzf results.tar.gz
Compress a single file with gzip:
gzip file.txt
Decompress it:
gunzip file.txt.gz
Practical example
Before downloading a folder containing many small files from a server, you may archive it first:
tar -czf outputs.tar.gz outputs/
scp myserver:/path/outputs.tar.gz .
This is often cleaner than copying thousands of small files individually.
Common mistake
Confusing archiving and compression. tar groups files
together. Compression, such as gzip, reduces size. With
.tar.gz, both ideas are combined.
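The distinction can be checked directly: tar -t lists an archive's contents without extracting anything. A self-contained sketch using throwaway files in a temporary directory:

```shell
#!/usr/bin/env bash
# Sketch: archive a folder, inspect it, extract it elsewhere.
# All filenames are throwaway examples created in a temp directory.
cd "$(mktemp -d)"

mkdir results
echo "42" > results/a.txt
echo "43" > results/b.txt

tar -czf results.tar.gz results/   # -c create, -z gzip, -f filename
tar -tzf results.tar.gz            # -t lists contents without extracting

mkdir elsewhere
tar -xzf results.tar.gz -C elsewhere   # -C extracts into another directory
cat elsewhere/results/a.txt            # prints: 42
```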
Environment variables and PATH
Why it matters
A huge amount of shell confusion comes from environment variables,
especially PATH. Many beginners wonder why a command
works in one terminal but not another, or why the wrong version of a
program is being used.
How it works
An environment variable is a named value inherited by processes.
PATH is a special variable containing a list of
directories searched when you type a command.
To inspect it:
echo $PATH
If you type python, the shell looks through the
directories listed in PATH until it finds an executable
named python.
You can define your own environment variable temporarily:
export MYVAR=value
That variable exists in the current shell and in child processes spawned from it.
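This inheritance rule can be observed directly. A minimal sketch (the variable names are arbitrary examples):

```shell
#!/usr/bin/env bash
# Sketch: exported variables are inherited by child processes;
# plain shell variables are not. Variable names are arbitrary.

PLAIN_VAR="only in this shell"            # not exported
export EXPORTED_VAR="visible to children"

# Each bash -c below starts a separate child process.
bash -c 'echo "child sees EXPORTED_VAR: ${EXPORTED_VAR:-<unset>}"'
bash -c 'echo "child sees PLAIN_VAR:    ${PLAIN_VAR:-<unset>}"'
```

The first child prints the exported value; the second sees nothing, because the unexported variable never left the parent shell.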
Practical example
Check where the shell will search for executables:
echo $PATH
Set a variable:
export PROJECT_ROOT=$HOME/project
echo $PROJECT_ROOT
Append a custom directory to PATH:
export PATH="$HOME/bin:$PATH"
If you want that change to persist, place it in
~/.bashrc.
Common mistake
Editing PATH in one shell and assuming the change is
permanent. It is not, unless you add it to a startup file such as
~/.bashrc.
which, type, and command -v
Why it matters
When several versions of a program exist, or aliases and shell functions are involved, it is important to know which command will actually run.
How it works
Useful inspection commands include:
which python
command -v python
type python
command -v and type are generally more
reliable in shell contexts than which, especially when
aliases or functions are involved.
Practical example
Suppose python behaves unexpectedly. Check what it
actually refers to:
type python
command -v python
You may discover it is an alias, a virtual environment executable, or a system binary.
Common mistake
Assuming that typing a command name always refers to the same program on every machine or in every shell.
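A short demonstration of how a name can stop meaning what you expect. Here a shell function deliberately shadows ls inside one script; nothing on the system is changed:

```shell
#!/usr/bin/env bash
# Sketch: a shell function can shadow a real command, and "type"
# reveals what will actually run. The shadowing exists only inside
# this script.

ls() { echo "this is a function, not the real ls"; }

type -t ls        # prints: function
ls                # runs the function, not the binary
command ls / >/dev/null && echo "command ls bypassed the function"

unset -f ls       # remove the shadowing function again
type -t ls        # prints: file
```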
Permissions and ownership
Why it matters
Many Linux errors boil down to permissions. If you do not understand basic ownership and access bits, messages like "Permission denied" will feel arbitrary.
How it works
A basic listing with permissions is:
ls -l
You may see something like:
-rwxr-xr-- 1 bgallois lab 1234 Mar 29 10:30 script.sh
The permission bits are grouped for:
- owner
- group
- others
r means read, w means write, and
x means execute.
Useful commands include:
chmod +x script.sh
chmod 600 private.txt
chown user:group file
Practical example
Make a script executable:
chmod +x run_analysis.sh
Then run it directly:
./run_analysis.sh
Common mistake
Using chmod 777 as a panic solution. That is usually
excessive and often the wrong response. It is better to understand
what access is actually needed.
Symbolic links
Why it matters
Scientific workflows often involve large datasets or repeated directory structures. Symbolic links can avoid unnecessary duplication and make layouts easier to manage.
How it works
A symbolic link is a reference to another path. Create one with:
ln -s /real/path shortcut_name
Inspect where it points:
readlink -f shortcut_name
A symlink is not a copy of the data. It is a pointer.
Practical example
Create a shortcut to a large dataset stored elsewhere:
ln -s /data/shared/cryoem_dataset dataset
Now the path dataset points to the original location
without duplicating the files.
Common mistake
Thinking that a symlink is an independent copy. If the original target disappears, the link becomes broken.
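The pointer behavior, including how a link breaks, can be demonstrated with throwaway files:

```shell
#!/usr/bin/env bash
# Sketch: a symlink is a pointer, and it breaks when its target
# disappears. All filenames are throwaway examples.
cd "$(mktemp -d)"

echo "large dataset" > original.dat
ln -s original.dat dataset        # create the link
readlink dataset                  # prints: original.dat
cat dataset                       # reads through the link

rm original.dat                   # remove the target
[ -L dataset ] && echo "the link itself still exists"
[ -e dataset ] || echo "but it no longer points to anything"
```

Note the two tests: -L checks whether the path is a symlink, while -e follows the link and fails once the target is gone.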
Shell prompt customization with PS1
Why it matters
The shell prompt is not only cosmetic. It is context. A good prompt reduces mistakes by telling you who you are, where you are, and sometimes which machine or environment you are using.
How it works
In bash, the prompt is controlled by the PS1 variable. A
useful prompt is:
export PS1="\u@\h:\w$ "
This includes:
- \u: username
- \h: hostname
- \w: current directory
If placed in ~/.bashrc, it is applied automatically to
future shells.
Practical example
With that setting, your prompt might look like:
bgallois@thinkpad:~/work$
This is already much more informative than a bare $.
Common mistake
Treating PS1 customization as pure aesthetics. In real
workflows, it is a practical safety feature.
Making local and remote prompts visually different
Why it matters
When you juggle several terminal windows, local and remote sessions can start to look identical. That makes mistakes much more likely.
How it works
Use different prompts on local and remote machines. Even a simple label helps.
Practical example
On your local machine:
export PS1="[LOCAL] \u@\h:\w$ "
On a remote server:
export PS1="[REMOTE] \u@\h:\w$ "
Then you may see:
[LOCAL] bgallois@thinkpad:~/work$
[REMOTE] bgallois@server01:/scratch/project$
Common mistake
Leaving local and remote prompts identical and assuming memory alone will prevent confusion.
Useful terminal habits
Why it matters
Many productivity gains come from very small habits rather than advanced tools.
How it works
Use shell history search with Ctrl+r. Use tab completion
aggressively. Learn a few inspection commands you will reuse daily.
Helpful commands include:
ls -lh
du -sh myfolder
df -h
find . -name "*.log"
grep "ERROR" logfile.txt
tail -f logfile.txt
head file.txt
wc -l data.csv
less bigfile.txt
sort names.txt
uniq repeated.txt
Practical example
Check folder size:
du -sh results/
Watch a running log:
tail -f simulation.log
Count lines in a file:
wc -l particles.xmd
Common mistake
Retyping everything manually instead of using shell history, completion, and small helper commands.
Text editors on remote machines
Why it matters
Sooner or later, you SSH into a machine and need to edit a config file, script, or job submission file. If you cannot edit from the terminal, you are stuck.
How it works
For beginners, nano is often the easiest terminal editor
to start with. vim is powerful but has a steeper learning
curve.
Open a file with nano:
nano script.sh
Practical example
Edit your SSH config:
nano ~/.ssh/config
Or create a quick batch script:
nano job.sh
Common mistake
SSHing into a machine and then realizing you cannot modify a file because you know no terminal editor.
tmux or screen
Why it matters
Network connections are not perfectly stable. If you do important work in a plain SSH session, you will eventually lose it because of Wi-Fi issues, VPN reconnects, or laptop sleep.
How it works
tmux and screen create persistent terminal
sessions on the remote side. If the connection breaks, the session
continues to run.
With tmux:
tmux new -s work
tmux attach -t work
tmux ls
Practical example
You SSH into a remote workstation, start tmux, and run a
long preprocessing command. Your home network drops briefly. Without
tmux, the shell dies. With tmux, you
reconnect and reattach.
Common mistake
Running important long-lived work directly in a fragile SSH session and assuming the network will not fail.
Process management
Why it matters
On both local and remote machines, you need to know what is running, how much it uses, and how to stop it if necessary.
How it works
Useful commands include:
ps aux
top
htop
kill PID
kill -9 PID
ps aux lists processes.
top and htop provide live views.
kill PID asks a process to terminate.
kill -9 PID forcefully kills it and should be used with
care.
Practical example
Find a stuck Python process:
ps aux | grep python
Then stop it gracefully:
kill 12345
Only use kill -9 if a normal signal does not work.
Common mistake
Using kill -9 immediately for everything. It is forceful
and bypasses normal cleanup.
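The difference shows up in the exit status. A sketch using sleep as a stand-in for a stuck process:

```shell
#!/usr/bin/env bash
# Sketch: terminate a process with a normal TERM signal and observe
# its exit status. "sleep" stands in for a stuck process.

sleep 100 &          # a long-running stand-in process
pid=$!               # PID of the most recent background job

kill "$pid"          # polite request to terminate (SIGTERM)

status=0
wait "$pid" || status=$?     # collect the exit status
echo "exit status: $status"  # 128 + 15 (SIGTERM) = 143
```

A process killed by a signal conventionally exits with 128 plus the signal number, which is one way logs reveal that a job was killed rather than finished.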
Background jobs
Why it matters
The shell can run processes in the foreground or background. Understanding this is useful even outside clusters, especially for remote sessions.
How it works
A command followed by & starts in the background:
python script.py &
Useful commands:
jobs
fg
bg
jobs lists background jobs in the current shell.
fg brings one back to the foreground.
bg resumes a stopped job in the background.
Practical example
Start a long-running script, but keep using the same terminal:
python preprocess.py &
jobs
Later bring it back:
fg
Common mistake
Thinking a background job is safe from terminal closure. If the shell
exits, the job may still die unless you use something like
tmux, screen, or another persistence
mechanism.
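One such persistence mechanism is nohup, which makes a command ignore the hangup signal sent when a terminal closes. A sketch, with a trivial command standing in for a real long-running script:

```shell
#!/usr/bin/env bash
# Sketch: nohup detaches a command from the terminal's hangup
# signal, so a background job can outlive the session. A trivial
# command stands in for a real long job here.
cd "$(mktemp -d)"

nohup bash -c 'sleep 1; echo "finished"' > run.log 2>&1 &
pid=$!

# The typical real-world form of the same idea is:
#   nohup python preprocess.py > run.log 2>&1 &

wait "$pid"       # only possible here because we stayed in the same shell
cat run.log       # prints: finished
```

For interactive work, tmux or screen is usually more comfortable; nohup shines for fire-and-forget commands.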
Exit codes and command chaining
Why it matters
A shell command usually reports success or failure through an exit code. Understanding that helps when debugging and writing scripts.
How it works
After a command, inspect the exit code with:
echo $?
Conventionally, 0 indicates success and nonzero values
indicate failure.
Shell chaining operators are also very useful:
cmd1 && cmd2
cmd1 || echo "failed"
cmd2 runs after && only if
cmd1 succeeded.
Practical example
Run the second step only if the first step works:
mkdir results && cp output.txt results/
Common mistake
Ignoring the exit status and assuming a command worked because it produced some output.
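All of this can be seen in a few lines. The sketch below uses true and false, two real commands that simply succeed or fail:

```shell
#!/usr/bin/env bash
# Sketch: inspecting exit codes and reacting to them. "true" and
# "false" are real commands that simply succeed or fail.
cd "$(mktemp -d)"

true
echo "exit code of true: $?"         # prints 0

status=0
false || status=$?                   # capture the failure without stopping
echo "exit code of false: $status"   # prints 1

mkdir -p results && echo "mkdir succeeded, so this runs"
false || echo "this runs only because the previous command failed"
```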
Standard output, standard error, and redirection
Why it matters
Understanding output streams is essential for logs, scripts, and debugging. Many beginners do not know why some messages still appear on the screen even after output redirection.
How it works
Programs usually write normal output to standard output and error messages to standard error.
Redirect standard output:
python script.py > out.txt
Redirect standard error:
python script.py 2> err.txt
Redirect both to one file:
python script.py > all.txt 2>&1
Practical example
Run a script and save errors separately:
python analysis.py > analysis.out 2> analysis.err
Now, normal logs and error logs are distinct.
Common mistake
Redirecting only standard output and assuming all messages will be captured. Error messages may still go to the terminal unless standard error is redirected too.
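A self-contained sketch makes the two streams visible. The inline function below is an invented stand-in for a program that writes one line to each stream:

```shell
#!/usr/bin/env bash
# Sketch: stdout and stderr are separate streams. "write_both" is an
# invented stand-in for a program that writes to both.
cd "$(mktemp -d)"

write_both() {
  echo "normal output"          # goes to stdout
  echo "error message" >&2      # goes to stderr
}

write_both >  out.txt           # captures only stdout
write_both 2> err.txt           # captures only stderr; stdout hits the screen
write_both >  all.txt 2>&1      # captures both in one file

grep -c "" out.txt all.txt      # line counts: out.txt has 1, all.txt has 2
```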
Searching text with grep
Why it matters
A large part of scientific computing involves text files: logs, configuration files, scripts, metadata files, CSV-like outputs, scheduler logs, and software messages. Being able to search quickly inside text is one of the most useful command-line skills.
grep lets you search for lines containing a pattern. It
is simple, fast, and extremely useful for debugging and inspection.
How it works
The basic form is:
grep "pattern" file.txt
This prints lines from file.txt that contain the pattern.
Some very useful options are:
- -i for case-insensitive search
- -n to show line numbers
- -r to search recursively in directories
- -v to invert the match
- -E for extended regular expressions
Practical example
Search for errors in a log file:
grep -i "error" job.err
Search recursively for a parameter name in a project folder:
grep -rn "learning_rate" .
Exclude commented lines from a config file:
grep -v "^#" config.txt
Find lines matching one of several words:
grep -E "ERROR|WARNING|FAILED" logfile.txt
Common mistake
Treating grep as if it only works for exact literal
words. In practice, it is much more flexible, especially with options
like -i, -r, and -E.
Combining commands with the pipe |
Why it matters
A lot of command-line efficiency comes from chaining small tools
rather than searching for a single giant command that does everything.
The pipe operator | lets you send the output of one
command directly into another. This is extremely useful for filtering,
counting, sorting, and searching.
It is especially helpful when working with logs, process lists, CSV-like text files, and general command output.
How it works
The basic form is:
command1 | command2
This means: run command1, then pass its standard output
to command2.
You can chain several commands together:
command1 | command2 | command3
That creates a pipeline in which each command transforms or filters the output from the previous one.
Practical example
Search for a file in a directory listing:
ls -l | grep "report"
Search for running Python processes:
ps aux | grep python
Count how many error lines appear in a log:
grep "ERROR" logfile.txt | wc -l
Show only the first matching lines:
grep "ATOM" structure.pdb | head
Show only the last matching lines:
grep "ERROR" job.err | tail
Count unique values in the first column of a CSV-like file:
cut -d',' -f1 data.csv | sort | uniq -c
Common mistake
Using a pipe when the second command already accepts a filename directly. For example:
grep "x" file.txt
is usually better than:
cat file.txt | grep "x"
The second form works, but it is unnecessary here.
Another common mistake is forgetting that a pipe passes only standard output, not standard error.
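When error messages must go through the pipe as well, merge stderr into stdout first with 2>&1. A sketch, with an invented stand-in for a chatty program:

```shell
#!/usr/bin/env bash
# Sketch: a plain pipe forwards only standard output. To filter
# error messages too, merge stderr into stdout before the pipe.
# "noisy" is an invented stand-in for a chatty program.

noisy() {
    echo "DATA line"                    # stdout
    echo "ERROR something broke" >&2    # stderr
}

noisy 2>/dev/null | grep -c "ERROR" || true   # prints 0: stderr bypassed the pipe
noisy 2>&1 | grep -c "ERROR"                  # prints 1: stderr was merged first
```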
Logs and diagnostics
Why it matters
When something fails, logs are often the fastest route to the answer. Guessing is slower than reading what the system already reported.
How it works
Use tools like:
less job.out
less job.err
tail -n 50 job.err
grep -i error job.err
These help inspect large logs and find relevant messages.
Practical example
A script failed on a remote server. Instead of rerunning mindlessly:
tail -n 30 analysis.err
You may immediately see "file not found", "permission denied", or a Python traceback.
Common mistake
Treating logs as something only advanced users read. Logs are a core diagnostic tool for everyone.
Disk usage and file hygiene
Why it matters
Scientific workflows create lots of files. Poor file hygiene leads to full disks, confusion, duplicated data, and an irreproducible mess.
How it works
Useful commands include:
du -sh .
du -sh results/*
df -h
find . -type f | wc -l
Organize outputs clearly. Separate raw data, processed data, and temporary files.
Practical example
Before launching a workflow on a remote server, check available space:
df -h
Then inspect old output sizes:
du -sh old_results/
Common mistake
Ignoring storage until a workflow crashes due to a full filesystem.
VNC and remote graphical access
Why it matters
Not all scientific work is purely command-line based. Some workflows require a GUI, a visualization program, or a remote desktop.
How it works
VNC provides remote desktop access. Instead of only giving a shell, it gives access to a graphical desktop session on the remote machine.
SSH and VNC solve different problems:
- SSH gives command-line access
- VNC gives graphical desktop access
VNC may itself require VPN or SSH tunneling, depending on the infrastructure.
Practical example
A lab workstation hosts microscopy software or a visualization tool. You connect through VPN, then use a VNC client to access the desktop remotely.
Common mistake
Treating VNC as a general replacement for SSH. Usually, it is not. SSH remains the main tool for shell-based work.
Modules, conda, and environments
Why it matters
Many scientific problems that look mysterious are actually environment problems. Wrong Python or library version, missing executable, or conflicting dependencies: these are extremely common.
How it works
On shared systems, environment modules are often used:
module avail
module load python
module list
For Python projects, isolated environments are also common, for
example, with venv or conda.
A virtual environment with Python's built-in tools:
python -m venv env
source env/bin/activate
Conda environments follow a similar idea, though with different commands.
Practical example
On a shared server, you may first load a system Python module:
module load python
Then activate a project-specific environment:
source ~/project/env/bin/activate
Now the shell uses the software versions intended for that project.
Common mistake
Installing everything into one global environment and hoping all projects will remain compatible forever.
Remote server vs cluster
Why it matters
Not every remote Linux machine should be used in the same way. A personal server, lab workstation, and HPC cluster have different expectations.
How it works
A remote server or workstation may be a single machine where you SSH in and run your work directly. A cluster usually has login nodes, compute nodes, and a scheduler to manage shared resources.
Most habits in this handbook apply to both. The main difference is how heavy compute tasks should be launched.
Practical example
On a lab workstation, you may SSH in and run a Python script directly. On a cluster, you usually prepare a job submission script and let the scheduler run it on a compute node.
Common mistake
Assuming all remote infrastructure works like a personal machine.
Scheduler basics for cluster users
Why it matters
This is the most cluster-specific part. If you use a cluster, you need at least basic scheduler literacy.
How it works
A scheduler such as Slurm manages shared resources. You request time, CPUs, and memory, and then wait for the job to run on suitable compute nodes.
Useful basic commands:
sbatch myjob.sh
squeue -u $USER
scancel 123456
A minimal Slurm script might look like:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
python myscript.py
Practical example
Save the script as myjob.sh and submit it:
sbatch myjob.sh
Then monitor the queue:
squeue -u $USER
Common mistake
Running heavy workloads directly on a login node instead of submitting them properly through the scheduler.
Understanding commands, options, and arguments
Why it matters
Many shell commands follow a similar pattern, but beginners often treat them as arbitrary strings to memorize. In reality, most commands are built from the same pieces: the command name itself, optional flags or options, and one or more arguments.
Understanding that structure makes the command line much easier to read, modify, and debug.
How it works
A typical command looks like this:
command [options] [arguments]
- The command is the program to run
- The options modify its behavior, often starting with - or --
- The arguments are the objects the command acts on, such as files, directories, patterns, or values
For example:
grep -i "error" logfile.txt
Here:
- grep is the command
- -i is an option, meaning case-insensitive search
- "error" is an argument: the search pattern
- logfile.txt is another argument: the file to search
Another example:
scp -r results/ myserver:/data/
Here:
- scp is the command
- -r is an option, meaning recursive copy
- results/ is an argument: the source
- myserver:/data/ is another argument: the destination
Some commands also support long options, for example:
ls --human-readable
though short forms such as -h are often more common.
Practical example
Consider:
tar -czf archive.tar.gz results/
This can be read as:
- tar → the command
- -c → create an archive
- -z → compress with gzip
- -f → use the following filename for the archive
- archive.tar.gz → the archive filename
- results/ → the folder to archive
Once you read commands this way, they stop looking random.
Common mistake
Treating everything after the command as an undifferentiated string. That makes it harder to understand what can be changed safely.
Another common mistake is forgetting that the order of arguments can matter. For example, in:
cp source.txt destination.txt
The first argument is the source, and the second is the destination. Reversing them changes the meaning.
Script arguments: $1, $2, and "$@"
Why it matters
Shell scripts often need input values such as a filename, a directory, or a parameter. Instead of hardcoding them, scripts can accept command-line arguments.
How it works
If you run:
bash myscript.sh input.txt output.txt
then inside the script:
- $1 is input.txt
- $2 is output.txt
- "$@" means all arguments passed to the script
Practical example
#!/bin/bash
echo "Input file: $1"
echo "Output file: $2"
Run it with:
bash myscript.sh data.csv results.txt
Common mistake
Using $1, $2, and so on without checking
whether the user actually provided enough arguments.
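The fix is to check $#, the number of arguments received, before touching $1 or $2. A sketch, written as a function so it can run in place; in a real script the same check goes at the top of the file:

```shell
#!/usr/bin/env bash
# Sketch: check the argument count before using $1 and $2.
# "process" is an invented name for this demonstration.

process() {
    if [ "$#" -lt 2 ]; then
        echo "usage: process input_file output_file" >&2
        return 1
    fi
    echo "input:  $1"
    echo "output: $2"
}

process data.csv results.txt                  # enough arguments: runs normally
process data.csv || echo "rejected: missing argument"
```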
Writing scripts instead of repeating commands manually
Why it matters
If you repeat the same sequence of commands several times, typing them manually is both slower and more error-prone than writing a small script.
How it works
Put repeated workflows into a shell script. This makes them explicit, reusable, and easier to reproduce later.
Practical example
Instead of repeatedly typing:
source ~/env/bin/activate
python preprocess.py
python analyze.py
put them into run_analysis.sh:
#!/bin/bash
source ~/env/bin/activate
python preprocess.py
python analyze.py
and run:
bash run_analysis.sh
Common mistake
Relying on memory or old shell history as the only documentation of a workflow.
Reproducibility and workflow logging
Why it matters
Scientific computing is not only about running commands successfully once. It is also about understanding and reproducing what was done later.
How it works
Keep scripts, record important parameters, note software versions, and store the project structure deliberately. Small habits go a long way:
- Keep run scripts under version control
- Write a short README.md in the project directories
- Record software versions when relevant
Useful commands include:
python --version
git rev-parse HEAD
Practical example
After a successful analysis, store the script used, the command line, the Git commit hash, and the environment version to avoid relying on memory.
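That habit can itself be a tiny script. A sketch that appends a provenance record after a run; the log filename and format are arbitrary choices, not a standard:

```shell
#!/usr/bin/env bash
# Sketch: append a minimal provenance record after a run. The log
# filename, format, and example command are arbitrary choices.
cd "$(mktemp -d)"

log=run_log.txt
{
    echo "date:    $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "host:    $(hostname)"
    echo "command: python analyze.py --alpha 0.1"   # the command you ran
    # Record the commit hash only if git exists and this is a repository:
    command -v git >/dev/null && git rev-parse HEAD 2>/dev/null \
        | sed 's/^/commit:  /'
    echo "---"
} >> "$log"

cat "$log"
```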
Common mistake
Treating reproducibility as something to think about only when writing the paper.
Git basics
Why it matters
Version control is one of the most valuable skills many young researchers lack. Even basic Git knowledge makes code, notes, and scripts safer and easier to evolve.
How it works
Git tracks changes to files over time. At a minimum, it is useful for:
- scripts
- source code
- configuration files
- small documentation files
It is usually not ideal for very large raw datasets.
Common basic commands include:
git init
git status
git add script.py
git commit -m "Add preprocessing script"
A .gitignore file helps exclude generated junk and large
temporary outputs.
Practical example
Initialize a Git repository in your analysis folder and commit scripts and configuration files as they evolve.
Common mistake
Either using no version control at all, or trying to version huge generated datasets directly without planning.
Safer destructive commands
Why it matters
A single destructive command in the wrong place can cause serious damage, especially on remote machines. Good habits reduce that risk.
How it works
Before deleting or moving important things, confirm context:
pwd
ls
Be careful with wildcards. Understand what variables expand to. Test commands on dummy files if unsure.
Sometimes interactive variants are useful:
rm -i file.txt
Practical example
Before deleting a folder on a remote machine:
pwd
ls
rm -r old_results
Those first two commands are a simple but effective safety check.
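Another defensive habit is to park files in a trash directory instead of deleting them outright. A sketch; safe_rm and TRASH_DIR are hypothetical names invented for this example, not standard tools:

```shell
#!/usr/bin/env bash
# Sketch: a "soft delete" that moves files into a trash directory
# instead of removing them. "safe_rm" and TRASH_DIR are hypothetical
# names; demonstrated on throwaway files in a temp directory.
cd "$(mktemp -d)"

safe_rm() {
    local trash="${TRASH_DIR:-$HOME/.trash}"
    mkdir -p "$trash"
    mv -- "$@" "$trash"/
}

TRASH_DIR=./trash                  # keep the demo self-contained
echo "precious results" > results.txt
safe_rm results.txt

ls trash/                          # prints: results.txt
[ ! -e results.txt ] && echo "gone from the working directory, but recoverable"
```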
Common mistake
Typing rm -rf quickly from habit, without confirming
location and target first.
PGP keys and signed Git commits
Why it matters
When you sign a Git commit or tag, you add cryptographic proof that your key really created it. This is useful when you want to verify authorship, especially in collaborative or public projects.
For many small personal research projects, unsigned commits are completely fine. But for shared codebases, public repositories, releases, or long-lived scientific software projects, signed commits and signed tags can be a useful trust and traceability mechanism.
It also helps distinguish between authentication and authorship. SSH keys help you access machines and services. PGP keys help you sign work.
How it works
PGP uses a public/private key pair, just like other public-key systems.
- the private key stays on your machine
- the public key can be shared
Git can use a PGP key to sign commits and tags. Platforms such as GitHub or GitLab can then show that a commit is verified if the matching public key has been added to your account.
Typical workflow:
- Generate a PGP key
- Tell Git which key to use
- Enable commit signing
- Optionally upload the public key to your Git hosting account
A common GnuPG command to generate a key is:
gpg --full-generate-key
List available secret keys:
gpg --list-secret-keys --keyid-format LONG
Configure Git to use one:
git config --global user.signingkey YOURKEYID
git config --global commit.gpgsign true
Then a normal commit will be signed automatically, or you can sign explicitly with:
git commit -S -m "Add preprocessing script"
You can also sign annotated tags, which is often especially useful for releases:
git tag -s v1.0.0 -m "Version 1.0.0"
Practical example
Generate a key:
gpg --full-generate-key
Find the key ID:
gpg --list-secret-keys --keyid-format LONG
Configure Git:
git config --global user.signingkey YOURKEYID
git config --global commit.gpgsign true
Make a signed commit:
git commit -S -m "Add analysis script"
Common mistake
Confusing SSH keys and PGP keys. They solve different problems. SSH keys are mainly for authentication to systems and services. PGP keys are used for signing and verification.
Another common mistake is enabling commit signing before the key is properly configured in Git or uploaded to the hosting platform, which can make verification appear to fail even though the commit was signed locally.
Final thoughts
Linux and remote computing are often presented either as trivial knowledge everyone should already know or as arcane knowledge reserved for specialists. In practice, the truth is simpler. There is a middle ground of practical skills that every scientist benefits from learning.
None of these skills are glamorous. But they are high-leverage. They save time, reduce mistakes, and make you more autonomous across laptops, remote servers, workstations, and clusters.
That is why they are worth learning early and teaching clearly.