Good practice for bash scripting

6 minute read

Published: December 18, 2018

I will illustrate some good practices for writing bash scripts by showing you how I write and refactor my bash script dock.sh, which does some preparation for docking and then launches the docking run.

Outline with pseudocode
What is it that you want the script to do? Our goal is to translate that into something the machine can understand. Here I have divided my script into 3 chunks, listed in the order I want them to run:
```
main(){
process_protein
make_dock_input
run_dock
}
   
process_protein(){
loop over unprocessed proteins
}
   
make_dock_input(){
need ligand path
need protein paths --> loop
}
   
run_dock(){
call docking programme
}
   
main
```
Here I have defined a function main that wraps the 3 other functions. I keep the definition of main at the top of the file for easy editing. Structuring the script this way lets us think about it at a higher level (there are 3 things that the script does) without needing to worry about the nitty-gritty inner workings of those 3 things. This concept is called abstraction. If need be, we can always zoom in and go to the lower level.
It also makes it easy to comment out a whole functional block by just twiddling with main, like so:
```
 main(){
 process_protein
 make_dock_input
 #run_dock
 }
```
Good-practice-wise, the function definition should be indented. Python would strictly enforce this, but not bash. I find my lack of indentation does not affect readability (and my script will have no nested functions where proper indentation becomes more important), so in this case I prefer no indentation.
```
 # with proper indentation
 main(){
     process_protein
     make_dock_input
     run_dock
 }
```
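
Keeping main at the top works because bash only looks up a function name when the call actually runs, so the only requirement is that the final main call comes after all the definitions. Here is a minimal runnable skeleton of the outline; the echo bodies are placeholders of mine standing in for the real steps, and I follow the no-indentation style used in this post.

```bash
#!/usr/bin/env bash
# Runnable skeleton of the outline.
# The echo lines are placeholders (assumption), not the real steps.

main(){
process_protein
make_dock_input
run_dock
}

process_protein(){
echo "process_protein: convert proteins"
}

make_dock_input(){
echo "make_dock_input: write dock.inp"
}

run_dock(){
echo "run_dock: launch docking"
}

# must come last: by now every function above has been defined
main "$@"
```

Moving the main call above any of the definitions would make bash complain that the command is not found, which is why it stays on the last line.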

If the outline looks alright, replace the pseudocode with the real code.
I am tackling the easiest first, run_dock (let’s pretend the docking executable is docking_programme):

 run_dock(){
 docking_programme dock.inp
 }

Then process_protein (assume pdb_to_dck_converter is a programme that does cleanup and prep for docking and outputs a dock-ready .dck file):

 process_protein(){
 for pdb in *.pdb; do
     echo "$pdb"
     # basename strips the .pdb extension; we then append .dck
     pdb_to_dck_converter -in "$pdb" -out "$(basename -s .pdb "$pdb").dck"
 done
 }

Constructing the for loop is easy. We iterate over all .pdb files in the current directory (make sure the only .pdb files in the directory are the unprocessed ones). Note the deliberate indentation. echo is always good for future debugging. basename strips the .pdb extension, to which we add the .dck extension. If you think a line is not self-explanatory, write a comment.
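
As a side note, the extension swap can also be done with bash parameter expansion instead of basename. The sketch below compares the two; the file name structures/protein1.pdb is a made-up example. The main difference is that basename also drops the directory part, while parameter expansion keeps it.

```bash
# Two ways to turn a .pdb name into a .dck name.
# The path below is a made-up example.
pdb="structures/protein1.pdb"

# basename drops the directory and the suffix
# (POSIX form: basename FILE SUFFIX)
with_basename="$(basename "$pdb" .pdb).dck"

# parameter expansion drops only the suffix and keeps the directory
with_expansion="${pdb%.pdb}.dck"

echo "$with_basename"    # protein1.dck
echo "$with_expansion"   # structures/protein1.dck
```

Which one fits depends on whether the converter should write next to the input file or into the current directory. Quoting "$pdb" in both forms keeps file names with spaces intact.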

Now here is a trickier one: make_dock_input. Starting with a bare-bones template, I need to supply the ligand and protein paths. The template looks like this:

 # template.dock.inp
 protein_paths	
 ligand_path	
 SOME_OTHER_PARAMETERS

While I need the final input to look like this:

 # dock.inp
 PROTEIN /path/protein1.dck
 PROTEIN /path/protein2.dck
 PROTEIN /path/protein3.dck
 ...
 LIGAND /path/ligand.lig	
 SOME_OTHER_PARAMETERS

At first, I thought a simple sed substitution would do:

 # doesn't work: the slashes in the path clash with sed's / delimiter
 ligand_path="/path/ligand.lig"
 sed "s/ligand_path/$ligand_path/" template.dock.inp > dock.inp

But no, $ligand_path contains characters that are illegal in this context: its slashes collide with the / that sed uses as the delimiter. Some clever cleanup or escaping could get around this, but in the end I stuck to simple insertion and deletion. Cleanup would add unnecessary lines, and escape characters would make the code much less readable.
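
To see the failure concretely, and one route around it that this post does not take, here is a small sketch. It feeds the template through a pipe instead of reading template.dock.inp, and the LIGAND prefix is added by me to match the target format. sed accepts other delimiter characters for the s command, so | works as long as the path contains neither | nor &.

```bash
ligand_path="/path/ligand.lig"
template=$'protein_paths\nligand_path\nSOME_OTHER_PARAMETERS'

# With / as the delimiter, the slashes inside the path end the s command
# early, so sed rejects the expression and exits with an error.
if ! printf '%s\n' "$template" | sed "s/ligand_path/$ligand_path/" >/dev/null 2>&1; then
    echo "sed rejected the /-delimited expression"
fi

# Any character that does not occur in the path can serve as the delimiter.
result="$(printf '%s\n' "$template" | sed "s|ligand_path|LIGAND $ligand_path|")"
echo "$result"
```

This keeps the substitution on one line, but it quietly breaks if a path ever contains the chosen delimiter, which is one argument for the insert-and-delete approach used here.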

 # insert at the line before 'placeholder' ligand_path
 ligand_path="LIGAND /path/ligand.lig"
 echo "$ligand_path"
 sed "/ligand_path/i $ligand_path" template.dock.inp > dock.inp
 # delete 'placeholder' ligand_path
 sed -i '/ligand_path/d' dock.inp

How about the protein paths? The logic is the same as for the ligand path; just wrap it in a loop:

 for pdb in *.pdb; do
     # insert at the line before 'placeholder' protein_paths
     dck="$(basename -s .pdb "$pdb").dck"
     protein_path="PROTEIN /path/$dck"
     echo "$protein_path"
     sed -i "/protein_paths/i $protein_path" dock.inp
 done
 # delete 'placeholder' protein_paths
 sed -i '/protein_paths/d' dock.inp

Combine them:

 make_dock_input(){
 # insert at the line before 'placeholder' ligand_path
 ligand_path="LIGAND /path/ligand.lig"
 echo "$ligand_path"
 sed "/ligand_path/i $ligand_path" template.dock.inp > dock.inp
 # delete 'placeholder' ligand_path
 sed -i '/ligand_path/d' dock.inp

 for pdb in *.pdb; do
     # insert at the line before 'placeholder' protein_paths
     dck="$(basename -s .pdb "$pdb").dck"
     protein_path="PROTEIN /path/$dck"
     echo "$protein_path"
     sed -i "/protein_paths/i $protein_path" dock.inp
 done
 # delete 'placeholder' protein_paths
 sed -i '/protein_paths/d' dock.inp
 }
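
Before pointing the function at real files, it can help to dry-run the insert-then-delete logic in a throwaway directory. The sketch below does that under a few assumptions of mine: GNU sed (the one-line i command and -i without a suffix argument are GNU behaviour), two empty dummy files named protein1.pdb and protein2.pdb, and the same placeholder /path/ prefix as above.

```bash
# Dry run of the insert-then-delete approach in a temporary directory.
# Assumes GNU sed; file names and the /path/ prefix are placeholders.
demo_make_dock_input() (
    workdir="$(mktemp -d)"
    cd "$workdir" || exit 1

    # recreate the template and two dummy protein files
    printf '%s\n' protein_paths ligand_path SOME_OTHER_PARAMETERS > template.dock.inp
    touch protein1.pdb protein2.pdb

    # insert the ligand line before its placeholder
    sed "/ligand_path/i LIGAND /path/ligand.lig" template.dock.inp > dock.inp

    # insert one PROTEIN line per .pdb file before the protein placeholder
    for pdb in *.pdb; do
        sed -i "/protein_paths/i PROTEIN /path/${pdb%.pdb}.dck" dock.inp
    done

    # delete both placeholders
    sed -i '/ligand_path/d; /protein_paths/d' dock.inp

    cat dock.inp
    cd / && rm -rf "$workdir"
)

demo_make_dock_input
```

With GNU sed this should print the two PROTEIN lines, then the LIGAND line, then SOME_OTHER_PARAMETERS, which is the layout the final dock.inp needs. The function body is wrapped in parentheses so that it runs in a subshell and the cd does not leak into the calling shell.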

Debug and refactor
Test the script. Does it do what you want it to do? If it does, great! But your work is not done yet. You should put some effort into refactoring, which does not change the script's output but beautifies the source code, so to speak. What you want is source code that is readable and maintainable. For a start, you may want to combine some lines:
```
sed -i '/ligand_path/d' dock.inp
sed -i '/protein_paths/d' dock.inp
# one line
sed -i '/ligand_path/d; /protein_paths/d' dock.inp
```
Still very much readable, isn’t it?
Now take a look at our function and variable names. Are there unclear or meaningless ones like $x or $i? Though there is nothing wrong with $i for a throwaway variable in a for loop for instance, replacing it with a more explicit one is better. Compare:
```
for i in *pdb; do
    #do things with $i
    ...
done
```
to:
```
for pdb in *pdb; do
    #do things with $pdb
    ...
done
```
Or even:
```
for pdb_file in *pdb; do
    #do things with $pdb_file
    ...
done
```
Another example – you might decide that this
```
sed -i "/protein_path/i $protein_path" dock.inp
```
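is a little bit confusing since protein_path appears twice, once as the placeholder pattern and once as the variable. Renaming the placeholder (in the template as well) might be better: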
```
# more explicit
sed -i "/protein_path_placeholder/i $protein_path" dock.inp
```
Lastly, add more comments. They are valuable documentation, especially when future you digs out this script again in the far future! Take another look at how I wrote the comments for the insertion and deletion steps in make_dock_input. On the other hand, do not overcomment.
I read somewhere that for any given piece of finished code, one typically writes 10 times as much code along the way! That should give you some idea of how much editing is done.

BONUS: how to specify a dynamic list

This is more of a tip than good practice. I often find myself needing to do things to a list of proteins, while wanting the flexibility to include or exclude some of them. In the example above, for pdb in *pdb will iterate over all .pdb files in the directory, so the list is fixed.

The solution that I like is a bash array:

# quotation marks are optional as long as the names contain no spaces
pdbs=(
protein_a.pdb
protein_b.pdb
#protein_c.pdb # easy excluding by commenting
protein_d.pdb
)

The only drawback is that the syntax for looping over the array is not something I can write off the top of my head:

for pdb in "${pdbs[@]}"; do
    ...
done

If you are calling the list from several different scripts, put the array into a master list file list.sh. In the scripts you just need to include source /path/list.sh. Now you just need to edit the master list once.
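
A minimal sketch of the master list idea follows. A temporary directory stands in for /path/ so that the example can run anywhere, and the protein names are the same examples as above.

```bash
# Sketch: keep the array in one file and source it from each script.
# The temporary directory is a stand-in for /path/ (assumption).
list_dir="$(mktemp -d)"

# the master list file, written here only to make the example self-contained
cat > "$list_dir/list.sh" <<'EOF'
pdbs=(
protein_a.pdb
protein_b.pdb
#protein_c.pdb # easy excluding by commenting
protein_d.pdb
)
EOF

# in any script that needs the list
source "$list_dir/list.sh"

echo "${#pdbs[@]} proteins selected"
for pdb in "${pdbs[@]}"; do
    echo "$pdb"
done

rm -rf "$list_dir"
```

Here ${#pdbs[@]} gives the number of entries, which is a quick way to check that the commented-out protein really was skipped.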



Yossa Dwi Hartono