Good practice for bash scripting

I will illustrate some good practices for writing bash scripts by showing how I wrote and refactored my script dock.sh, which does some preparation for docking and then launches the docking run.

  1. Outline with pseudocode
    What do you want the script to do? Our goal is to translate that into something the machine can understand. Here I have divided my script into 3 chunks, in the order I want them executed:
    main(){
    process_protein
    make_dock_input
    run_dock
    }
       
    process_protein(){
    loop over unprocessed proteins
    }
       
    make_dock_input(){
    need ligand path
    need protein paths --> loop
    }
       
    run_dock(){
    call docking programme
    }
       
    main
    

    Here I have defined a function main that calls the 3 other functions in order. I keep the definition of main at the top of the file for easy editing; the actual call to main sits on the last line, after all the functions have been defined. The script is structured this way so that we can think about it at a higher level (there are 3 things that the script does) without needing to worry about the nitty-gritty inner workings of those 3 things. This concept is called abstraction. If need be, we can always zoom in and go to the lower level.

    It also makes it easy to comment out a whole functional block by just twiddling with main, like so:

     main(){
     process_protein
     make_dock_input
     #run_dock
     }
    

    Good-practice-wise, the body of a function should be indented. Python strictly enforces indentation, but bash does not. I find that my lack of indentation does not affect readability here (and my script has no nested blocks inside the functions at this stage, where proper indentation becomes more important), so in this case I prefer no indentation.

     # with proper indentation
     main(){
         process_protein
         make_dock_input
         run_dock
     }
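
As written, the outline is pseudocode and bash will not parse it, because a function body must contain at least one command. A minimal runnable skeleton might look like the sketch below. The echo lines are placeholders of my own choosing, standing in for the real work. Note that main is only called on the last line, after every function it uses has been defined.

```shell
#!/usr/bin/env bash
# Runnable version of the outline. The echo lines are placeholders
# (an assumption for illustration), to be replaced by real code later.

main(){
process_protein
make_dock_input
run_dock
}

process_protein(){
echo "TODO: loop over unprocessed proteins"
}

make_dock_input(){
echo "TODO: write ligand and protein paths into dock.inp"
}

run_dock(){
echo "TODO: call docking programme"
}

# Call main last, so all the functions above are already defined.
main
```

Running the skeleton prints the three TODO lines in order, which is a cheap way to confirm the overall flow before any real code exists.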
    
  2. If the outline looks alright, replace the pseudocode with the real code.
    I am tackling the easiest first, run_dock (let’s pretend the docking executable is docking_programme):

     run_dock(){
     docking_programme dock.inp
     }
    

    Then process_protein (assume pdb_to_dck_converter is a programme that does cleanups and preps for docking and outputs dock-ready file .dck):

     process_protein(){
     for pdb in *.pdb; do
         echo "$pdb"
         # basename -s strips the .pdb extension; .dck is then appended
         pdb_to_dck_converter -in "$pdb" -out "$(basename -s .pdb "$pdb").dck"
     done
     }
    

    Constructing the for loop is easy. We iterate over all .pdb files in the current directory (make sure the only .pdb files in the directory are the unprocessed ones). Note the deliberate indentation inside the loop. An echo of the current file is always good for future debugging. basename -s .pdb strips the .pdb extension, and we then append .dck. If you think a line is not self-explanatory, write a comment.
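
To see what the loop does to the file names without running a real converter, here is a small sketch. It works in a throwaway directory with two made-up file names and uses bash's `${pdb%.pdb}` parameter expansion, which for a plain file name gives the same result as `basename -s .pdb`. The `nullglob` option is my own addition: without it, a directory with no .pdb files makes the loop run once with the literal string `*.pdb`.

```shell
#!/usr/bin/env bash
# Sketch only: the file names are made up, and no converter is called.

list_dck_names(){
# nullglob: an unmatched *.pdb expands to nothing instead of itself
shopt -s nullglob
for pdb in *.pdb; do
    # ${pdb%.pdb} removes the trailing .pdb,
    # same result as: basename -s .pdb "$pdb"
    echo "$pdb -> ${pdb%.pdb}.dck"
done
shopt -u nullglob
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch protein1.pdb protein2.pdb
list_dck_names
```

Quoting "$pdb" everywhere also keeps the loop safe if a file name ever contains a space.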

    Now here is a trickier one: make_dock_input. Starting with a barebone template, I need to supply ligand and protein paths. The template looks like this:

     # template.dock.inp
     protein_paths	
     ligand_path	
     SOME_OTHER_PARAMETERS
    

    While I need the final input to look like this:

     # dock.inp
     PROTEIN /path/protein1.dck
     PROTEIN /path/protein2.dck
     PROTEIN /path/protein3.dck
     ...
     LIGAND /path/ligand.lig	
     SOME_OTHER_PARAMETERS
    

    At first, I thought a simple sed substitution would do:

     # doesn't work: the slashes in $ligand_path clash with sed's / delimiter
     ligand_path="/path/ligand.lig"
     sed "s/ligand_path/$ligand_path/" template.dock.inp > dock.inp
    

    But no: the slashes in $ligand_path clash with the / that sed uses as the delimiter of its s command. Some clever cleanup or escaping could work around this, but in the end I stuck to simple insertion and deletion. Cleanup would add lines unnecessarily, and escape characters would make the code much less readable.

     # insert at the line before 'placeholder' ligand_path
     ligand_path="/path/ligand.lig"
     echo "$ligand_path"
     sed "/ligand_path/i LIGAND $ligand_path" template.dock.inp > dock.inp
     # delete 'placeholder' ligand_path
     sed -i '/ligand_path/d' dock.inp
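
For comparison, the substitution can also be made to work by choosing a delimiter that does not occur in the path, since sed accepts characters other than / after the s. This is not the approach used in this post, just a minimal sketch under the assumption that the path never contains |, & or a backslash. The template here is cut down to three bare lines.

```shell
#!/usr/bin/env bash
# Sketch: use | instead of / as the delimiter of sed's s command.
# Assumes the path contains no |, & or backslash.

ligand_path="/path/ligand.lig"
template=$(printf '%s\n' "protein_paths" "ligand_path" "SOME_OTHER_PARAMETERS")

# Replace the placeholder with the final LIGAND line
result=$(printf '%s\n' "$template" | sed "s|ligand_path|LIGAND $ligand_path|")
printf '%s\n' "$result"
```

The trade-off is that the assumption about the path has to hold for every path you ever feed in, which the insert-and-delete approach does not require.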
    

    How about the protein paths? The logic is the same as for the ligand path; just wrap it in a loop:

     for pdb in *.pdb; do
         # insert at the line before 'placeholder' protein_path
         dck=$(basename -s .pdb "$pdb").dck
         protein_path="PROTEIN /path/$dck"
         echo "$protein_path"
         sed -i "/protein_paths/i $protein_path" dock.inp
     done
     # delete 'placeholder' protein_paths
     sed -i '/protein_paths/d' dock.inp
    

    Combine them:

     make_dock_input(){
     # insert at the line before 'placeholder' ligand_path
     ligand_path="/path/ligand.lig"
     echo "$ligand_path"
     sed "/ligand_path/i LIGAND $ligand_path" template.dock.inp > dock.inp
     # delete 'placeholder' ligand_path
     sed -i '/ligand_path/d' dock.inp

     for pdb in *.pdb; do
         # insert at the line before 'placeholder' protein_paths
         dck=$(basename -s .pdb "$pdb").dck
         protein_path="PROTEIN /path/$dck"
         echo "$protein_path"
         sed -i "/protein_paths/i $protein_path" dock.inp
     done
     # delete 'placeholder' protein_paths
     sed -i '/protein_paths/d' dock.inp
     }
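
A quick way to check the function is a dry run in a scratch directory. The sketch below assumes GNU sed (the one-line form of the i command and -i without a backup suffix are GNU extensions), uses made-up file names, and prefixes the ligand line with LIGAND so that the result matches the target dock.inp shown earlier.

```shell
#!/usr/bin/env bash
# Dry run of make_dock_input in a scratch directory.
# Assumes GNU sed; the file names are made up for illustration.

make_dock_input(){
ligand_path="/path/ligand.lig"
# insert the LIGAND line before the placeholder, then delete the placeholder
sed "/ligand_path/i LIGAND $ligand_path" template.dock.inp > dock.inp
sed -i '/ligand_path/d' dock.inp

for pdb in *.pdb; do
    dck="${pdb%.pdb}.dck"
    sed -i "/protein_paths/i PROTEIN /path/$dck" dock.inp
done
sed -i '/protein_paths/d' dock.inp
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' "protein_paths" "ligand_path" "SOME_OTHER_PARAMETERS" > template.dock.inp
touch protein1.pdb protein2.pdb

make_dock_input
cat dock.inp
```

If the placeholders are gone and the PROTEIN lines appear in file order above the LIGAND line, the function behaves as intended.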
    
  3. Debug and refactor
    Test the script. Does it do what you want it to do? If it does, great! But your work is not done yet. You should put some effort into refactoring, which does not change what the script outputs but tidies up the source code. What you want is source code that is readable and maintainable. For a start, you may want to combine some lines:
    sed -i '/ligand_path/d' dock.inp
    sed -i '/protein_paths/d' dock.inp
    # one line
    sed -i '/ligand_path/d; /protein_paths/d' dock.inp
    

    Still very much readable, isn’t it?

    Now take a look at the function and variable names. Are there unclear or meaningless ones like $x or $i? There is nothing wrong with $i as a throwaway variable in a for loop, but a more explicit name is better. Compare:

    for i in *.pdb; do
        #do things with $i
        ...
    done
    

    to:

    for pdb in *.pdb; do
        #do things with $pdb
        ...
    done
    

    Or even:

    for pdb_file in *.pdb; do
        #do things with $pdb_file
        ...
    done
    

    Another example – you might decide that this

    sed -i "/protein_paths/i $protein_path" dock.inp
    

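    is a little confusing, since nearly the same name appears twice: once as the placeholder and once as the variable. Renaming the placeholder (in template.dock.inp as well) might be better: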

    # more explicit
    sed -i "/protein_path_placeholder/i $protein_path" dock.inp
    

    Lastly, add more comments. They are valuable documentation, especially when future you digs out this script again in the distant future! Take another look at how I wrote the comments for the insertion and deletion steps in make_dock_input. On the other hand, do not overcomment.

    I read somewhere that for every line of code that ends up in the final version, one typically writes about 10 times as much along the way! That should give you some idea of how much editing is done.

BONUS: how to specify a dynamic list

This is more of a tip than a good practice. I often find myself needing to do things to a list of proteins, but I want the flexibility to include and exclude some of them. In the example above, for pdb in *.pdb iterates over all .pdb files in the directory, so the list is fixed.

The solution that I like is bash array:

# quotation marks are optional as long as the names contain no spaces
pdbs=(
protein_a.pdb
protein_b.pdb
#protein_c.pdb # easy excluding by commenting
protein_d.pdb
)

The only drawback is that the syntax for looping over the array is not something I can write off the top of my head:

for pdb in "${pdbs[@]}"; do
    ...
done
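
Put together, the array and the loop might look like the sketch below, again with made-up file names. `${#pdbs[@]}` gives the number of elements, which is a quick way to check that the commented-out entry really is skipped.

```shell
#!/usr/bin/env bash
# Sketch with made-up file names; nothing is read from disk.

pdbs=(
protein_a.pdb
protein_b.pdb
#protein_c.pdb # excluded by commenting
protein_d.pdb
)

echo "number of proteins: ${#pdbs[@]}"

# Quote "${pdbs[@]}" so each element stays one word, even with spaces
for pdb in "${pdbs[@]}"; do
    echo "$pdb"
done
```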

If you use the list from several different scripts, put the array into a master list file list.sh. In each script you then just need to include source /path/list.sh. Now you only need to edit the list in one place.
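
A minimal sketch of the master list idea follows. To keep it self-contained, list.sh is written to a scratch directory here; in practice it would sit at a fixed location such as the /path/list.sh mentioned above. The file names are made up.

```shell
#!/usr/bin/env bash
# Sketch: create a master list in a scratch directory, then source it.

list_dir=$(mktemp -d)
printf '%s\n' 'pdbs=(' 'protein_a.pdb' '#protein_c.pdb' 'protein_d.pdb' ')' > "$list_dir/list.sh"

# Any script that needs the list sources the same file
source "$list_dir/list.sh"

for pdb in "${pdbs[@]}"; do
    echo "$pdb"
done
```

Because source runs list.sh in the current shell, the pdbs array is available to the rest of the script exactly as if it had been defined there.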
