# Formatting a list of SMILES with bash scripting

Here is a fun text manipulation exercise in bash that you can do in less than an hour. Suppose I have this text file, file.smi:

smiles1 some_id_abc
smiles2 some_id_xyz
...


I want to have it like this:

smiles1 C00000001
smiles2 C00000002
...


This came from a real-world need: converting a list of SMILES to a format that is accepted by a conversion programme. Looks easy, right?
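To follow along, a small test file can be created like this. The three SMILES strings are real (ethanol, benzene, acetic acid), but the old ids are made-up placeholders in the style of the example above, and the third id is my own invention:

```shell
# Create a tiny sample file.smi in the current directory.
# SMILES: ethanol, benzene, acetic acid; the ids are placeholders.
cat > file.smi <<'EOF'
CCO some_id_abc
c1ccccc1 some_id_xyz
CC(=O)O some_id_123
EOF

# Show what we have
cat file.smi
```

For this input, the goal is to end up with `CCO C00000001`, `c1ccccc1 C00000002` and `CC(=O)O C00000003`.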

Here is what I did. First I need to generate a sequence that ends at the total number of lines in file.smi:

# count lines in file.smi
wc -l < file.smi
# print sequence (1,2,3,...)
seq $(wc -l < file.smi)
# Add zero padding up to 8 characters (0000000x, 000000xx)
seq -f "%08g" $(wc -l < file.smi)
# Add 'C' in the beginning (C0000000x, C000000xx)
seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/' > new_ids_tempfile

Zero padding is the tricky part; the rest is just column manipulation:

# Put together smiles and new ids, change delimiter from tab to space
paste -d ' ' file.smi new_ids_tempfile
# file.smi may or may not have old ids, so just output first and last columns
# to make sure only smiles and new ids are printed
paste -d ' ' file.smi new_ids_tempfile | awk '{print $1, $NF}'

Finally we can eliminate the intermediary tempfile by generating it on the fly:

paste -d ' ' file.smi <(seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/') | awk '{print $1, $NF}'
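As an aside, the same output can probably be had from awk alone, using its built-in line counter `NR` instead of `seq`, `sed` and `paste`. This is a different technique from the pipeline above, not what I originally did. A minimal sketch, assuming the SMILES is always the first whitespace-separated field; `renumber_smiles` is a hypothetical helper name:

```shell
# Alternative sketch: let awk number the lines itself.
# renumber_smiles is a made-up name, not part of the original one-liner.
renumber_smiles() {
    # $1 (shell) is the input file; inside awk, $1 is the first column
    # and NR is the current line number, zero-padded to 8 digits by %08d.
    awk '{ printf "%s C%08d\n", $1, NR }' "$1"
}
```

Called as `renumber_smiles file.smi`. Since it does not rely on `wc -l`, a missing newline at the end of the file should not throw the numbering off.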


As with many one-liner text manipulation scripts, this is pretty unintelligible at first glance. To do my future self a favour, I add extensive comments:

paste -d ' ' file.smi <(seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/') | awk '{print $1, $NF}'
#                                       |--count lines in file.smi
#                       |--make sequences up to count, pad with zero up to 8 char width
#                     |--put the new ids C00000001,... in a temp file
#|--paste together smiles and new ids, delimiter is single space
#                                                                         |--finally, only output
#                                                                            first and last columns
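Another way to keep this readable is to spread the pipeline over several lines inside a small function, one comment per step. This is only a sketch: `add_ids`, its arguments and the default prefix are hypothetical names of my own. It also differs from the one-liner in two ways: the prefix goes straight into the `seq` format string instead of through `sed`, and `paste` reads the ids from stdin (`-`) instead of from a process substitution.

```shell
# add_ids is a hypothetical helper: add_ids INFILE [PREFIX]
add_ids() {
    local infile=$1
    local prefix=${2:-C}              # id prefix, defaults to C
    local n
    n=$(wc -l < "$infile")            # count lines in the input file
    seq -f "${prefix}%08g" "$n" |     # C00000001, C00000002, ...
        paste -d ' ' "$infile" - |    # '-' tells paste to read the ids from stdin
        awk '{print $1, $NF}'         # keep only first (SMILES) and last (new id) columns
}
```

Usage would be `add_ids file.smi > renamed.smi`, where `renamed.smi` is just an example output name.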


Hopefully this still makes sense in the future \(“▔□▔)/ *cross fingers*

Update: I found a nice website (explainshell.com) which can parse complicated one-liner shell commands like the above. Here is the parsing of the above.
