Formatting list of SMILES with bash scripting
Here is a fun text manipulation exercise in bash that you can do in less than an hour. Say I have this text file, file.smi:
smiles1 some_id_abc
smiles2 some_id_xyz
...
I want to have it like this:
smiles1 C00000001
smiles2 C00000002
...
This came from a real-world need: converting a list of SMILES into a format that is accepted by a conversion programme. Looks easy, right?
Here is what I did. First I need to generate a sequence ending at the total number of lines in file.smi:
# count lines in file.smi
wc -l < file.smi
# print sequence (1,2,3,...)
seq $(wc -l < file.smi)
# Add zero padding up to 8 characters (0000000x, 000000xx)
seq -f "%08g" $(wc -l < file.smi)
# Add 'C' in the beginning (C0000000x, C000000xx)
seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/' > new_ids_tempfile
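A caveat on the padding step: as far as I know, `%g` keeps only six significant digits, so with GNU seq a file of a million lines or more would start producing ids in exponent notation (`1e+06`). A minimal sketch that avoids this, assuming GNU seq, uses `%.0f` instead and puts the 'C' directly in the format string so the sed step is not needed. The `make_ids` name is made up for illustration:

```shell
# make_ids is a hypothetical helper name; assumes GNU seq
# %08.0f pads to 8 digits and does not switch to exponent notation
# the literal C in the format string replaces the sed step
make_ids() {
  seq -f "C%08.0f" "$1"
}

# Print three ids
make_ids 3
```

In the steps above it would be called as `make_ids "$(wc -l < file.smi)" > new_ids_tempfile`.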
Zero padding is the tricky part; the rest is just column manipulation:
# Put together smiles and new ids, change delimiter from tab to space
paste -d ' ' file.smi new_ids_tempfile
# file.smi may or may not have old ids, so just output first and last columns
# to make sure only smiles and new ids are printed
paste -d ' ' file.smi new_ids_tempfile | awk '{print $1, $NF}'
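One thing that can bite here: `wc -l` counts newline characters, so if the last line of file.smi has no trailing newline, the sequence comes out one id short and the last row is pasted without a new id. A small sketch of a count that does not depend on the trailing newline, using awk's record counter (`count_lines` is a made-up helper name, and the input is made-up sample data):

```shell
# Made-up two-line input whose last line has no trailing newline
# wc -l counts newline characters, so it misses the unterminated last line
printf 'CCO x\nCCN y' | wc -l

# count_lines is a hypothetical helper; awk counts records,
# so the unterminated last line is included
count_lines() {
  awk 'END { print NR }' "$@"
}
printf 'CCO x\nCCN y' | count_lines
```

If the file always ends with a newline, `wc -l` and this helper should agree and the original command is fine as it is.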
Finally we can eliminate the intermediary tempfile by generating it on the fly:
paste -d ' ' file.smi <(seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/') | awk '{print $1, $NF}'
As with many one-liner text manipulation scripts, this is pretty unintelligible at first glance. To do my future self a favour, I add extensive comments:
paste -d ' ' file.smi <(seq -f "%08g" $(wc -l < file.smi) | sed 's/^/C/') | awk '{print $1, $NF}'
#                                       |--count lines in file.smi
#                       |--make sequences up to count, pad with zero up to 8 char width
#                                                           |--add 'C' at beginning
#                     |--put the new ids C00000001,... in a temp file
#|--paste together smiles and new ids, delimiter is single space
#                                                                           |--finally, only output
#                                                                              first and last columns
Hopefully this still makes sense in the future \(“▔□▔)/ *cross fingers*
Update: I found a nice website (explainshell.com) which can parse complicated one-liner shell commands like the above. Here is the parsing of the above.
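For comparison, the same result can probably be had from a single awk call, since awk already tracks the current line number in `NR` and its `printf` can do the zero padding. This is only a sketch of an alternative, not what I used above; `renumber_smiles` is a made-up name and the input is made-up sample data:

```shell
# renumber_smiles is a hypothetical helper:
# keep the first column, append C plus the line number (NR) zero-padded to 8 digits
renumber_smiles() {
  awk '{ printf "%s C%08d\n", $1, NR }' "$@"
}

# Usage on made-up input (reads stdin when no file is given)
printf 'CCO some_id_abc\nc1ccccc1 some_id_xyz\n' | renumber_smiles
```

On a real file this would be `renumber_smiles file.smi`. It gives up the seq/paste exercise, but there is no line count to get wrong and no temp file.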