Software developers know this already but when working on the latest publication recently, I was reminded that we should document obsessively.
What was challenging in this project is, even though the amount of generated data is not massive (only about equivalent of thousands of spreadsheet rows), a lot of variables are involved in a complicated way. I briefly considered putting them in a pandas dataframe, but it proved to be tedious and impractical. Furthermore, there are many ways to slice it, and in the beginning it was not clear what was best. Different slices give slightly different result, and this is important because the result go to a testing pipeline. So we tried a couple of different slices, tested them, the pandemic happened, and some time later, uhh… which slices exactly did we do again? I thank my past self that although the folder is a bit messy, I retained all the raw data and all the bash scripts for each processing step was right there, from raw, right until beautiful publication figure. Although it did take a while to get my bearings again in the folder, thankfully I was able to retrace my steps again.
Here are some notable points to self:
- Intermediary scripts should have checks built in. Expose the temporary files; don’t hide them in a black box.
- Just text files is not so good for documentation. In this case I had a companion gSheets and jupyter notebook and they really help.
- Put raw data and analysis in different folders (I didn’t do this)
Leave a Comment
Your email address will not be published. Required fields are marked *