Don Armstrong: pages tagged bioinformatics (https://www.donarmstrong.com/tags/bioinformatics/), generated by ikiwiki, updated 2016-06-15T20:23:44Z

Bioinformatic Supercomputer Wishlist
https://www.donarmstrong.com/posts/supercomputer_wishlist/
Published 2016-06-15T20:22:28Z, updated 2016-06-15T20:23:44Z
<p>Many bioinformatic problems require large amounts of memory and
processor time to complete. For example, running WGCNA across 10⁶ CpG
sites requires 10⁶ choose 2, or about 5×10¹¹, comparisons, and storing
the resulting 10⁶ × 10⁶ matrix at 8 bytes per value needs roughly 8 TB.
While the problem is embarrassingly parallel, the dataset upon which the
regressions are calculated is very large and cannot fit into the main
memory of most existing supercomputers, which are often tuned for
small-data, fast-interconnect problems.</p>
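<p>The pair count and the storage estimate can be checked with a short
Python sketch. The function name <code>pairwise_matrix_cost</code> and the
assumption of 8 bytes per value (double precision) are illustrative only
and are not taken from any particular WGCNA implementation.</p>

```python
from math import comb


def pairwise_matrix_cost(n_sites: int, bytes_per_value: int = 8) -> dict:
    """Count unique site pairs and estimate matrix storage in terabytes.

    Assumes one stored value per pair and a hypothetical 8 bytes per
    value; 1 TB is taken as 10**12 bytes.
    """
    pairs = comb(n_sites, 2)  # unordered pairs, excluding the diagonal
    return {
        "pairs": pairs,
        # Only the upper triangle (one value per unordered pair).
        "upper_triangle_tb": pairs * bytes_per_value / 1e12,
        # Full square matrix, including the diagonal and both triangles.
        "full_matrix_tb": n_sites * n_sites * bytes_per_value / 1e12,
    }


if __name__ == "__main__":
    cost = pairwise_matrix_cost(10**6)
    for name, value in cost.items():
        print(f"{name}: {value}")
```

<p>Storing only the upper triangle roughly halves the footprint, at the
cost of more awkward indexing.</p>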
<p>Another problem which I am interested in is computing ancestral trees
from whole human genomes. This involves running maximum likelihood
calculations across 10⁹ bases and thousands of samples. The matrix
itself could potentially take 1 TB, and calculating the likelihood
across that many positions is computationally expensive. Furthermore,
an exhaustive search of trees for 2000 individuals requires 2000!!
(a double factorial) comparisons, or about 10²⁸⁶⁸; even searching a small
fraction of that space requires substantial computational time.</p>
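<p>The size of a double factorial such as 2000!! is easiest to check in
log space, since the value itself is far too large for a floating-point
number. The sketch below sums base-10 logarithms; the function name
<code>log10_double_factorial</code> is illustrative.</p>

```python
from math import log10


def log10_double_factorial(n: int) -> float:
    """Return log10(n!!), where n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2.

    Summing logarithms avoids constructing the huge integer itself.
    """
    # range stops before 1, so the trailing factor 1 (log10 = 0) is skipped.
    return sum(log10(k) for k in range(n, 1, -2))


if __name__ == "__main__":
    exponent = log10_double_factorial(2000)
    print(f"2000!! is about 10^{exponent:.1f}")
```

<p>The same function can be applied to other counts. For instance, the
usual count of rooted bifurcating trees on n tips is (2n−3)!!, which
<code>log10_double_factorial(2 * 2000 - 3)</code> would estimate for 2000
individuals.</p>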
<p>Features of a future supercomputer that would enable better
solutions to bioinformatic problems include:</p>
<ol>
<li>Fast local storage</li>
<li>Better hierarchical storage with smarter caching. Data should
ideally move easily between local memory, shared memory, local
storage, and remote storage.</li>
<li>Fault-tolerant, storage-affinity-aware schedulers.</li>
<li>GPUs and/or other coprocessors with larger memory and faster memory
interconnects.</li>
<li>Larger memory (at least on some nodes)</li>
<li>Support for Docker (or similar) images.</li>
<li>Better bioinformatics software which can actually take advantage of
advances in computer architecture.</li>
</ol>
Essential Data Science: Git
https://www.donarmstrong.com/posts/essential_data_science_git/
Published 2016-05-23T19:48:46Z, updated 2016-05-23T19:50:29Z
<p>Having a new student join me to work in the lab reminded me that I
should collect some of the many resources available for getting started
in bioinformatics and in data-based science in general. To that end,
one of the first essential skills for any data scientist is a working
knowledge of <a href="https://en.wikipedia.org/wiki/Git_(software)">git</a>.</p>
<p>Start with
<a href="https://try.github.io/levels/1/challenges/1">Code School's simple introduction to git</a>,
which gives you the basics of using git from the command line.</p>
<p>Then, check out this
<a href="https://www.youtube.com/playlist?list=PL5-da3qGB5IBLMp7LtN8Nc3Efd4hJq0kD">set of lectures on Git and GitHub</a>,
which goes into setting up git and using it with GitHub. These lectures
were used in a data science course.</p>
<p>Finally, I'd check out
<a href="https://help.github.com/articles/good-resources-for-learning-git-and-github/">the set of resources on GitHub</a>
for even more information, and then learn to love the
<a href="https://www.kernel.org/pub/software/scm/git/docs/git.html">git manpages</a>.</p>