<h1><a href="https://www.donarmstrong.com/tags/biology/">Pages tagged biology</a></h1>
<p>Don Armstrong; last updated 2017-05-12T20:22:04Z</p>
<h2><a href="https://www.donarmstrong.com/posts/placenta_viviparous_paper/">Core Transcriptome of Mammalian Placentas</a></h2>
<p>Published 2017-05-12T20:20:49Z; updated 2017-05-12T20:22:04Z</p>
<p>Our
<a href="http://www.sciencedirect.com/science/article/pii/S0143400417302369">paper which describes the components of the placenta transcriptome which are conserved among all placental mammals</a> in
<em>Placenta</em> just came out today. More important than the results and
the text of the paper, though, is the fact that all of the code and
results of this paper, from the very first work I did two years ago to
its publication today, are present in git, and (in theory) reproducible.</p>
<p>You can see where our paper was rejected from
<em><a href="https://github.com/uiuc-cgm/placenta-viviparous/tree/genome_biology_submission_4202016">Genome Biology</a></em>
and
<em><a href="https://github.com/uiuc-cgm/placenta-viviparous/tree/genes_dev_submission">Genes &amp; Development</a></em> and
<a href="https://github.com/uiuc-cgm/placenta-viviparous/compare/genes_dev_submission...master#diff-dfaf04b448930bc40fe8b2907f2c7223">radically refocused before submission to
<em>Placenta</em></a>.
But more importantly, you can see where every single result
mentioned in the paper came from, the precise code to generate it, and
how we came to the final paper which was published. [And you've also
got all of the hooks to branch off from our analysis to do your own
analysis based on our data!]</p>
<p>This is what open, reproducible science should look like.</p>
<h2><a href="https://www.donarmstrong.com/posts/shrinking_gene_names/">Shrinking lists of gene names in R</a></h2>
<p>Published 2017-03-23T20:45:44Z; updated 2017-03-23T20:57:21Z</p>
<p>I've been trying to finish a paper where I compare gene expression in
14 different placentas. One of the supplemental figures compares
median expression in gene trees across all 14 species, but because
tree ids like
<a href="http://www.ensembl.org/Multi/GeneTree/Image?gt=ENSGT00840000129673">ENSGT00840000129673</a>
aren't very expressive, and names like
"COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5,
COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2,
COL5A1, COL24A1" take up too much space, I wanted a function which could
collapse the gene names into something which uses bash glob syntax to
more succinctly list the gene names, like:
COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.</p>
<p>Thus, a crazy function which uses <code>lcprefix</code> from <code>Biostrings</code> and
some looping was born:</p>
<div class="highlight-r"><pre class="hl">collapse<span class="hl opt">.</span>gene<span class="hl opt">.</span>names <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>x<span class="hl opt">,</span>min<span class="hl opt">.</span>collapse<span class="hl opt">=</span><span class="hl num">2</span><span class="hl opt">) {</span>
<span class="hl slc">## collapse names which share a longest common prefix</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>x<span class="hl opt">) ||</span> <span class="hl kwd">length</span><span class="hl opt">(</span>x<span class="hl opt">)==</span><span class="hl num">0</span><span class="hl opt">) {</span>
<span class="hl kwd">return</span><span class="hl opt">(</span>as<span class="hl opt">.</span><span class="hl kwd">character</span><span class="hl opt">(</span><span class="hl kwb">NA</span><span class="hl opt">))</span>
<span class="hl opt">}</span>
x <span class="hl opt"><-</span> <span class="hl kwd">sort</span><span class="hl opt">(</span><span class="hl kwd">unique</span><span class="hl opt">(</span>x<span class="hl opt">))</span>
str_collapse <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>y<span class="hl opt">,</span>len<span class="hl opt">) {</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>len <span class="hl opt">==</span> <span class="hl num">1</span> <span class="hl opt">||</span> <span class="hl kwd">length</span><span class="hl opt">(</span>y<span class="hl opt">) <</span> <span class="hl num">2</span><span class="hl opt">) {</span>
<span class="hl kwd">return</span><span class="hl opt">(</span>y<span class="hl opt">)</span>
<span class="hl opt">}</span>
y<span class="hl opt">.</span>tree <span class="hl opt"><-</span>
<span class="hl kwd">gsub</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span><span class="hl str">"^(.{"</span><span class="hl opt">,</span>len<span class="hl opt">,</span><span class="hl str">"}).*$"</span><span class="hl opt">),</span><span class="hl str">"</span><span class="hl esc">\\</span><span class="hl str">1"</span><span class="hl opt">,</span>y<span class="hl opt">[</span><span class="hl num">1</span><span class="hl opt">])</span>
y<span class="hl opt">.</span>rem <span class="hl opt"><-</span>
<span class="hl kwd">gsub</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span><span class="hl str">"^.{"</span><span class="hl opt">,</span>len<span class="hl opt">,</span><span class="hl str">"}"</span><span class="hl opt">),</span><span class="hl str">""</span><span class="hl opt">,</span>y<span class="hl opt">)</span>
y<span class="hl opt">.</span>rem<span class="hl opt">.</span>prefix <span class="hl opt"><-</span>
<span class="hl kwd">sum</span><span class="hl opt">(</span><span class="hl kwd">combn</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">,</span><span class="hl num">2</span><span class="hl opt">,</span><span class="hl kwa">function</span><span class="hl opt">(</span>x<span class="hl opt">){</span>Biostrings<span class="hl opt">::</span><span class="hl kwd">lcprefix</span><span class="hl opt">(</span>x<span class="hl opt">[</span><span class="hl num">1</span><span class="hl opt">],</span>x<span class="hl opt">[</span><span class="hl num">2</span><span class="hl opt">])}) >=</span> <span class="hl num">2</span><span class="hl opt">)</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span><span class="hl kwd">length</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">) ></span> <span class="hl num">3</span> <span class="hl opt">&&</span>
y<span class="hl opt">.</span>rem<span class="hl opt">.</span>prefix <span class="hl opt">>=</span> <span class="hl num">2</span>
<span class="hl opt">) {</span>
y<span class="hl opt">.</span>rem <span class="hl opt"><-</span>
collapse<span class="hl opt">.</span>gene<span class="hl opt">.</span><span class="hl kwd">names</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">,</span>min<span class="hl opt">.</span>collapse<span class="hl opt">=</span><span class="hl num">1</span><span class="hl opt">)</span>
<span class="hl opt">}</span>
<span class="hl kwd">paste0</span><span class="hl opt">(</span>y<span class="hl opt">.</span>tree<span class="hl opt">,</span>
<span class="hl str">"{"</span><span class="hl opt">,</span><span class="hl kwd">paste</span><span class="hl opt">(</span>collapse<span class="hl opt">=</span><span class="hl str">","</span><span class="hl opt">,</span>
y<span class="hl opt">.</span>rem<span class="hl opt">),</span><span class="hl str">"}"</span><span class="hl opt">)</span>
<span class="hl opt">}</span>
i <span class="hl opt"><-</span> <span class="hl num">1</span>
ret <span class="hl opt"><-</span> <span class="hl kwb">NULL</span>
<span class="hl kwa">while</span> <span class="hl opt">(</span>i <span class="hl opt"><=</span> <span class="hl kwd">length</span><span class="hl opt">(</span>x<span class="hl opt">)) {</span>
col<span class="hl opt">.</span>pmin <span class="hl opt"><-</span>
<span class="hl kwd">pmin</span><span class="hl opt">(</span><span class="hl kwd">sapply</span><span class="hl opt">(</span>x<span class="hl opt">,</span>Biostrings<span class="hl opt">::</span>lcprefix<span class="hl opt">,</span>x<span class="hl opt">[</span>i<span class="hl opt">]))</span>
collapseable <span class="hl opt"><-</span>
<span class="hl kwd">which</span><span class="hl opt">(</span>col<span class="hl opt">.</span>pmin <span class="hl opt">></span> min<span class="hl opt">.</span>collapse<span class="hl opt">)</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span><span class="hl kwd">length</span><span class="hl opt">(</span>collapseable<span class="hl opt">) ==</span> <span class="hl num">0</span><span class="hl opt">) {</span>
ret <span class="hl opt"><-</span> <span class="hl kwd">c</span><span class="hl opt">(</span>ret<span class="hl opt">,</span>x<span class="hl opt">[</span>i<span class="hl opt">])</span>
i <span class="hl opt"><-</span> i<span class="hl opt">+</span><span class="hl num">1</span>
<span class="hl opt">}</span> <span class="hl kwa">else</span> <span class="hl opt">{</span>
ret <span class="hl opt"><-</span> <span class="hl kwd">c</span><span class="hl opt">(</span>ret<span class="hl opt">,</span>
<span class="hl kwd">str_collapse</span><span class="hl opt">(</span>x<span class="hl opt">[</span>collapseable<span class="hl opt">],</span>
<span class="hl kwd">min</span><span class="hl opt">(</span>col<span class="hl opt">.</span>pmin<span class="hl opt">[</span>collapseable<span class="hl opt">]))</span>
<span class="hl opt">)</span>
i <span class="hl opt"><-</span> <span class="hl kwd">max</span><span class="hl opt">(</span>collapseable<span class="hl opt">)+</span><span class="hl num">1</span>
<span class="hl opt">}</span>
<span class="hl opt">}</span>
<span class="hl kwd">return</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span>collapse<span class="hl opt">=</span><span class="hl str">","</span><span class="hl opt">,</span>ret<span class="hl opt">))</span>
<span class="hl opt">}</span>
</pre></div>
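<p>For comparison, here is a minimal Python sketch of the same idea. It is
not a port of the R function above: it uses a simpler recursive scheme
(strip the prefix shared by every name, then group the remainders by their
first character), so it nests more aggressively and its output differs from
the example string above. The name <code>collapse_names</code> is a
hypothetical helper chosen for illustration.</p>

```python
from itertools import groupby
from os.path import commonprefix


def collapse_names(names):
    """Collapse names into one bash-style brace expression (illustrative helper)."""
    names = sorted(set(names))
    if not names:
        return ""
    if len(names) == 1:
        return names[0]
    # Longest prefix shared by every name, compared character by character.
    prefix = commonprefix(names)
    rests = [name[len(prefix):] for name in names]
    parts = []
    # rests is still sorted, so remainders with the same first character are adjacent.
    for _, group in groupby(rests, key=lambda rest: rest[:1]):
        group = list(group)
        # Recurse only when several remainders share a first character.
        parts.append(collapse_names(group) if len(group) > 1 else group[0])
    return prefix + "{" + ",".join(parts) + "}"


print(collapse_names(["COL1A1", "COL1A2"]))  # COL1A{1,2}
print(collapse_names(["COL4A1", "COL4A2", "COL5A1", "COL7A1"]))
```

<p>Unlike the R version, this sketch has no minimum prefix length, so
unrelated names end up wrapped in one outer pair of braces with an empty
prefix.</p>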
<h2><a href="https://www.donarmstrong.com/presentations/genome_diversity_2016/">Simons Genome Diversity</a></h2>
<p>Published 2016-10-19T20:42:06Z; updated 2016-10-20T17:31:51Z</p>
<p>This is a paper talk which I presented on October 20th at HPC Bio at UIUC.</p>
<ul>
<li><a href="http://git.donarmstrong.com/presentations/simons_genome_diversity.git">Code for slides in git</a></li>
<li><a href="https://www.donarmstrong.com/ld/gd2016/simons_genome_diversity_oct_2016.pdf">PDF of slides</a></li>
</ul>
<h2><a href="https://www.donarmstrong.com/posts/h3a_bionet_worfklowhackathon/">H3ABioNet Hackathon (Workflows)</a></h2>
<p>Published 2016-08-24T14:40:04Z; updated 2016-08-24T14:43:16Z</p>
<p>I'm in Pretoria, South Africa at the
<a href="http://h3abionet.org/">H3ABioNet</a>
<a href="http://h3abionet.org/17-h3abionet-courses/h3abionet-courses-upcoming/266-h3abionet-cloud-computing-hackathon">hackathon</a>
which is developing workflows for Illumina chip genotyping,
imputation, 16S rRNA sequencing, and population structure/association
testing. Currently, I'm working with the imputation stream and we're
using <a href="https://www.nextflow.io/">Nextflow</a> to deploy an
<a href="https://mathgen.stats.ox.ac.uk/impute/impute_v2.html">IMPUTE</a>-based
imputation workflow with Docker and
<a href="https://wiki.ncsa.illinois.edu/display/NEBULA/Nebula+Home">NCSA's OpenStack-based cloud (Nebula)</a>
underneath.</p>
<p>The OpenStack command line clients (<code>nova</code> and <code>cinder</code>) seem to be
pretty usable for
<a href="https://github.com/h3abionet/chipimputation/blob/master/openstack/generate_openstack">automating bringing up a fleet of VMs</a>,
and the cloud-init package which is present in the images makes
<a href="https://github.com/h3abionet/chipimputation/tree/master/openstack">configuring the images pretty simple</a>.</p>
<p>Now if I just knew of a better shared object store which was supported
by Nextflow in OpenStack besides mounting an NFS share, things would
be better.</p>
<p>You can follow our progress in our git repo:
<a href="https://github.com/h3abionet/chipimputation">https://github.com/h3abionet/chipimputation</a></p>
<h2><a href="https://www.donarmstrong.com/presentations/minhash_2016/">Mash: fast MinHash based k-mer sequence comparison</a></h2>
<p>Published 2016-06-23T18:54:27Z; updated 2016-06-23T18:54:27Z</p>
<p>This is a paper talk which I presented on June 23rd at HPC Bio at UIUC.</p>
<ul>
<li><a href="http://git.donarmstrong.com/mash_minhash_presentation.git">Code for slides in git</a></li>
<li><a href="http://www.donarmstrong.com/ld/minhash2016/hpcbio_mash_minhash_jun_2016.pdf">PDF of slides</a></li>
</ul>
<h2><a href="https://www.donarmstrong.com/posts/supercomputer_wishlist/">Bioinformatic Supercomputer Wishlist</a></h2>
<p>Published 2016-06-15T20:22:28Z; updated 2016-06-15T20:23:44Z</p>
<p>Many bioinformatic problems require large amounts of memory and
processor time to complete. For example, running WGCNA across 10⁶ CpG
sites requires 10⁶ choose 2, or about 5×10¹¹, comparisons, which needs
on the order of 10 TB to store the resulting matrix. While the problem is
embarrassingly parallel, the dataset upon which the regressions are
calculated is very large, and cannot fit into the main memory of most
existing supercomputers, which are often tuned for small-data
fast-interconnect problems.</p>
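<p>As a rough check of the scale involved, the short Python sketch below
counts the unordered pairs among 10⁶ sites and estimates how much storage
the pairwise matrix would need. The 8 bytes per entry (double precision) is
an assumption, not something stated above; a different precision changes
the totals proportionally.</p>

```python
import math

n_sites = 10**6

# Number of unordered pairs of sites: 10^6 choose 2.
pairs = math.comb(n_sites, 2)

# Assumption: each pairwise value is stored as an 8-byte float.
bytes_per_value = 8

# Storage for only the upper triangle versus the full square matrix, in TB (10^12 bytes).
upper_triangle_tb = pairs * bytes_per_value / 1e12
full_matrix_tb = n_sites**2 * bytes_per_value / 1e12

print(f"pairs: {pairs:.3e}")
print(f"upper triangle: {upper_triangle_tb:.1f} TB, full matrix: {full_matrix_tb:.1f} TB")
```

<p>Under that assumption the pair count is about 5×10¹¹, and the matrix
needs roughly 4 TB for the upper triangle alone or 8 TB when stored in
full.</p>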
<p>Another problem which I am interested in is computing ancestral trees
from whole human genomes. This involves running maximum likelihood
calculations across 10⁹ bases and thousands of samples. The matrix
itself could potentially take 1 TB, and calculating the likelihood
across that many positions is computationally expensive. Furthermore,
an exhaustive search of trees for 2000 individuals requires 2000!!
(the double factorial of 2000) comparisons, or about 10²⁸⁶⁸; even
searching a small fraction of that space requires a lot of
computational time.</p>
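<p>The size of that number can be checked without computing it directly.
The sketch below works in logarithms and relies on the identity
n!! = 2^(n/2) × (n/2)! for even n; the function name
<code>log10_double_factorial_even</code> is a hypothetical helper chosen
for illustration.</p>

```python
import math


def log10_double_factorial_even(n):
    """Return log10(n!!) for even n >= 0, using n!! = 2**(n/2) * (n/2)!."""
    if n < 0 or n % 2:
        raise ValueError("n must be a non-negative even integer")
    half = n // 2
    # lgamma(half + 1) is the natural log of half!; divide by ln(10) to get log10.
    return half * math.log10(2) + math.lgamma(half + 1) / math.log(10)


exponent = log10_double_factorial_even(2000)
print(math.floor(exponent))  # 2868
```

<p>Working in logarithms avoids building an integer with thousands of
digits just to read off its order of magnitude.</p>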
<p>Some things that a future supercomputer could have that would enable
better solutions to bioinformatic problems include:</p>
<ol>
<li>Fast local storage.</li>
<li>Better hierarchical storage with smarter caching. Data should
ideally move easily between local memory, shared memory, local
storage, and remote storage.</li>
<li>Fault-tolerant, storage-affinity-aware schedulers.</li>
<li>GPUs and/or other coprocessors with larger memory and faster memory
interconnects.</li>
<li>Larger memory (at least on some nodes).</li>
<li>Support for Docker (or similar) images.</li>
<li>Better bioinformatics software which can actually take advantage of
advances in computer architecture.</li>
</ol>
<h2><a href="https://www.donarmstrong.com/posts/essential_data_science_git/">Essential Data Science: Git</a></h2>
<p>Published 2016-05-23T19:48:46Z; updated 2016-05-23T19:50:29Z</p>
<p>Having a new student join me to work in the lab reminded me that I
should collect some of the many resources around for getting started
in bioinformatics and any data-based science in general. So towards
this end, one of the first essential tools for any data scientist is a
knowledge of <a href="https://en.wikipedia.org/wiki/Git_(software)">git</a>.</p>
<p>Start with
<a href="https://try.github.io/levels/1/challenges/1">Code School's simple introduction to git</a>,
which gives you the basics of using git from the command line.</p>
<p>Then, check out this
<a href="https://www.youtube.com/playlist?list=PL5-da3qGB5IBLMp7LtN8Nc3Efd4hJq0kD">set of lectures on Git and GitHub</a>,
which goes into setting up git and using it with GitHub. These lectures
were used in a Data Science course.</p>
<p>Finally, I'd check out
<a href="https://help.github.com/articles/good-resources-for-learning-git-and-github/">the set of resources on GitHub</a>
for even more information, and then learn to love the
<a href="https://www.kernel.org/pub/software/scm/git/docs/git.html">git manpages</a>.</p>
<h2><a href="https://www.donarmstrong.com/posters/placenta_viviparous_2016/">The evolution of gene expression in the term placenta of viviparous mammals</a></h2>
<p>Published 2016-04-27T19:11:55Z; updated 2016-05-10T22:26:45Z</p>
<p>This is a poster which was presented at the
<a href="http://individualizingmedicineconference.mayo.edu/">IGB Fellows Symposium</a>
and the <a href="http://www.knoweng.org/content/events-0">KnowEng 2016 EAC meeting</a>.</p>
<ul>
<li><a href="https://github.com/uiuc-cgm/placenta-viviparous">Poster code in git</a></li>
<li><a href="http://www.donarmstrong.com/ld/pv2016/placenta_viviparous_poster.pdf">PDF of poster</a></li>
<li><a href="http://www.donarmstrong.com/ld/pv2016/placenta_viviparous_presentation.pdf">Slides of a talk on this work</a></li>
</ul>
<p>This represents work which is currently under consideration for publication.</p>
<h2><a href="https://www.donarmstrong.com/presentations/diamond_presentation_2015/">DIAMOND: Fast protein alignment</a></h2>
<p>Published 2015-10-19T00:46:53Z; updated 2015-10-19T00:46:53Z</p>
<p>This is a paper talk which I presented on October 19th in the Genomic
Technologies Seminar at UofI.</p>
<ul>
<li><a href="http://git.donarmstrong.com/diamond_presentation.git">Code for slides in git</a></li>
<li><a href="http://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf">PDF of slides</a></li>
</ul>
<h2><a href="https://www.donarmstrong.com/posters/exosome_markers_2015/">Identifying the Tissue of Origin of Extracellular Vesicles Using RNA Expression Signatures</a></h2>
<p>Published 2015-09-17T17:49:03Z; updated 2015-10-05T15:39:48Z</p>
<p>This is a poster which was presented at the
<a href="http://individualizingmedicineconference.mayo.edu/">Individualizing Medicine 2015 conference at the Mayo Clinic</a>.</p>
<ul>
<li><a href="https://github.com/uiuc-cgm/exosome_markers_per_med_sep_2015">Poster code in git</a></li>
<li><a href="http://www.donarmstrong.com/ld/em2015/exosome_markers_per_med_sep_2015_poster.pdf">PDF of poster</a></li>
<li><a href="http://www.donarmstrong.com/ld/em2015/exosome_markers_per_med_sep_2015_slides.pdf">Slides of a talk on this work</a></li>
<li><a href="http://www.donarmstrong.com/ld/em2015/exosome_markers_per_med_sep_2015_gene_markers_table.txt">Gene-specific markers table</a></li>
</ul>