pages tagged rDon Armstronghttps://www.donarmstrong.com/tags/r/Don Armstrongikiwiki2017-03-23T20:57:21ZShrinking lists of gene names in Rhttps://www.donarmstrong.com/posts/shrinking_gene_names/2017-03-23T20:57:21Z2017-03-23T20:45:44Z
<p>I've been trying to finish a paper where I compare gene expression in
14 different placentas. One of the supplemental figures compares
median expression in gene trees across all 14 species, but because
tree ids like
<a href="http://www.ensembl.org/Multi/GeneTree/Image?gt=ENSGT00840000129673">ENSGT00840000129673</a>
aren't very expressive, and names like
"COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5,
COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2,
COL5A1, COL24A1" take up too much space, I wanted a function which could
collapse the gene names into something which uses bash glob syntax to
more succinctly list the gene names, like:
COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.</p>
<p>Thus, a crazy function which uses <code>lcprefix</code> from <code>Biostrings</code> and
some looping was born:</p>
<div class="highlight-r"><pre class="hl">collapse<span class="hl opt">.</span>gene<span class="hl opt">.</span>names <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>x<span class="hl opt">,</span>min<span class="hl opt">.</span>collapse<span class="hl opt">=</span><span class="hl num">2</span><span class="hl opt">) {</span>
<span class="hl slc">## longest common substring</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>x<span class="hl opt">) ||</span> <span class="hl kwd">length</span><span class="hl opt">(</span>x<span class="hl opt">)==</span><span class="hl num">0</span><span class="hl opt">) {</span>
<span class="hl kwd">return</span><span class="hl opt">(</span>as<span class="hl opt">.</span><span class="hl kwd">character</span><span class="hl opt">(</span><span class="hl kwb">NA</span><span class="hl opt">))</span>
<span class="hl opt">}</span>
x <span class="hl opt"><-</span> <span class="hl kwd">sort</span><span class="hl opt">(</span><span class="hl kwd">unique</span><span class="hl opt">(</span>x<span class="hl opt">))</span>
str_collapse <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>y<span class="hl opt">,</span>len<span class="hl opt">) {</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>len <span class="hl opt">==</span> <span class="hl num">1</span> <span class="hl opt">||</span> <span class="hl kwd">length</span><span class="hl opt">(</span>y<span class="hl opt">) <</span> <span class="hl num">2</span><span class="hl opt">) {</span>
<span class="hl kwd">return</span><span class="hl opt">(</span>y<span class="hl opt">)</span>
<span class="hl opt">}</span>
y<span class="hl opt">.</span>tree <span class="hl opt"><-</span>
<span class="hl kwd">gsub</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span><span class="hl str">"^(.{"</span><span class="hl opt">,</span>len<span class="hl opt">,</span><span class="hl str">"}).*$"</span><span class="hl opt">),</span><span class="hl str">"</span><span class="hl esc">\\</span><span class="hl str">1"</span><span class="hl opt">,</span>y<span class="hl opt">[</span><span class="hl num">1</span><span class="hl opt">])</span>
y<span class="hl opt">.</span>rem <span class="hl opt"><-</span>
<span class="hl kwd">gsub</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span><span class="hl str">"^.{"</span><span class="hl opt">,</span>len<span class="hl opt">,</span><span class="hl str">"}"</span><span class="hl opt">),</span><span class="hl str">""</span><span class="hl opt">,</span>y<span class="hl opt">)</span>
y<span class="hl opt">.</span>rem<span class="hl opt">.</span>prefix <span class="hl opt"><-</span>
<span class="hl kwd">sum</span><span class="hl opt">(</span><span class="hl kwd">combn</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">,</span><span class="hl num">2</span><span class="hl opt">,</span><span class="hl kwa">function</span><span class="hl opt">(</span>x<span class="hl opt">){</span>Biostrings<span class="hl opt">::</span><span class="hl kwd">lcprefix</span><span class="hl opt">(</span>x<span class="hl opt">[</span><span class="hl num">1</span><span class="hl opt">],</span>x<span class="hl opt">[</span><span class="hl num">2</span><span class="hl opt">])}) >=</span> <span class="hl num">2</span><span class="hl opt">)</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span><span class="hl kwd">length</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">) ></span> <span class="hl num">3</span> <span class="hl opt">&&</span>
y<span class="hl opt">.</span>rem<span class="hl opt">.</span>prefix <span class="hl opt">>=</span> <span class="hl num">2</span>
<span class="hl opt">) {</span>
y<span class="hl opt">.</span>rem <span class="hl opt"><-</span>
collapse<span class="hl opt">.</span>gene<span class="hl opt">.</span><span class="hl kwd">names</span><span class="hl opt">(</span>y<span class="hl opt">.</span>rem<span class="hl opt">,</span>min<span class="hl opt">.</span>collapse<span class="hl opt">=</span><span class="hl num">1</span><span class="hl opt">)</span>
<span class="hl opt">}</span>
<span class="hl kwd">paste0</span><span class="hl opt">(</span>y<span class="hl opt">.</span>tree<span class="hl opt">,</span>
<span class="hl str">"{"</span><span class="hl opt">,</span><span class="hl kwd">paste</span><span class="hl opt">(</span>collapse<span class="hl opt">=</span><span class="hl str">","</span><span class="hl opt">,</span>
y<span class="hl opt">.</span>rem<span class="hl opt">),</span><span class="hl str">"}"</span><span class="hl opt">)</span>
<span class="hl opt">}</span>
i <span class="hl opt"><-</span> <span class="hl num">1</span>
ret <span class="hl opt"><-</span> <span class="hl kwb">NULL</span>
<span class="hl kwa">while</span> <span class="hl opt">(</span>i <span class="hl opt"><=</span> <span class="hl kwd">length</span><span class="hl opt">(</span>x<span class="hl opt">)) {</span>
col<span class="hl opt">.</span>pmin <span class="hl opt"><-</span>
<span class="hl kwd">pmin</span><span class="hl opt">(</span><span class="hl kwd">sapply</span><span class="hl opt">(</span>x<span class="hl opt">,</span>Biostrings<span class="hl opt">::</span>lcprefix<span class="hl opt">,</span>x<span class="hl opt">[</span>i<span class="hl opt">]))</span>
collapseable <span class="hl opt"><-</span>
<span class="hl kwd">which</span><span class="hl opt">(</span>col<span class="hl opt">.</span>pmin <span class="hl opt">></span> min<span class="hl opt">.</span>collapse<span class="hl opt">)</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span><span class="hl kwd">length</span><span class="hl opt">(</span>collapseable<span class="hl opt">) ==</span> <span class="hl num">0</span><span class="hl opt">) {</span>
ret <span class="hl opt"><-</span> <span class="hl kwd">c</span><span class="hl opt">(</span>ret<span class="hl opt">,</span>x<span class="hl opt">[</span>i<span class="hl opt">])</span>
i <span class="hl opt"><-</span> i<span class="hl opt">+</span><span class="hl num">1</span>
<span class="hl opt">}</span> <span class="hl kwa">else</span> <span class="hl opt">{</span>
ret <span class="hl opt"><-</span> <span class="hl kwd">c</span><span class="hl opt">(</span>ret<span class="hl opt">,</span>
<span class="hl kwd">str_collapse</span><span class="hl opt">(</span>x<span class="hl opt">[</span>collapseable<span class="hl opt">],</span>
<span class="hl kwd">min</span><span class="hl opt">(</span>col<span class="hl opt">.</span>pmin<span class="hl opt">[</span>collapseable<span class="hl opt">]))</span>
<span class="hl opt">)</span>
i <span class="hl opt"><-</span> <span class="hl kwd">max</span><span class="hl opt">(</span>collapseable<span class="hl opt">)+</span><span class="hl num">1</span>
<span class="hl opt">}</span>
<span class="hl opt">}</span>
<span class="hl kwd">return</span><span class="hl opt">(</span><span class="hl kwd">paste0</span><span class="hl opt">(</span>collapse<span class="hl opt">=</span><span class="hl str">","</span><span class="hl opt">,</span>ret<span class="hl opt">))</span>
<span class="hl opt">}</span>
</pre></div>
Adding a Table of Contents to PDFs from Rhttps://www.donarmstrong.com/posts/adding_toc_to_pdfs_in_R/2015-01-14T01:30:31Z2015-01-14T01:29:12Z
<p>I routinely generate very large PDFs from R which have hundreds (or
thousands) of pages, and navigating these pages can be very difficult.
Unfortunately, neither R's pdf() nor its cairopdf() drivers support
creating Table of Contents (or Index) while plots are being written
out. In the case of cairo, the underlying library doesn't
<a href="http://osdir.com/ml/lib.cairo/2005-08/msg00506.html">support it either</a>,
so this isn't something that can easily be added to R directly. I had
been thinking about sitting down for months and writing the support
into cairo and R's cairo package... but real life kept getting in the way.</p>
<p>Fast forward to a week ago, when I realized that <code>pdftk</code> does support
dumping the table of contents and updating the table of contents using
<code>dump_data_utf8</code> and <code>update_info_utf8</code>! Armed with that knowledge,
and a bit of hackery, we can save an index, and then update the pdf
once it's been closed.</p>
<p>The R code then looks like the following:</p>
<pre><code> ..device.set.up <- FALSE
..current.page <<- 0
save.bookmark <- function(text,bookmarks=list(),level=1,page=NULL) {
if (!..device.set.up) {
Cairo.onSave(device = dev.cur(),
onSave=function(device,page){
..current.page <<- page
})
..device.set.up <<- TRUE
}
if (missing(page)|| is.null(page)) {
page <- ..current.page+1
}
bookmarks[[length(bookmarks)+1]] <-
list(text=text,
level=level,
page=page)
return(bookmarks)
}
write.bookmarks <- function(pdf.file,bookmarks=list()) {
pdf.bookmarks <- ""
for (bookmark in 1:length(bookmarks)) {
pdf.bookmarks <-
paste0(pdf.bookmarks,
"BookmarkBegin\n",
"BookmarkTitle: ",bookmarks[[bookmark]]$text,"\n",
"BookmarkLevel: ",bookmarks[[bookmark]]$level,"\n",
"BookmarkPageNumber: ",bookmarks[[bookmark]]$page,"\n")
}
temp.pdf <- tempfile(pattern=basename(pdf.file))
temp.pdf.info <- tempfile(pattern=paste0(basename(pdf.file),"info_utf8"))
cat(file=temp.pdf.info,pdf.bookmarks)
system2("pdftk",c(pdf.file,'update_info_utf8',temp.pdf.info,'output',temp.pdf))
if (file.exists(temp.pdf)) {
file.rename(temp.pdf,pdf.file)
} else {
warning("unable to properly create bookmarks")
}
}
</code></pre>
<p>and can be used like so:</p>
<pre><code> cairopdf(file="testing.pdf")
bookmarks <- list()
bookmarks <- save.bookmark("First plot",bookmarks)
plot(1:5,6:10)
bookmarks <- save.bookmark("Second plot",bookmarks)
plot(6:10,1:5)
dev.off()
write.bookmarks("testing.pdf",bookmarks)
</code></pre>
<p>et voila. Bookmarks and a table of contents for PDFs.</p>
<p>This basic methodology can be extended to any language which writes
PDFs and does not have a built-in method for generating a Table of
Contents. Currently, the usage of <code>Cairo.onSave</code> is a horrible hack,
and may conflict with anything else which uses the onSave hook, but
hopefully R will report the current page number from Cairo in the
future.</p>
Plotting Embedded Bitmap in Vector Plot in Rhttps://www.donarmstrong.com/posts/embedded_raster_in_vector_plot/2014-02-25T05:02:00Z2014-02-25T05:02:00Z
<p>Recently, one of my collaborators complained that one of my plots took
forever to render on his machine. The particular plot in question had
a few thousand points, many of which were overlapping. Ideally, R
would be able to simplify the vector image which was drawn to avoid
drawing points which were occluded by other points, but this is
difficult to do properly, and R currently doesn't do it.</p>
<p>However, R is able to plot to a bitmap, and bitmap images have the
nice property of automatically handling this for you. Furthermore,
raster images have recently been made far less clunky in R, so it's
pretty easy to shove an arbitrary bitmap image anywhere. With
<code>dev.capture</code> in Cairo coupled with <code>grid.raster</code> in grid, we have
everything we need to solve this problem:</p>
<div class="highlight-r"><pre class="hl"><span class="hl kwd">require</span><span class="hl opt">(</span>grid<span class="hl opt">)</span>
<span class="hl kwd">require</span><span class="hl opt">(</span>Cairo<span class="hl opt">)</span>
start<span class="hl opt">.</span>rasterplot <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>width<span class="hl opt">=</span><span class="hl kwb">NULL</span><span class="hl opt">,</span>height<span class="hl opt">=</span><span class="hl kwb">NULL</span><span class="hl opt">) {</span>
x<span class="hl opt">.</span>y<span class="hl opt">.</span>ratio <span class="hl opt"><-</span> <span class="hl kwd">convertX</span><span class="hl opt">(</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"npc"</span><span class="hl opt">),</span><span class="hl str">"mm"</span><span class="hl opt">,</span>valueOnly<span class="hl opt">=</span><span class="hl kwb">TRUE</span><span class="hl opt">)/</span>
<span class="hl kwd">convertY</span><span class="hl opt">(</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"npc"</span><span class="hl opt">),</span><span class="hl str">"mm"</span><span class="hl opt">,</span>valueOnly<span class="hl opt">=</span><span class="hl kwb">TRUE</span><span class="hl opt">)</span>
width<span class="hl opt">.</span>points <span class="hl opt"><-</span> as<span class="hl opt">.</span><span class="hl kwd">numeric</span><span class="hl opt">(</span><span class="hl kwd">convertX</span><span class="hl opt">(</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"npc"</span><span class="hl opt">),</span><span class="hl str">"points"</span><span class="hl opt">))</span>
dpi <span class="hl opt"><-</span> as<span class="hl opt">.</span><span class="hl kwd">numeric</span><span class="hl opt">(</span><span class="hl kwd">convertX</span><span class="hl opt">(</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"inch"</span><span class="hl opt">),</span><span class="hl str">"point"</span><span class="hl opt">))</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>width<span class="hl opt">) &&</span> is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>height<span class="hl opt">)) {</span>
width <span class="hl opt"><-</span> <span class="hl num">1024</span>
<span class="hl opt">}</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>width<span class="hl opt">)) {</span>
width <span class="hl opt"><-</span> height<span class="hl opt">*</span>x<span class="hl opt">.</span>y<span class="hl opt">.</span>ratio
<span class="hl opt">}</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>is<span class="hl opt">.</span><span class="hl kwd">null</span><span class="hl opt">(</span>height<span class="hl opt">)) {</span>
height <span class="hl opt"><-</span> width<span class="hl opt">/</span>x<span class="hl opt">.</span>y<span class="hl opt">.</span>ratio
<span class="hl opt">}</span>
<span class="hl kwd">Cairo</span><span class="hl opt">(</span>width<span class="hl opt">=</span>width<span class="hl opt">,</span>height<span class="hl opt">=</span>height<span class="hl opt">,</span>dpi<span class="hl opt">=</span><span class="hl num">1024</span><span class="hl opt">/</span>width<span class="hl opt">.</span>points<span class="hl opt">*</span>dpi<span class="hl opt">,</span>file<span class="hl opt">=</span><span class="hl str">"/dev/null"</span><span class="hl opt">)</span>
<span class="hl opt">}</span>
stop<span class="hl opt">.</span>rasterplot <span class="hl opt"><-</span> <span class="hl kwa">function</span><span class="hl opt">(</span>plot<span class="hl opt">=</span><span class="hl kwb">TRUE</span><span class="hl opt">) {</span>
raster<span class="hl opt">.</span>image <span class="hl opt"><-</span> dev<span class="hl opt">.</span><span class="hl kwd">capture</span><span class="hl opt">(</span>native<span class="hl opt">=</span><span class="hl kwb">TRUE</span><span class="hl opt">)</span>
dev<span class="hl opt">.</span><span class="hl kwd">off</span><span class="hl opt">()</span>
<span class="hl kwa">if</span> <span class="hl opt">(</span>plot<span class="hl opt">) {</span>
grid<span class="hl opt">.</span><span class="hl kwd">raster</span><span class="hl opt">(</span>raster<span class="hl opt">.</span>image<span class="hl opt">,</span>width<span class="hl opt">=</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"npc"</span><span class="hl opt">),</span>height<span class="hl opt">=</span><span class="hl kwd">unit</span><span class="hl opt">(</span><span class="hl num">1</span><span class="hl opt">,</span><span class="hl str">"npc"</span><span class="hl opt">))</span>
<span class="hl kwd">return</span><span class="hl opt">(</span><span class="hl kwd">invisible</span><span class="hl opt">())</span>
<span class="hl opt">}</span> <span class="hl kwa">else</span> <span class="hl opt">{</span>
<span class="hl kwd">return</span><span class="hl opt">(</span>raster<span class="hl opt">.</span>image<span class="hl opt">)</span>
<span class="hl opt">}</span>
<span class="hl opt">}</span>
</pre></div>
<p>Now we can do the following:</p>
<div class="highlight-r"><pre class="hl"><span class="hl kwd">pdf</span><span class="hl opt">(</span>file<span class="hl opt">=</span><span class="hl str">"raster.pdf"</span><span class="hl opt">)</span>
start<span class="hl opt">.</span><span class="hl kwd">rasterplot</span><span class="hl opt">()</span>
<span class="hl kwd">print</span><span class="hl opt">(</span><span class="hl kwd">xyplot</span><span class="hl opt">(</span>y<span class="hl opt">~</span>x<span class="hl opt">,</span>
data<span class="hl opt">=</span>data<span class="hl opt">.</span><span class="hl kwd">frame</span><span class="hl opt">(</span>y<span class="hl opt">=</span><span class="hl kwd">rnorm</span><span class="hl opt">(</span><span class="hl num">1</span>E8<span class="hl opt">),</span>x<span class="hl opt">=</span><span class="hl kwd">rnorm</span><span class="hl opt">(</span><span class="hl num">1</span>E8<span class="hl opt">))))</span>
stop<span class="hl opt">.</span><span class="hl kwd">rasterplot</span><span class="hl opt">()</span>
dev<span class="hl opt">.</span><span class="hl kwd">off</span><span class="hl opt">()</span>
</pre></div>
<p>and our PDF will contain a raster image, and will load in seconds
instead of taking forever to plot the file.</p>
Bug Reporting Rate in Debianhttps://www.donarmstrong.com/posts/bug_reporting_rate/2012-10-10T02:53:25Z2012-10-10T01:04:53Z
<p><a href="http://www.perrier.eu.org/weblog/2012/10/09#690000">Christian's most recent blog post</a>
got me wondering if the decline in the bug reporting rate in Debian
was something new, or something which often happened during releases.
So, lets try to figure that out. In the BTS, when a bug report is
filed, the report is written to a file called <code>bugnum.report</code>, and
then not touched from then on. Let's look at the modification date on
that file to see when each bug was filed; and since we're going to
plot this, lets only look at bugs ending in 00:</p>
<pre><code>stat -c '%n %Y' /srv/bugs.debian.org/spool/{archive,db-h}/00/*.report > ~/reporting_rate.txt
</code></pre>
<p>Now, lets get the data into R and plot it.
[For clarity, I'm not showing the R code, but it's available in the source code for this post.]</p>
<p><img class="sweavealike" src="https://www.donarmstrong.com/posts/bug_reporting_rate/66bc5382522cf6c8c072346728a71a14.png" /></p>
<p>From the plot (Bugs reported per second over time with a red loess fit
line), it looks like we do see a decline during certain periods.
However, there's an even more alarming trend of a decrease in bug
reporting in Debian which has been happening since 2006. (Note that
I've truncated the y scale significantly; there are periods in Debian
where the bug rate is astronomically high, usually corresponding to
mass bug filings; I've also limited the plot to data from 2003 on, as
I have to clean up that data significantly before I can plot it like
this.)</p>
<p>Not sure exactly what that means, but it is troubling.</p>
Plotting Ethnic Regionshttps://www.donarmstrong.com/posts/plotting_ethnic_regions/2012-10-08T18:42:11Z2012-10-08T18:41:19Z
<p>I was asked over the weekend how to plot SNPs which are associated
with specific ethnicities; the following code is a really quick stab
at the problem using the grid graphics engine in R.</p>
<pre><code>> require(grid)
> snp.position <- 1:10
> snp.integers <- sample.int(5,size=length(snp.position),replace=TRUE)
> ### the position is really the midpoint of the range
> snp.midpoint <- snp.position
> ### start of the SNP; we're assuming it should start at 0.
> snp.start <- c(0,snp.position[-1]-diff(snp.position)/2)
> ### stop of the snp; we're assuming it should stop at the last snp position
> snp.stop <- c(snp.position[-length(snp.position)]+diff(snp.position)/2,
> snp.position[length(snp.position)])
> snp.width <- snp.stop - snp.start
>
> ### these are the colors
> possible.colors <- c("red","blue","yellow","green","purple")
> snp.colors <- possible.colors[snp.integers]
>
> ### this sets up the viewport that we'll plot into
> pushViewport(viewport(height=unit(1,"npc")-unit(7,"lines"),
> width=unit(1,"npc")-unit(7,"lines")
> ))
> pushViewport(dataViewport(xscale=range(c(0,snp.stop)),yscale=c(0,1)))
>
> ### this draws a rectangle around the graph
> grid.rect()
> ### this sets up the x axis
> grid.xaxis()
> ### this labels the X axis
> grid.text("Position on Chromosome",y=unit(-2.5,"lines"))
>
> ### this draws all of the boxes corresponding to each SNP in the
> ### appropriate color
> grid.rect(x=unit(snp.start,"native"),
> width=unit(snp.width,"native"),
> y=unit(0.5,"native"),
> height=unit(0.25,"native"),just=c("left","center"),
> gp=gpar(col=snp.colors,fill=snp.colors))
>
> ### this pops the viewport
> popViewport(2)
</code></pre>
<p><img class="sweavealike" src="https://www.donarmstrong.com/posts/plotting_ethnic_regions/162c87ec69f80a181809a8a71374cbed.png" /></p>
Unnecessary Use of data.frame is Slowhttps://www.donarmstrong.com/posts/unnecessary_data_frame_slow/2012-04-16T18:32:27Z2012-04-16T18:32:27Z
<p>I've been working for a while on a reasonably large Genome-Wide
Association Study dataset which has lead me through various
interesting parts of handing large datasets in R. This dataset is
approximately 320,000 rows by 5000 columns. After getting Rmpi
working, and handling the dataset by row so I don't run out of memory,
I've managed to get pretty decent performance. However, one small
section of the code seemed to be taking forever to run.</p>
<p>It turns out that assigning data to a data.frame by row is incredibly
slow in R. Thus, a section of my code which should have taken
microseconds was taking tenths of seconds, and threatening to run all
week. Using a matrix instead (which is basically what I want anyway)
and converting to a data.frame at the very end makes the code multiple
orders of magnitude faster.</p>
<p>Moral of the story? Don't use data.frame unnecessarily.</p>
Introducing Sweavealikehttps://www.donarmstrong.com/posts/introducing_sweavealike/2012-03-22T21:06:37Z2012-03-22T21:06:37Z
<p>I use <a href="http://www.r-project.org">R</a> a lot. It's one of the primary
tools I use in my day job as a scientist analyzing large datasets. If
you use <a href="http://ctan.org">LaTeX</a> with R (as I often do), you probably
use Sweave to interleave R output and figures with your text
describing those figures using the noweb method of literate
programming.</p>
<p><a href="http://git.donarmstrong.com/?p=ikiwiki_plugins.git;a=blob;f=sweavealike.pm;hb=HEAD">Sweavealike</a>
is a plugin for <a href="http://ikiwiki.info">IkiWiki</a> that tries
to do some of the useful things for IkiWiki that sweave does for R and
LaTeX. </p>
<p>You use it like the following:</p>
<pre><code>[[!sweavealike echo=1 code="""
a <- 1
a <- a + 10
print(a)
"""]]
</code></pre>
<p>which produces this result when run:</p>
<pre><code>> a <- 1
> a <- a + 10
> print(a)
[1] 11
</code></pre>
<p>You can also generate figures with it:</p>
<pre><code>[[!sweavealike fig=1 echo=1 results="hide" code="""
plot(1:10,(1:10)^2,xlab="x",ylab=expression(x^2),main="Example Figure")
"""]]
> plot(1:10,(1:10)^2,xlab="x",ylab=expression(x^2),main="Example Figure")
</code></pre>
<p><img class="sweavealike" src="https://www.donarmstrong.com/posts/introducing_sweavealike/b8af60e27c8e6da59ab43d7a543e7294.png" /></p>
<p>The plugin itself uses the neat Statistics::R perl module to handle
all of the heavy lifting. I personally plan on using this plugin to
help write some more entries in my learning R series of posts that I'm
beginning to work on. Hopefully I'll find and fix most of the bugs as
I embark on that process so anyone else who uses the plugin won't, but
feel free to e-mail me if something isn't working as it should.</p>
<p>Finally, you shouldn't run this plugin on a publicly editable
IkiWiki instance, because that would be a trivial local user exploit
as R can run arbitrary code, read and write to arbitrary files,
exhaust all memory, etc.</p>
Debian R Archivehttps://www.donarmstrong.com/posts/debian_r_archive/2012-03-02T18:29:28Z2012-03-02T18:26:53Z
<p>For those of you were were in the various Debian infrastructure
channels, you might have noticed that I was playing around with
wanna-build, dak, sbuild, and buildd and friends.
[Thanks to everyone who answered questions, btw.] Over the past week,
I've been building most of <a href="http://cran.r-project.org">CRAN</a>,
<a href="http://bioconductor.org">Bioconductor</a>, and
<a href="http://www.omegahat.org">omegahat</a> for unstable, amd64. I plan to
build the same set of packages for i386, and will start a build
shortly for stable as well. This effort builds on top of Charles
Blundell and Dirk Eddelbuettel's
<a href="http://debian.cran.r-project.org/">cran2deb</a>, which does most of the
heavy lifting.</p>
<p>If you're like me, and use lots of different R packages, or already
use some of the R packages available on the previous build, you can
simply point your sources.list to the [http://debian-r.debian.net]
archive, load the appropriate GPG key, and away you go. I have a bit
more information available <a href="https://www.donarmstrong.com/r/debian_r/">here</a> and I will try to keep
that page updated as I build other architectures and build out for
stable.</p>