Tuesday, May 17, 2011

R tips from around the web...

The great thing of R, is that the number of available resources on the web is increasing dramatically. If you cannot afford expensive books, or you are looking straight questions or some tutorials you will find almost anything you need. Here's my short list of best place to learn and explore R other than the CRAN website:

Searching, Exploring and Procrastinating

R Seek: You probably use google while searching for specific commands. But this search engine is much more powerful. It looks for R related websites, and you can also filter to "functions", "blogs", "books" and etc. Awesome!

R Graph Gallery : This is unfortunately not updated anymore, but you can find fancy graphs and codes which can inspire you for something even better.

R Graphical Manual: Visual learning is always the best. This is probably the principle of this website, which allows you to search for functions and packages from the standpoint of their graphical outputs.

R Reference Card : This is actually from CRAN, but it's an extremely handy reference card that you can print and put on your office wall to show people how geeky you are. Most of them are daily uses functions that you'll probably learn soon anyway, but it's extremely useful if you've just got started.

R Bloggers: New research method, elaborated codes, inspiring plots, novel ideas on everything related to R cannot be found in books. The cutting edge is in the blogosphere, where hundreds of peer R users are perhaps tackling the same issue you are working one. This websites put together almost 200 bloggers from around the world. Just put in you're RSS feed, and you'll get about 10 daily posts with new ideas. Inspiring.


Learning and Tutorials

Quick-R. Nice and intuitive website on R, covering wide range of topics from analysis to plotting

Quantitative Archaeology Wiki . This is still under development but has some sections on R specifically designed for archaeologists.

There are of course plenty of other tutorials around the web. If you want use GRASS and R together you can find a nice introductory tutorial on geostats here. A number of academics also share their courses online so you can sneak in and learn some great stuff like this one. My favourite course isRichard McElreath's  Statistical Thinking in Evolutionary Ecology Course. This is simply mind-opening.

Monday, May 9, 2011

Multiple Y-axis in a R plot

I often have to to plot multiple time-series with different scale of values for comparative purposes, and although placing them in different rows are useful, placing on a same graph is still useful sometimes...
I searched a bit about this, and found some nice suggestions for 2 Y-axis, but haven't found anything for more 2+. So here's my solution:



#Create Dataset

time<-seq(7000,3400,-200)
pop<-c(200,400,450,500,300,100,400,700,830,1200,400,350,200,700,370,800,200,100,120)
grp<-c(2,5,8,3,2,2,4,7,9,4,4,2,2,7,5,12,5,4,4)
med<-c(1.2,1.3,1.2,0.9,2.1,1.4,2.9,3.4,2.1,1.1,1.2,1.5,1.2,0.9,0.5,3.3,2.2,1.1,1.2)


#Define Margins. The trick is to use give as much space possible on the left margin (second value)
par(mar=c(5, 12, 4, 4) + 0.1)





#Plot the first time series. Notice that you don't have to draw the axis nor the labels


plot(time, pop, axes=F, ylim=c(0,max(pop)), xlab="", ylab="",type="l",col="black", main="",xlim=c(7000,3400))
points(time,pop,pch=20,col="black")
axis(2, ylim=c(0,max(pop)),col="black",lwd=2)
mtext(2,text="Population",line=2)





#Plot the second time series. The command par(new=T) is handy here. If you just need to plot two timeseries, you could also use the right vertical axis as well. In that case you have to substitute "2" with "4" in the functions axis() and mtext(). Notice that in both functions lines is increased so that the new axis and its label is placed to the left of the first one. You don't need to increase the value if you use the right vertical axis.


par(new=T)
plot(time, med, axes=F, ylim=c(0,max(med)), xlab="", ylab="", 
type="l",lty=2, main="",xlim=c(7000,3400),lwd=2)
axis(2, ylim=c(0,max(med)),lwd=2,line=3.5)
points(time, med,pch=20)
mtext(2,text="Median Group Size",line=5.5)


#Plot the third time series. Again the line parameter are both further increased.


par(new=T)
plot(time, grp, axes=F, ylim=c(0,max(grp)), xlab="", ylab="", 
type="l",lty=3, main="",xlim=c(7000,3400),lwd=2)
axis(2, ylim=c(0,max(grp)),lwd=2,line=7)

points(time, grp,pch=20)
mtext(2,text="Number of Groups",line=9)


#We can now draw the X-axis, which is of course shared by all the three time-series.


axis(1,pretty(range(time),10))
mtext("cal BP",side=1,col="black",line=2)


#And then plot the legend.


legend(x=7000,y=12,legend=c("Population","Median Group Size","Number of Groups"),lty=c(1,2,3))




Friday, May 6, 2011

Running an R-based ABM in parallel on a Multicore Desktop

I've been running for the last couple of months a lot of simulations written in R on legion cluster  here at UCL. But I still do many things on our quad-core server here at the institute, and parallelising the simulations is extremely handy.

So here's my recipe:

1)Firstly, the main ingredient,  you need an R function for your model where you have a set of parameters to sweep, something like:

Rsim(InitialPopulation=100,K=5000,ReproductiveRate=0.2,TransmissionRate=0.4,MutationRate=0.1)
{
#Put Something amazing and cool here
# ...


return(list(pop=pop,fitness=fitness,diversity=diversity))
}

The function can return everything, in my case this will be a list with three vectors, say $pop, $fitness and $diversity.

2)For the actual parallelisation, you need the following two packages: foreach and doMC and their dependencies. After loading both packages you specify the number of cores you want to use, as follows:

library(foreach)
library(doMC)

cores=4  
registerDoMC(cores)


3)Now, at this point I create a sweepSpace, which is basically a data.frame with all the combinations of parameters that I want to explore, repeated n times, with n being the number of runs. So in my case I want to sweep the MutationRate, the TransmissionRate and ReproductiveRate for 100 runs.


runs=50
mutationSweep=c(0.0001,0.001,0.01,0.1)
reproductionSweep=c(0.1,0.2,0.3)
transmissionSweep=c(0,0.5,1.0)


sweepSpace=expand.grid(runs=1:runs,m= mutationSweep,r= reproductionSweep,z= transmissionSweep)


4)Now we are ready to run the simulation with the following code


res=foreach(a=sweepSpace$runs,b=sweepSpace$m,c=sweepSpace$r,d=weepSpace$z)%dopar%{Rsim(InitialPopulation=100,K=5000,ReproductiveRate=c,TransmissionRate=d, TransmissionRate =b)}


The foreach() function will read each row of sweepSpace and assign the values of each column  to the letters, and then use these for the Rsim() function. The resulting object is a list with a length equal to the number of rows of sweepSpace (thus in this case 500), containing the results of Rsim() in each of them. The parallelisation allows  the simultaneous but independent run for all combinations and the storage of the results which follows the order in the sweepSpace. So if you have a specific parameter combination which takes longer to compute, the foreach() function allows to carry on, in the meanwhile, all the other combinations and put the results in the correct place.

5)Get and plot your results!

The sweepSpace can also be used as map to go through the resulting list.
For instance, if you want to create a function which allows the plotting of all the population dynamics on a single plot, you can write something as the following code:

plotSimPop=function(res,sweepSpace, ReproductiveRate, TransmissionRate, TransmissionRate)
{
#First we want to retrieve the index of list having the specific parameters that we've requested:


tmp=res[which(sweepSpace$r==ReproductiveRate&sweepSpace$m== MutationRate&sweepSpace$z==TransmissionRate)]


#we then retrieve the number of runs, and create a matrix with our results, with row corresponding to the number of runs, and columns to the number of timesteps (here set to 1000):


nruns=length(tmp)
resMat=matrix(0,nruns,1000)


#We then loop through the results collecting and storing the population dynamics in our matrix


for (x in 1:nruns)
{
resMat[x,]=res[[x]]$pop
}


#And we can now plot the results, with the average dynamics in red:
average=apply(resMat,2,mean,na.rm=TRUE)
plot(resMat[1,],type="n",xlim=c(1,1000),ylim=c(min(resMat),max(resMat),xlab="Time",ylab="Population")
for (x in 1:nruns)
{
lines(resMat[x,],col="grey")
}
lines(average,col="red")
}