Previously we looked at how you can use functions to simplify your code. Ideally you have a function that performs a single operation, and now you want to use it many times to do the same operation on lots of different data. The naive way to do that would be something like this:
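Something like this, say (a made-up function applied to made-up data, just to illustrate the pattern):

```r
# A single-purpose function:
squared.deviation <- function(x) (x - mean(x))^2

# ...called by hand, once per dataset:
res1 <- squared.deviation(rnorm(10))
res2 <- squared.deviation(rnorm(10))
res3 <- squared.deviation(rnorm(10))
```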
But this isn’t very nice. Yes, by using a function, you have reduced a substantial amount of repetition. That is nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs. When it comes to repetition, well, just don’t.
The nice way of repeating elements of code is to use a loop of some sort. A loop is a coding structure that reruns the same bit of code over and over, but with only small fragments differing between runs. In R there is a whole family of looping functions, each with their own strengths.
First, it is good to recognise that most operations that involve looping are instances of the split-apply-combine strategy (this term and idea come from the prolific Hadley Wickham, who coined the term in this paper). You start with a bunch of data. Then you Split it up into many smaller datasets, Apply a function to each piece, and finally Combine the results back together.
Some data arrives already in its pieces - e.g. output files from a leaf scanner or temperature logger. Your job is then to analyse each bit, and put them together into a larger data set.
Sometimes the combine phase means making a new data frame, other times it mightmean something more abstract, like combining a bunch of plots in a report.
Either way, the challenge for you is to identify the pieces that remain the samebetween different runs of your function, then structure your analysis aroundthat.
Ok, you got me, we are starting with `for` loops. But not in the way you think.
When you mention looping, many people immediately reach for `for`. Perhaps that’s because, like me, they are already familiar with these looping constructs from other languages, like BASIC, Python, Perl, C, C++ or MATLAB. While `for` is definitely the most flexible of the looping options, we suggest you avoid it wherever you can, for the following two reasons:
- It is not very expressive, i.e. it takes a lot of code to do what you want.
- It permits you to write horrible code, like this example from my earlier work:
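Something in this spirit (a made-up stand-in, not the original code):

```r
# Hypothetical example of the kind of for loop to avoid: global variables,
# manual indexing and plotting details all tangled together.
cols <- c("red", "blue", "green")
xs <- list(1:10, 1:10, 1:10)
ys <- list(rnorm(10), rnorm(10), rnorm(10))
plot(NA, xlim=c(1, 10), ylim=c(-4, 4), xlab="x", ylab="y")
for (i in 1:length(xs)) {
  x.i <- xs[[i]]
  y.i <- ys[[i]]
  points(x.i, y.i, col=cols[i])
}
```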
The main problems with this code are that
- it is hard to read
- all the variables are stored in the global scope, which is dangerous.
All it’s doing is making a plot! Compare that to something like this:
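A sketch of the nicer version (`make.plot` and `datasets` are made-up names standing in for the missing code):

```r
datasets <- list(a=rnorm(10), b=rnorm(10), c=rnorm(10))  # stand-in data
make.plot <- function(y) plot(y, type="l")               # all detail lives here
invisible(lapply(datasets, make.plot))
```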
That’s much nicer! It’s obvious what the loop does, and no new variables are created. Of course, for the code to work, we need to define the function which actually makes our plot, but having all that detail off in a function has many benefits. Most of all it makes your code more reliable and easier to read. Of course you could do this easily with for loops too:
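With the same made-up plotting function as before, the `for` version is:

```r
datasets <- list(a=rnorm(10), b=rnorm(10), c=rnorm(10))  # stand-in data
make.plot <- function(y) plot(y, type="l")
for (d in datasets)
  make.plot(d)
```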
but the temptation with `for` loops is often to cram a little extra code in each iteration, rather than stepping back and thinking about what you’re trying to achieve.

So our reason for avoiding `for` loops, and the similar functions `while` and `repeat`, is that the other looping functions, like `lapply`, demand that you write nicer code, so that’s what we’ll focus on first.
There are several related functions in R which allow you to apply some function to a series of objects (e.g. vectors, matrices, data frames or files). They include:

- `lapply`
- `sapply`
- `tapply`
- `aggregate`
- `mapply`
- `apply`
Each repeats a function or operation on a series of elements, but they differ in the data types they accept and return. What they all have in common is that the order of iteration is not important. This is crucial: if each iteration is independent, then you can cycle through them in whatever order you like. Generally, we argue that you should only use the generic looping constructs `for`, `while`, and `repeat` when the order of operations is important. Otherwise, reach for one of the apply tools.
`lapply` applies a function to each element of a list (or vector), collecting results in a list. `sapply` does the same, but will try to simplify the output if possible.
Lists are a very powerful and flexible data structure that few people seem to know about. Moreover, they are the building block for other data structures, like `data.frame` and `matrix`. To access elements of a list, you use the double square bracket, for example `X[[4]]` returns the fourth element of the list `X`. If you don’t know what a list is, we suggest you read more about them, before you proceed.
Basic syntax
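The basic pattern is a single call (with a toy list and function, so there is something to run):

```r
X <- list(1:3, 4:6, 7:9)  # any list or vector
f <- sum                  # any function of one element
result <- lapply(X, f)    # a list, same number of elements as X
```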
Here `X` is a list or vector, containing the elements that form the input to the function `f`. This code will also return a list, stored in `result`, with the same number of elements as `X`.
Usage
`lapply` is great for building analysis pipelines, where you want to repeat a series of steps on a large number of similar objects. The way to do this is to have a series of `lapply` statements, with the output of one providing the input to another:
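For example (a hypothetical two-step pipeline, using stand-in data rather than real files):

```r
# Each lapply's output is the next one's input:
raw   <- list(data.frame(x=c(1, 2, NA)), data.frame(x=4:6))  # stand-in data
clean <- lapply(raw, na.omit)      # step 1: drop missing rows
means <- lapply(clean, colMeans)   # step 2: summarise each piece
```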
The challenge is to identify the parts of your analysis that stay the same and those that differ for each call of the function. The trick to using `lapply` is to recognise that only one item can differ between different function calls.
It is possible to pass in a bunch of additional arguments to your function, but these must be the same for each call of your function. For example, let’s say we have a function `test` which takes the path of a file, loads the data, and tests it against some hypothesised value H0. We can run the function on the file “myfile.csv” as follows.
We could then run the test on a bunch of files using lapply:
But notice that in this example, the only thing that differs between the runs is a single number in the file name. So we could save ourselves some typing by adding an extra step to generate the file names:
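A sketch of the whole sequence, with a made-up `test` function standing in for the real one (the file-reading calls are left commented, since the files don’t exist here):

```r
# Hypothetical stand-in: load a file, test the data against H0.
test <- function(filename, H0)
  mean(read.csv(filename)$x) - H0

# A single file:
# test("myfile.csv", H0=0)

# Several files, written out by hand:
# results <- lapply(c("myfile1.csv", "myfile2.csv", "myfile3.csv"), test, H0=0)

# Generating the file names instead:
filenames <- sprintf("myfile%d.csv", 1:3)
# results <- lapply(filenames, test, H0=0)
```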
The nice thing about that piece of code is that it would extend as long as we wanted, to 10,000,000 files, if needed.
Example - plotting temperature for many sites using open weather data
Let’s look at the weather in some eastern Australian cities over the last couple of days. The website openweathermap.com provides access to all sorts of neat data, lots of it essentially real time. We’ve parcelled up some on the nicercode website to use. In theory, this sort of analysis script could use the weather data directly, but we don’t want to hammer their website too badly. The code used to generate these files is here.
We want to look at the temperatures over the last few days for the cities
The data are stored in a URL scheme where the Sydney data is at http://nicercode.github.io/guides/repeating-things/data/Sydney.csv and so on.
The URLs that we need are therefore:
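Assuming a vector of city names (the exact list isn’t shown here, so this one is a guess), the URLs can be built with `sprintf`:

```r
cities <- c("Sydney", "Brisbane", "Canberra", "Melbourne")  # assumed list
urls <- sprintf("http://nicercode.github.io/guides/repeating-things/data/%s.csv",
                cities)
```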
We can write a function to download a file if it does not exist:
and then run that over the urls:
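A sketch of such a function and the call that runs it over the URLs (the actual download is left commented so nothing hits the network here):

```r
# Download a url into path, but only if we don't already have it:
download.maybe <- function(url, refetch=FALSE, path=".") {
  dest <- file.path(path, basename(url))
  if (refetch || !file.exists(dest))
    download.file(url, dest)
  dest
}

path <- "data"
dir.create(path, showWarnings=FALSE)
# files <- sapply(urls, download.maybe, path=path)  # needs network access
```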
Notice that we never specify the order in which the files are downloaded; we just say “apply this function (`download.maybe`) to this list of urls”. We also pass the `path` argument to every function call. So it was as if we’d written
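something like this, written out by hand (with stand-in definitions so the expansion runs on its own):

```r
# Stand-ins; in the real script these come from the previous steps:
download.maybe <- function(url, path=".") file.path(path, basename(url))
urls <- sprintf("http://example.com/%s.csv",
                c("Sydney", "Brisbane", "Canberra", "Melbourne"))
path <- "."

files <- c(download.maybe(urls[1], path=path),
           download.maybe(urls[2], path=path),
           download.maybe(urls[3], path=path),
           download.maybe(urls[4], path=path))
```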
but much less boring, and scalable to more files.
The first column, `time`, of each file is a string representing date and time, which needs processing into R’s native time format (dealing with times in R (or frankly, in any language) is a complete pain). In a real case, there might be many steps involved in processing each file. We can make a function like this:
that reads in a file given a filename, and then apply that function to each filename using `lapply`:
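A sketch of both pieces (the column name `time` is taken from the text above; the `lapply` call is commented because the files aren’t present here):

```r
# Read one weather file and convert time to R's native format:
load.file <- function(filename) {
  d <- read.csv(filename, stringsAsFactors=FALSE)
  d$time <- as.POSIXct(d$time)
  d
}

# data <- lapply(files, load.file)  # one data.frame per file
# names(data) <- cities             # so we can index by city name
```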
We now have a list, where each element is a `data.frame` of weather data:
We can use `lapply` or `sapply` to easily ask the same question of each element of this list. For example, how many rows of data are there?
What is the hottest temperature recorded by city?
or, estimate the autocorrelation function for each set:
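With stand-in data (the real list would come from the files above; a `temp` column is assumed), all three questions are one-liners:

```r
# Stand-in for the list of weather data.frames:
data <- list(Sydney    = data.frame(temp=c(12, 15, 11)),
             Melbourne = data.frame(temp=c(9, 13, 8)))

sapply(data, nrow)                                         # rows per city
sapply(data, function(x) max(x$temp))                      # hottest temperature
acfs <- lapply(data, function(x) acf(x$temp, plot=FALSE))  # autocorrelation
```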
I find that `for` loops can be easier for plotting data, partly because there is nothing to collect (or combine) at each iteration.
Parallelising your code
Another great feature of `lapply` is that it makes it really easy to parallelise your code. All computers now contain multiple CPUs, and these can all be put to work using the great multicore package.
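A sketch: the multicore functionality also ships in base R’s parallel package, whose `mclapply` is a drop-in replacement for `lapply` (on Windows it falls back to serial execution):

```r
library(parallel)
# Identical in meaning to lapply(X, sum), but spread across available cores:
result <- mclapply(list(1:3, 4:6), sum)
```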
In the case above, we had naturally “split” data; we had a vector ofcity names that led to a list of different data.frames of weatherdata. Sometimes the “split” operation depends on a factor. Forexample, you might have an experiment where you measured the size ofplants at different levels of added fertiliser - you then want to knowthe mean height as a function of this treatment.
However, we’re actually going to use some data on ratings of Seinfeld episodes, taken from the [Internet Movie Database](http://www.reddit.com/r/dataisbeautiful/comments/1g7jw2/seinfeld_imdb_episode_ratings_oc/).
Columns are Season (number), Episode (number), Title (of the episode), Rating (according to IMDb) and Votes (to construct the rating).
Make sure it’s sorted sensibly
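Assuming the data frame is called `dat` (a guess, consistent with the `aggregate` examples later), that might look like:

```r
dat <- data.frame(Season=c(2, 1, 1), Episode=c(1, 2, 1),
                  Rating=c(8.1, 7.8, 7.5))       # stand-in rows
dat <- dat[order(dat$Season, dat$Episode), ]     # sort by season, then episode
```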
Biologically, this could be Site / Individual / ID / Mean size / Things measured.
Hypothesis: Seinfeld used to be funny, but got progressively less good as it became too mainstream. Or, does the mean episode rating per season decrease?
Now, we want to calculate the average rating per season:
and so on until:
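Written out by hand (with stand-in data, since the real ratings file isn’t loaded here), that looks like:

```r
dat <- data.frame(Season=rep(1:9, each=2), Rating=runif(18, 6, 9))  # stand-in
mean(dat$Rating[dat$Season == 1])
mean(dat$Rating[dat$Season == 2])
# ...and so on, nine times, until:
mean(dat$Rating[dat$Season == 9])
```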
As with most things, we could automate this with a for loop:
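For example (again with stand-in data):

```r
dat <- data.frame(Season=rep(1:9, each=2), Rating=runif(18, 6, 9))  # stand-in
seasons <- sort(unique(dat$Season))
rating.by.season <- numeric(length(seasons))
for (i in seq_along(seasons))
  rating.by.season[i] <- mean(dat$Rating[dat$Season == seasons[i]])
```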
That’s actually not that horrible to do. But it could be nicer. We first split the ratings by season:
Then use `sapply` to loop over this list, computing the mean:
Then if we wanted to apply a different function (say, compute theper-season standard error) we could just do:
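All three steps, sketched with stand-in data:

```r
dat <- data.frame(Season=rep(1:9, each=2), Rating=runif(18, 6, 9))  # stand-in
ratings <- split(dat$Rating, dat$Season)    # one vector of ratings per season
sapply(ratings, mean)                       # mean rating per season

se <- function(x) sqrt(var(x) / length(x))  # per-season standard error
sapply(ratings, se)
</imports>
```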
But there’s still repetition there. Let’s abstract that away a bit.
Suppose we want a:

1. response variable (like Rating was)
2. grouping variable (like Season was)
3. function to apply to each level
This just writes out exactly what we had before
We can compute the mean rating by season again:
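A sketch of that abstraction (the function name is made up) and its use:

```r
# Apply `func` to `response` within each level of `group`:
summarise.by.group <- function(response, group, func)
  sapply(split(response, group), func)

dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
summarise.by.group(dat$Rating, dat$Season, mean)
```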
which is the same as what we got before:
Of course, we’re not the first people to try this. This is exactly what the `tapply` function does (but with a few bells and whistles, especially around missing values, factor levels, additional arguments and multiple grouping factors at once).
So using `tapply`, you can do all the above manipulation in a single line.
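Like so (stand-in data again):

```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
tapply(dat$Rating, dat$Season, mean)
```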
There are a couple of limitations of `tapply`.
The first is that getting the season out of `tapply` is quite hard. We could do:
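something like this:

```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
rating.by.season <- tapply(dat$Rating, dat$Season, mean)
as.numeric(names(rating.by.season))   # numeric -> string -> numeric
```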
But that’s quite ugly, not least because it involves the conversionnumeric -> string -> numeric.
Better could be to use:
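perhaps the sorted unique seasons, which match the order `tapply` returns (a guess at the missing snippet):

```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
sort(unique(dat$Season))
```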
But that requires knowing what is going on inside of `tapply` (that unique levels are sorted and data are returned in that order).
I suspect that this approach:
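One such fool-proof (if ugly) version is to rebuild a data frame from the names (an illustration; the original snippet isn’t shown):

```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
tmp <- tapply(dat$Rating, dat$Season, mean)
data.frame(Season=as.numeric(names(tmp)), Rating=unname(tmp))
```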
is probably the most fool-proof, but it’s certainly not pretty.
However, the returned format is extremely flexible. If you do:
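for example, group by two factors at once, you get a matrix back (an illustration; the original example isn’t shown):

```r
dat <- expand.grid(Season=1:2, Disc=c("A", "B"))
dat$Rating <- c(8, 7, 9, 6)
tapply(dat$Rating, list(dat$Season, dat$Disc), mean)  # a 2x2 matrix
```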
The `aggregate` function provides a simplified interface to `tapply` that avoids this issue. It has two interfaces: the first is similar to what we used before, but the grouping variable now must be a list or data frame:
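For example:

```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
aggregate(dat$Rating, dat['Season'], mean)   # columns: Season, x
```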
(note that `dat['Season']` returns a one-column data frame). The column ‘x’ is our response variable, Rating, grouped by season. We can get its name included in the column names here by specifying the first argument as a `data.frame` too:
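```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
aggregate(dat['Rating'], dat['Season'], mean)   # columns: Season, Rating
```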
The other interface is the formula interface, which will be familiar from fitting linear models:
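```r
dat <- data.frame(Season=rep(1:3, each=2), Rating=c(8, 9, 7, 8, 6, 7))
aggregate(Rating ~ Season, data=dat, mean)
```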
This interface is really nice; we can get the number of votes here too.
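For example:

```r
dat <- data.frame(Season=rep(1:2, each=2),
                  Rating=c(8, 9, 7, 8), Votes=c(100, 120, 90, 110))
aggregate(cbind(Rating, Votes) ~ Season, data=dat, mean)
```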
If you have multiple grouping variables, you can write things like:
to apply a function to each pair of levels of `factor1` and `factor2`.
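The pattern looks like this (with hypothetical factor and response names):

```r
dat <- expand.grid(factor1=c("a", "b"), factor2=c("x", "y"))
dat$response <- c(1, 2, 3, 4)
aggregate(response ~ factor1 + factor2, data=dat, mean)
```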
This is great in Monte Carlo simulation situations. For example, suppose that you flip a fair coin n times and count the number of heads:
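A sketch of one such trial:

```r
# One trial: flip a fair coin n times, count the heads.
trial <- function(n)
  sum(runif(n) < 0.5)

trial(100)  # a different count each run
```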
You can run the trial a bunch of times:
and get a feel for the results. If you want to replicate the trial 100 times and look at the distribution of results, you could do:
and then you could plot these:
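Something like this, using `replicate` and `hist` (with the same made-up `trial` function):

```r
trial <- function(n) sum(runif(n) < 0.5)   # as above
res <- replicate(100, trial(50))
hist(res, main="Heads in 50 flips, 100 trials")
```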
`for` loops shine where the output of one iteration depends on the result of the previous iteration.
Suppose you wanted to model a random walk. Every time step, with 50% probability, move left or right.
- Start at position 0
- Move left or right with probability p (0.5 = unbiased)
- Update the position
Let’s abstract the update into a function:
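The steps above, wrapped into a single update function (a sketch):

```r
# One update of the walk: move left or right with probability p.
step <- function(position, p=0.5)
  position + if (runif(1) < p) -1 else 1
```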
Repeat a bunch of times:
To find out where we got to after 20 steps:
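With the `step` function above (repeated here so this snippet runs on its own), that is:

```r
step <- function(position, p=0.5)
  position + if (runif(1) < p) -1 else 1

position <- 0
for (i in 1:20)
  position <- step(position)
position   # where we ended up after 20 steps
```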
If we want to collect where we’re up to at the same time:
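we can store each position in a vector as we go (a sketch):

```r
step <- function(position, p=0.5)
  position + if (runif(1) < p) -1 else 1

history <- numeric(20)
position <- 0
for (i in seq_len(20)) {
  position <- step(position)
  history[i] <- position
}
```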
Pulling that into a function:
We can then do 30 random walks:
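A sketch of the function and the 30 walks (`matplot` draws one line per column):

```r
step <- function(position, p=0.5)
  position + if (runif(1) < p) -1 else 1

random.walk <- function(nsteps, p=0.5) {
  position <- 0
  history <- numeric(nsteps)
  for (i in seq_len(nsteps)) {
    position <- step(position, p)
    history[i] <- position
  }
  history
}

walks <- replicate(30, random.walk(100))   # one column per walk
matplot(walks, type="l", lty=1, xlab="step", ylab="position")
```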
Of course, in this case, if we think in terms of vectors we can actually implement the random walk using implicit vectorisation:
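For example, via `cumsum` over a vector of ±1 steps:

```r
# Same interface as before, but no explicit loop:
random.walk <- function(nsteps, p=0.5)
  cumsum(ifelse(runif(nsteps) < p, -1, 1))
```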
Which reinforces one of the advantages of thinking in terms of functions: you can change the implementation detail without the rest of the program changing.