# Thread: Pearson correlation in raster datasets

1. ## Pearson correlation in raster datasets

Dear all,

I need to make a correlation between two raster datasets. I want to use Raster Calculator, but I'm still finding a bit difficult to work it out. I have the formula from statistics books, and I know that I'm required to multiply, and add the grids, right? But it's not so straight forward.
How am I supposed to find N, for instance? Should I just multiply the Columns and Rows from my grid to see how many pixels I have on the dataset?
The formula for the Pearson correlation coefficient is quite long and I don't think I should attempt to do everything in one go. I reckon I should do it by different stages.
Any ideas would be very welcome!

Just a couple more things about the two datasets: one of them is a NDVI that I acquired from LANDSAT 5TM, and the other raster has been originated from an interpolation of points (rainfall gauges). I have used Mask on both grids, therefore they have the same number or rows and columns (same number of pixels?).

Thanks for your time.
Alan.

2. ## Re: Pearson correlation in raster datasets

Good questions, Alan. Let's go over the points one by one:
• One way to find N is to create a 1/NoData indicator grid for the overlap of the two input grids (which I will call [X] and [Y]). Use this indicator as the zone for the zonal summaries needed to find the correlation. In particular, N is the zonal sum of the indicator itself--but as you will see below, you won't need it.
• One way to create an indicator of the overlap is to compute SetNull( IsNull( [X]*[Y] ), 1 ).
• Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
• Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing.
• Yes, it's an excellent idea to do the calculation in stages. That lets you check intermediate results and provides diagnostics in case anything goes wrong.
You have to worry a bit about numeric instability because grids can be big and SA does its calculations in only single precision. Thus, I would avoid some of the old-fashioned abbreviated formulas that involve subtracting two potentially large sums. In this particular case, a good approach to consider is to standardize each grid first. The correlation is just the average product of the standardized grids. A subtle point is that you must first mask the input grids to their common overlap (which I think is something you have already done). Having done that, zonal summaries of [X] and [Y] will produce grids of their means ([mX] and [mY], say) and standard deviations ([sX] and [sY]). It seems stupid to output the zonal summaries as grids--each one has just a single constant value--but it is the easiest way to go. With these grids in hand, the standardized ones are computed easily as
xi = ([X] - [mX])/[sX]
and
eta = ([Y] - [mY])/[sY]
Finally, compute the zonal average of [xi]*[eta]. You probably want to output this in table form: it will be a table with a single record (for the one zone you're working in) and a single field (the average of [xi]*[eta]). That number is the correlation coefficient.

NB: This procedure uses a somewhat non-standard definition of correlation because SA does not compute standard deviations in a standard way (!). However, it differs by so little from the standard definition that there's no reason to worry. I mention it in case you are double-checking your results against other formulas or software, perhaps using small test grids where the difference will show up in one of the later decimal places.

I'll end with a comment: the correlation between [NDVI] and an interpolated rainfall grid may tell you more about the interpolation method than about rainfall. To avoid confounding the interpolator with rainfall, consider extracting [NDVI] values at the points where you have rainfall measurements to produce a table of ([NDVI], [Rainfall]) pairs. Compute the correlation between these in the usual way (via field calculations, a spreadsheet, a statistical package, or whatever). This is generally a good approach, but it's especially worth considering for rainfall, which is notoriously difficult to interpolate accurately.

3. ## Re: Pearson correlation in raster datasets

Bill,

Thank you very much for your reply. I will try what you suggested and I should be coming back here with either some results for discussions (or more questions!) if that's okay.

Just a couple of things regarding the interpolation. My area of study is data scarce and I didn't have a choice. I would have only three points of NDVI where the data was actually collected. I think I will take the risk and still use the interpolation/NDVI analysis - or do you think the results would be worthless?

Regards,
Alan.

4. ## Re: Pearson correlation in raster datasets

Hi again Bill, I hope you'll see this.

I'm trying to get my head around some of your instructions, but with no success. Now I have the indicator grid for the overlapping datasets, but I'm very confused with how to use it in the calculations. I don't think I quite understood the following:

•Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
•Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing
That went straight over my head - really sorry.
If you have the chance to clarify that - it would be so much appreciated. I will keep on trying in the meantime.

Regards,
Alan.

5. ## Re: Pearson correlation in raster datasets

If you have NDVI and rainfall at only three common points, you are relying almost entirely on the interpolation, Alan. My first choice would be to get the NDVI at the other rainfall measurement locations. If that's truly difficult, at least consider a sensitivity analysis: vary the interpolation as radically as you can. (This can be done well with stochastic simulation; but if you don't want to get into that, use some sharply different interpolators, such as nearest-neighbor, IDW with several different powers, and cubic splines.) The results that hold up regardless of the interpolation method have some claim to being real.

As far as the mechanics of translating formulas to raster operations go, it's really simple. Consider first a textbook formula for standardizing data. Let the data be (x_1, x_2, ..., x_n). Call their mean m and their standard deviation s. The formula for the standardized values xi_i, i = 1, 2, ..., n, would read
xi_i = (x_i - m)/s.
On a raster the indexes i represent individual cells. Therefore the mean and standard deviation are summaries of the data over the grid: they are numbers. I'll get to that in a moment, to show how to create two new grids whose values everywhere are equal to m and s, respectively. With map algebra you don't have to reference the indexes, because all local operations are automatically done cell-by-cell. So the map algebra syntax just drops the i's and puts brackets [] around the names of existing raster datasets to indicate they are grids, as in
xi = ([x] - [m])/[s]
(Notice the lack of brackets around the name of the new grid. Also, the new name "xi" refers to the Greek letter corresponding to "x"; it has nothing to do with "x" and the index "i" mashed together!)

Next consider a formula involving a summation. The mean is a good example, where a textbook formula is
m = Sum{x_i, i = 1 to n} / n
You obtain such summaries with zonal statistics. The zonal average of an entire grid does the following:
• For each category in the zone grid, it locates the cells having that category.
• For each such cell in the category, it obtains the value in the corresponding cell of the value grid.
• It computes the desired statistical summary, such as the mean, sum, standard deviation, or whatever.
• It copies that summary value into every corresponding cell of the zone grid.
Thus, this textbook formula is implemented in two steps:
1. Create a zone grid having one constant value at every cell of the [x] grid.
2. Compute the zonal average grid for [x] based on this zone and call it m.
A zonal standard deviation will compute the s grid for you.

Zonal summaries are available from the SA menu, as ArcTools, and via the Raster Calculator, Command Line, SOMA tool, and Model Builder.

6. ## Re: Pearson correlation in raster datasets

Thanks a lot for that, Bill. You've been a great help. Very much appreciated.

7. ## Re: Pearson correlation in raster datasets

Bill,

Following the instructions you gave me above, I still have some doubts if what I'm doing is correct.
I followed the steps, but some there's some confusion:

In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?

Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?

8. ## Re: Pearson correlation in raster datasets

Hi Bill,

I think I've figured it out now. And yes, it was very simple. But feel free to make any comments. I'm new to all this, so any reassurance is more than welcome.

Many thanks.
Alan.

9. ## Re: Pearson correlation in raster datasets

Originally Posted by alanfer17
In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?
There's only one zone grid. It determines which cells will be summed over. (More generally, a zone grid can contain multiple categories. Zonal sums are then performed separately for the cells within each zone. This lets you do a whole lot of calculations at once, provided only that they take place over non-overlapping regions.) Thus, whenever you see something like

\Sigma_{i=1}^{n} ...

in a textbook, you know it will be implemented as a zonal sum (or zonal average). All the stuff coming after the summation symbol \Sigma has to be computed, cell by cell, in advance so you can sum over its cells.

Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?
Just multiply: the average automatically takes care of the division by N.

10. ## Re: Pearson correlation in raster datasets

Bill,

I'm getting very strange results for the correlations, so I'm assuming there's something wrong with the way I'm creating the zone grid, unfortunately. I don't seem to get this one constant value at every cell that you mentioned above.
Is the formula for the standardized values xi_i = (x_i - m)/s the one that should be used to create this grid?
I've used the indicator to produce grids of the means ([mX] and [mY], and standard deviations ([sX] and [sY]). But in the formula above, what will be the x_i? I know this should be a very straight forward operation, but I'm sure what I'm doing is wrong.

Thank you.
Alan.

11. ## Re: Pearson correlation in raster datasets

There is only one zone grid for the entire set of calculations, Alan: it must have a constant value at all cells that will be involved in your calculation and have NoData values elsewhere. A zonal sum of some other grid "Z" using this grid for the zones is the same as performing a summation over the values found in the cells of [Z] that overlap the zone cells. If we index those cells with a letter "i" and let their count be "N", so that i runs from 1 through N, then a zonal sum of [Z] is obtained as the sum of the z_i for i = 1 to N and the zonal mean of [Z] is obtained as 1/N times the zonal sum. Thus, for example, if [Z] has been computed as ([x] - [m])/[s] for grids [x] (the original data), [m] (a constant grid equal to the mean of [x] throughout the zone), and [s] (a constant grid equal to the sd of [x] within the zone), then by definition cell i of [Z] is the value (x_i - m_i)/s_i and the zonal mean of [Z] equals
Sum_i=1^n { (x_i - m_i)/s_i }

= Sum_i=1^n{ (x_i - m)/s }
due to the constancy of [m] and [s]. Thus [Z] contains the standardized values of [x].

12. ## Re: Pearson correlation in raster datasets

Bill,
My question remains:
if I’m computing Z as ([x] - [m])/[s] for grids [x] (the original data), [m] (a constant grid equal to the mean of [x] throughout the zone), and [s] (a constant grid equal to the sd of [x] within the zone),
by [x] (original data) do you mean:
1) – the sum of [x] using the INDICATOR grid (created earlier) in the “feature zone” field on SA Zonal Statistics, VALUE in the SA “zone field”, and the ORIGINAL DATA [X] as the “input raster” field?
If that’s the case, my result will be a range of values (high to low), which will give me 2 values for a correlation in the end. Note that I will be doing the same thing for the [Y] original value, which will also give me “high to low” values.
2) – if by [x] (original data) you mean the raw raster data (say, my NDVI data), I will also have “high and low” values.
Basically, my only question is: what exactly is [x] in the formula ([x] - [m])/[s]?

Sorry for being such a hard work.

13. ## Re: Pearson correlation in raster datasets

Originally Posted by alanfer17
... if I’m computing Z as ([x] - [m])/[s] for grids [x] (the original data), ... what exactly is [x] in the formula ([x] - [m])/[s]?
The answer is right there in your question, Alan! [x] holds the original data and [z] will hold the standardized version of the same data.

(The change from [x] to [z] is exactly like the change from measuring a temperature in degrees Fahrenheit to degrees Kelvin: there is a shift of the zero, accomplished by subtracting [m], and a change of units, accomplished by dividing by [s]. Because the average of [x] is [m], the average of [z] -- being [m] less -- is zero; because the standard deviation of [x] is [s], the standard deviation of [z] -- being 1/[s] as large -- is one.)

The principle here is that the Pearson correlation between any two grids [x] and [y] is their average product, provided both [x] and [y] have been re-expressed to have averages of zero and standard deviations of one. Thus, you need only know how to accomplish three things: standardizing values of a grid, multiplying two grids pointwise (that is, a local multiplication), and finding their averages.

Finally, note that the average value of any grid (such as the product of the standardized 'x' and standardized 'y') is always a single number; there's no possibility here of obtaining a range of answers!

14. ## Re: Pearson correlation in raster datasets

This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. R has no problem calculating correlations on data that are provided as an "array", i.e. rows and columns. There is no need to reinvent the wheel, although it does make for an interesting excercise.

My advice would be to abandon ArcGIS for the heavy lifting and just download R. It's notoriously ugly to program, but with a few Google searches I bet you could write correlation scripts in a couple of hours. Using Python, R, and rpy2 I've written script tools to automate this procedure. But....

One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.

I'll get off my soapbox now, but this is an area in which I would love to see more improvement.

15. ## Re: Pearson correlation in raster datasets

Originally Posted by philmorefield
This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. ...
Excellent points. Using R to compute correlation coefficients is overkill: you can compute them easily enough with map algebra (as I outlined). But R offers (literally) thousands of sophisticated capabilities that are difficult or impossible to compute with map algebra. A seamless GIS-R connection would be a powerful, extensible tool.

One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.
Correlation coefficients are not made "unreliable" by the presence of spatial autocorrelation. What one has to be concerned about is the size of estimation errors in regression coefficients when the residuals of the dependent variable are spatially autocorrelated. This, IMHO, is less of a problem with true grids of data (that is, with grids not derived by interpolating or resampling sparser datasets). In general, assessing the potential for unreliability in estimates depends on the application, the data, how the data are modeled, and what decisions the analysis is intended to support.

16. ## Re: Pearson correlation in raster datasets

In ArcInfo workstation this was as easy as typing the following at the grid prompt.

correlation grid1 grid2

ESRI - why have we lost functions with ArcGIS?

17. ## Re: Pearson correlation in raster datasets

What is the difference between using the above procedure and using band statistics to calculate the correlation between rasters? I am hoping to find the correlations between my environmental variables prior to further analysis in maxent. Any thoughts would be greatly appreciated.
Thanks,
Jessica

18. ## Re: Pearson correlation in raster datasets

The band statistics procedure does compute correlations and covariances. Unlike almost all other related calculations (of correlations, covariances, variances, and standard deviations) throughout ArcGIS, it uses 1/(n-1) in the denominator rather than 1/n. However, for all but the tiniest grids this difference in covariances is inconsequential and it makes no difference at all for the correlations (except for grids with exactly one non-NoData cell).

19. ## Re: Pearson correlation in raster datasets

Like Jessica I am also trying to find which of my environmental layers are highly correlated before using them in Maxent. The correlation matrix that it produces has many values outside of the -1 to 1 range, can anyone tell me what the problem is there? (I'm using the worldclim 1-19 layers, 30arc seconds, clipped to Europe).

20. ## Re: Pearson correlation in raster datasets

Exactly how are you obtaining the correlation matrix, Sarah, and how far outside the [-1,1] range are the values? Several methods have been mentioned in this thread. If you follow the instructions I gave at the outset, it is mathematically impossible to obtain values outside the range [-1,1]. Just to be sure, I have executed that workflow with three small grids and it obtained correct values for the inter-grid correlations. For the diagonal entries, which should all be 1.0, it computed values that were a tiny bit too high (in the last decimal place) due to floating point roundoff error.