+ Reply to Thread
Results 1 to 20 of 20

Thread: Pearson correlation in raster datasets

  1. #1
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Pearson correlation in raster datasets

    Dear all,

    I need to make a correlation between two raster datasets. I want to use Raster Calculator, but I'm still finding a bit difficult to work it out. I have the formula from statistics books, and I know that I'm required to multiply, and add the grids, right? But it's not so straight forward.
    How am I supposed to find N, for instance? Should I just multiply the Columns and Rows from my grid to see how many pixels I have on the dataset?
    The formula for the Pearson correlation coefficient is quite long and I don't think I should attempt to do everything in one go. I reckon I should do it by different stages.
    Any ideas would be very welcome!

    Just a couple more things about the two datasets: one of them is a NDVI that I acquired from LANDSAT 5TM, and the other raster has been originated from an interpolation of points (rainfall gauges). I have used Mask on both grids, therefore they have the same number or rows and columns (same number of pixels?).

    Thanks for your time.
    Alan.

  2. #2
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    Good questions, Alan. Let's go over the points one by one:
    • One way to find N is to create a 1/NoData indicator grid for the overlap of the two input grids (which I will call [X] and [Y]). Use this indicator as the zone for the zonal summaries needed to find the correlation. In particular, N is the zonal sum of the indicator itself--but as you will see below, you won't need it.
    • One way to create an indicator of the overlap is to compute SetNull( IsNull( [X]*[Y] ), 1 ).
    • Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
    • Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing.
    • Yes, it's an excellent idea to do the calculation in stages. That lets you check intermediate results and provides diagnostics in case anything goes wrong.
    You have to worry a bit about numeric instability because grids can be big and SA does its calculations in only single precision. Thus, I would avoid some of the old-fashioned abbreviated formulas that involve subtracting two potentially large sums. In this particular case, a good approach to consider is to standardize each grid first. The correlation is just the average product of the standardized grids. A subtle point is that you must first mask the input grids to their common overlap (which I think is something you have already done). Having done that, zonal summaries of [X] and [Y] will produce grids of their means ([mX] and [mY], say) and standard deviations ([sX] and [sY]). It seems stupid to output the zonal summaries as grids--each one has just a single constant value--but it is the easiest way to go. With these grids in hand, the standardized ones are computed easily as
    xi = ([X] - [mX])/[sX]
    and
    eta = ([Y] - [mY])/[sY]
    Finally, compute the zonal average of [xi]*[eta]. You probably want to output this in table form: it will be a table with a single record (for the one zone you're working in) and a single field (the average of [xi]*[eta]). That number is the correlation coefficient.

    NB: This procedure uses a somewhat non-standard definition of correlation because SA does not compute standard deviations in a standard way (!). However, it differs by so little from the standard definition that there's no reason to worry. I mention it in case you are double-checking your results against other formulas or software, perhaps using small test grids where the difference will show up in one of the later decimal places.

    I'll end with a comment: the correlation between [NDVI] and an interpolated rainfall grid may tell you more about the interpolation method than about rainfall. To avoid confounding the interpolator with rainfall, consider extracting [NDVI] values at the points where you have rainfall measurements to produce a table of ([NDVI], [Rainfall]) pairs. Compute the correlation between these in the usual way (via field calculations, a spreadsheet, a statistical package, or whatever). This is generally a good approach, but it's especially worth considering for rainfall, which is notoriously difficult to interpolate accurately.
    Last edited by whuber; 09-28-2010 at 07:01 AM.
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  3. #3
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Bill,

    Thank you very much for your reply. I will try what you suggested and I should be coming back here with either some results for discussions (or more questions!) if that's okay.

    Just a couple of things regarding the interpolation. My area of study is data scarce and I didn't have a choice. I would have only three points of NDVI where the data was actually collected. I think I will take the risk and still use the interpolation/NDVI analysis - or do you think the results would be worthless?

    Regards,
    Alan.

  4. #4
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Hi again Bill, I hope you'll see this.

    I'm trying to get my head around some of your instructions, but with no success. Now I have the indicator grid for the overlapping datasets, but I'm very confused with how to use it in the calculations. I don't think I quite understood the following:

    •Wherever you see an ordinary unary or binary arithmetic operation in the formula (including, but not limited to, negation, addition, subtraction, multiplication, division, and powers), use the same one in Map Algebra.
    •Wherever you see a sum over cases compute a zonal sum. If the sum is divided by N it's really an average so just use a zonal average for the whole thing
    That went straight over my head - really sorry.
    If you have the chance to clarify that - it would be so much appreciated. I will keep on trying in the meantime.

    Regards,
    Alan.

  5. #5
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    If you have NDVI and rainfall at only three common points, you are relying almost entirely on the interpolation, Alan. My first choice would be to get the NDVI at the other rainfall measurement locations. If that's truly difficult, at least consider a sensitivity analysis: vary the interpolation as radically as you can. (This can be done well with stochastic simulation; but if you don't want to get into that, use some sharply different interpolators, such as nearest-neighbor, IDW with several different powers, and cubic splines.) The results that hold up regardless of the interpolation method have some claim to being real.

    As far as the mechanics of translating formulas to raster operations go, it's really simple. Consider first a textbook formula for standardizing data. Let the data be (x_1, x_2, ..., x_n). Call their mean m and their standard deviation s. The formula for the standardized values xi_i, i = 1, 2, ..., n, would read
    xi_i = (x_i - m)/s.
    On a raster the indexes i represent individual cells. Therefore the mean and standard deviation are summaries of the data over the grid: they are numbers. I'll get to that in a moment, to show how to create two new grids whose values everywhere are equal to m and s, respectively. With map algebra you don't have to reference the indexes, because all local operations are automatically done cell-by-cell. So the map algebra syntax just drops the i's and puts brackets [] around the names of existing raster datasets to indicate they are grids, as in
    xi = ([x] - [m])/[s]
    (Notice the lack of brackets around the name of the new grid. Also, the new name "xi" refers to the Greek letter corresponding to "x"; it has nothing to do with "x" and the index "i" mashed together!)

    Next consider a formula involving a summation. The mean is a good example, where a textbook formula is
    m = Sum{x_i, i = 1 to n} / n
    You obtain such summaries with zonal statistics. The zonal average of an entire grid does the following:
    • For each category in the zone grid, it locates the cells having that category.
    • For each such cell in the category, it obtains the value in the corresponding cell of the value grid.
    • It computes the desired statistical summary, such as the mean, sum, standard deviation, or whatever.
    • It copies that summary value into every corresponding cell of the zone grid.
    Thus, this textbook formula is implemented in two steps:
    1. Create a zone grid having one constant value at every cell of the [x] grid.
    2. Compute the zonal average grid for [x] based on this zone and call it m.
    A zonal standard deviation will compute the s grid for you.

    Zonal summaries are available from the SA menu, as ArcTools, and via the Raster Calculator, Command Line, SOMA tool, and Model Builder.
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  6. #6
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Thanks a lot for that, Bill. You've been a great help. Very much appreciated.

  7. #7
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Bill,

    Following the instructions you gave me above, I still have some doubts if what I'm doing is correct.
    I followed the steps, but some there's some confusion:

    In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?

    Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?

  8. #8
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Hi Bill,

    I think I've figured it out now. And yes, it was very simple. But feel free to make any comments. I'm new to all this, so any reassurance is more than welcome.

    Many thanks.
    Alan.

  9. #9
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    Quote Originally Posted by alanfer17 View Post
    In the formula xi_i = (x_i - m)/s, for instance, the zone grid is x_i, correct? I'm not sure how to obtain that zone grid (and I'm sure this is really simple!). Is this anything to do with the indicator grid I created initially?
    There's only one zone grid. It determines which cells will be summed over. (More generally, a zone grid can contain multiple categories. Zonal sums are then performed separately for the cells within each zone. This lets you do a whole lot of calculations at once, provided only that they take place over non-overlapping regions.) Thus, whenever you see something like

    \Sigma_{i=1}^{n} ...

    in a textbook, you know it will be implemented as a zonal sum (or zonal average). All the stuff coming after the summation symbol \Sigma has to be computed, cell by cell, in advance so you can sum over its cells.


    Another question is regarded the zonal average of [xi]*[eta] that you mentioned. Do I obtain that by simply multiplying those two grids and dividing it by N?
    Just multiply: the average automatically takes care of the division by N.
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  10. #10
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Bill,

    I'm getting very strange results for the correlations, so I'm assuming there's something wrong with the way I'm creating the zone grid, unfortunately. I don't seem to get this one constant value at every cell that you mentioned above.
    Is the formula for the standardized values xi_i = (x_i - m)/s the one that should be used to create this grid?
    I've used the indicator to produce grids of the means ([mX] and [mY], and standard deviations ([sX] and [sY]). But in the formula above, what will be the x_i? I know this should be a very straight forward operation, but I'm sure what I'm doing is wrong.

    Thank you.
    Alan.

  11. #11
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    There is only one zone grid for the entire set of calculations, Alan: it must have a constant value at all cells that will be involved in your calculation and have NoData values elsewhere. A zonal sum of some other grid "Z" using this grid for the zones is the same as performing a summation over the values found in the cells of [Z] that overlap the zone cells. If we index those cells with a letter "i" and let their count be "N", so that i runs from 1 through N, then a zonal sum of [Z] is obtained as the sum of the z_i for i = 1 to N and the zonal mean of [Z] is obtained as 1/N times the zonal sum. Thus, for example, if [Z] has been computed as ([x] - [m])/[s] for grids [x] (the original data), [m] (a constant grid equal to the mean of [x] throughout the zone), and [s] (a constant grid equal to the sd of [x] within the zone), then by definition cell i of [Z] is the value (x_i - m_i)/s_i and the zonal mean of [Z] equals
    Sum_i=1^n { (x_i - m_i)/s_i }

    = Sum_i=1^n{ (x_i - m)/s }
    due to the constancy of [m] and [s]. Thus [Z] contains the standardized values of [x].
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  12. #12
    Alan Fernandes
    Join Date
    Sep 2010
    Posts
    8
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Bill,
    My question remains:
    if I’m computing Z as ([x] - [m])/[s] for grids [x] (the original data), [m] (a constant grid equal to the mean of [x] throughout the zone), and [s] (a constant grid equal to the sd of [x] within the zone),
    by [x] (original data) do you mean:
    1) – the sum of [x] using the INDICATOR grid (created earlier) in the “feature zone” field on SA Zonal Statistics, VALUE in the SA “zone field”, and the ORIGINAL DATA [X] as the “input raster” field?
    If that’s the case, my result will be a range of values (high to low), which will give me 2 values for a correlation in the end. Note that I will be doing the same thing for the [Y] original value, which will also give me “high to low” values.
    2) – if by [x] (original data) you mean the raw raster data (say, my NDVI data), I will also have “high and low” values.
    Basically, my only question is: what exactly is [x] in the formula ([x] - [m])/[s]?

    Sorry for being such a hard work.

  13. #13
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    Quote Originally Posted by alanfer17 View Post
    ... if I’m computing Z as ([x] - [m])/[s] for grids [x] (the original data), ... what exactly is [x] in the formula ([x] - [m])/[s]?
    The answer is right there in your question, Alan! [x] holds the original data and [z] will hold the standardized version of the same data.

    (The change from [x] to [z] is exactly like the change from measuring a temperature in degrees Fahrenheit to degrees Kelvin: there is a shift of the zero, accomplished by subtracting [m], and a change of units, accomplished by dividing by [s]. Because the average of [x] is [m], the average of [z] -- being [m] less -- is zero; because the standard deviation of [x] is [s], the standard deviation of [z] -- being 1/[s] as large -- is one.)

    The principle here is that the Pearson correlation between any two grids [x] and [y] is their average product, provided both [x] and [y] have been re-expressed to have averages of zero and standard deviations of one. Thus, you need only know how to accomplish three things: standardizing values of a grid, multiplying two grids pointwise (that is, a local multiplication), and finding their averages.

    Finally, note that the average value of any grid (such as the product of the standardized 'x' and standardized 'y') is always a single number; there's no possibility here of obtaining a range of answers!
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  14. #14
    Phil Morefield
    Join Date
    Apr 2010
    Posts
    133
    Points
    59
    Answers Provided
    8


    0

    Default Re: Pearson correlation in raster datasets

    This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. R has no problem calculating correlations on data that are provided as an "array", i.e. rows and columns. There is no need to reinvent the wheel, although it does make for an interesting excercise.

    My advice would be to abandon ArcGIS for the heavy lifting and just download R. It's notoriously ugly to program, but with a few Google searches I bet you could write correlation scripts in a couple of hours. Using Python, R, and rpy2 I've written script tools to automate this procedure. But....

    One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.

    I'll get off my soapbox now, but this is an area in which I would love to see more improvement.

  15. #15
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    Quote Originally Posted by philmorefield View Post
    This is a great example of how incorporating R Statistical Software into ArcGIS would be a huge benefit. ...
    Excellent points. Using R to compute correlation coefficients is overkill: you can compute them easily enough with map algebra (as I outlined). But R offers (literally) thousands of sophisticated capabilities that are difficult or impossible to compute with map algebra. A seamless GIS-R connection would be a powerful, extensible tool.

    One issue I haven't been able to resolve is the reliability of correlation coefficients when the data are spatially autocorrelated, which may be a problem.
    Correlation coefficients are not made "unreliable" by the presence of spatial autocorrelation. What one has to be concerned about is the size of estimation errors in regression coefficients when the residuals of the dependent variable are spatially autocorrelated. This, IMHO, is less of a problem with true grids of data (that is, with grids not derived by interpolating or resampling sparser datasets). In general, assessing the potential for unreliability in estimates depends on the application, the data, how the data are modeled, and what decisions the analysis is intended to support.
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  16. #16
    Guy Duke
    Join Date
    Apr 2010
    Posts
    44
    Points
    3
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    In ArcInfo workstation this was as easy as typing the following at the grid prompt.

    correlation grid1 grid2

    ESRI - why have we lost functions with ArcGIS?

  17. #17
    Jessica Berryman
    Join Date
    Mar 2011
    Posts
    2
    Points
    0
    Answers Provided
    0


    0

    Unhappy Re: Pearson correlation in raster datasets

    What is the difference between using the above procedure and using band statistics to calculate the correlation between rasters? I am hoping to find the correlations between my environmental variables prior to further analysis in maxent. Any thoughts would be greatly appreciated.
    Thanks,
    Jessica

  18. #18
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    The band statistics procedure does compute correlations and covariances. Unlike almost all other related calculations (of correlations, covariances, variances, and standard deviations) throughout ArcGIS, it uses 1/(n-1) in the denominator rather than 1/n. However, for all but the tiniest grids this difference in covariances is inconsequential and it makes no difference at all for the correlations (except for grids with exactly one non-NoData cell).
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

  19. #19
    Sarah Pene
    Join Date
    Dec 2011
    Posts
    2
    Points
    0
    Answers Provided
    0


    0

    Default Re: Pearson correlation in raster datasets

    Like Jessica I am also trying to find which of my environmental layers are highly correlated before using them in Maxent. The correlation matrix that it produces has many values outside of the -1 to 1 range, can anyone tell me what the problem is there? (I'm using the worldclim 1-19 layers, 30arc seconds, clipped to Europe).

  20. #20
    William Huber

    Join Date
    Apr 2010
    Posts
    694
    Points
    73
    Answers Provided
    2


    0

    Default Re: Pearson correlation in raster datasets

    Exactly how are you obtaining the correlation matrix, Sarah, and how far outside the [-1,1] range are the values? Several methods have been mentioned in this thread. If you follow the instructions I gave at the outset, it is mathematically impossible to obtain values outside the range [-1,1]. Just to be sure, I have executed that workflow with three small grids and it obtained correct values for the inter-grid correlations. For the diagonal entries, which should all be 1.0, it computed values that were a tiny bit too high (in the last decimal place) due to floating point roundoff error.
    --Bill Huber
    Quantitative Decisions
    For more help, visit the worldwide community at http://gis.stackexchange.com

+ Reply to Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts