STATISTICS PRACTICAL – Environmental Data Acquisition and Analysis

Statistics Practical

MSc Conservation and MSc ACER

PREPARATION – software and opening / accessing “desktop@ucl” and the App store

You will likely need to work within “desktop@ucl”, a software environment that contains the programmes and software packages required (some only after they have been added to your work environment via the respective UCL App store) and is accessible remotely to all students registered at UCL. Please ensure that you have strong virus protection on your computer whenever you are logging into UCL networks. Generally, UCL also offers software packages (including some of the ones we are using in this practical) for download onto your own machines as long as you are enrolled at UCL. Please see the following link: https://www.ucl.ac.uk/isd/how-to/ucl-software-database-accessing-software-database

In the meantime, if you are NOT working on a standard UCL computer (where you should be able to automatically log into desktop@ucl – unless you are using one of our UNIX machines, which in this context are not “standard” UCL computers) and want to start desktop@ucl, I believe the following is the best way forward:

1. Get direct access to the UCL network using the “UCL Virtual Private Network (VPN)” if possible (as far as I understand, this means that you are then technically working directly within the UCL network, i.e. you can use your computer in the same way as anyone using the computers on campus). This, I believe, will give you a faster and more secure connection when accessing UCL-based files and programmes, and will also allow you to transfer files more easily between the virtual desktop@UCL environment and your own computer. Information on how to install the VPN on your computer can be found from links on the page:

https://www.ucl.ac.uk/isd/services/get-connected/ucl-virtual-private-network-vpn

2. Download and install the Citrix Workspace and log in to Desktop@UCL Anywhere

As the name suggests, you can use desktop@ucl fully remotely, even if you are not connected / cannot connect to the UCL VPN – but I would still recommend that you first install and log into the VPN prior to opening the desktop if you can. In any case, please see and follow the instructions at the following link:

https://www.ucl.ac.uk/isd/services/computers/remote-access/desktopucl-anywhere

As stated in the instructions for this service, I would highly recommend that you first download and install the Citrix Workspace before logging into the desktop@UCL Anywhere service, since otherwise (i.e. if you just use the normal browser option, which is in a way the “last resort” way forward) you will not have access to the full variety of options available to you.

I RECOMMEND THAT YOU THEN ALWAYS WORK WITHIN THE DESKTOP@UCL ENVIRONMENT FOR THIS PRACTICAL!

3. As I noted above, if you use the UCL VPN, you can also directly get access to your personal, private OneDrive file storage on the UCL system, which gives you 100GB of storage space and is constantly backed up – and which can be accessed, I believe, from any computer running the UCL VPN network client as well as generally via the desktop@UCL. Problems should only occur if there are general problems with the UCL network. It is a good idea to use the OneDrive for all your work for this practical (and potentially also for any other UCL-related work, since most of you, I would imagine, are not going to use in excess of 100GB of data space) – but also to keep an additional back-up of the most up-to-date versions of your work on your own computer hard-drive or a separate hard-drive. If you connect the OneDrive directly to your own computer as effectively a network drive, you can very easily transfer data / files backwards and forwards, giving you easy access to the regularly updated and secure additional storage space that is the UCL-managed OneDrive. For further information and instructions, please see:

https://www.ucl.ac.uk/isd/services/file-storage-sharing

4. There are a number of software packages that I would highly recommend you download onto your own computer, if you have the capacity. This will make working through the practical more robust, since you won’t be relying on the internet connection for the respective parts of the practical (although I would still recommend that you upload the final results of your analysis onto your OneDrive). I hence recommend you to download and install the following software packages on your computer:

1. From the UCL software database: SPSS XX (ensure that you install this as a stand-alone, not as an “on-site” (i.e. concurrent license) installation – see the download guidance document). XX stands for the newest version available. The direct link to the UCL database SPSS page should be http://swdb.ucl.ac.uk/package/view/id/1?filter=SPSS

2. For biodiversity calculations, I would strongly recommend that you download EstimateS from https://www.robertkcolwell.org/pages/1407. The current version is 9.1, while I believe some versions run on the desktop are slightly older. Either should nonetheless work fine for the practicals – although they might not work on Mac computers. In that case, you might have to briefly use a UCL computer on campus.

Chapter 1 – Initial investigation and pre-processing of the data

1.1 Initial investigations:

•   Open your Moodle course pages for GEOG0106 and locate the “historic data” Excel data file provided (“EDAA 2020.xlsx”). Copy this file into your work directory (into OneDrive or your UCL desktop - where you can easily remember the path).

•   Open the file in Excel. Explore the data sheets. As you can see, the data has been split into multiple worksheets containing both species data (named “Plants” or “Beetles”, respectively) and environmental data (named “Env”) for the saltmarsh (SaltMa) and the Ponds, with an additional spreadsheet detailing the management of the ponds, and with further empty spreadsheets to be filled by you with the processed data for multivariate analysis.

Please, note: for the saltmarsh, this is split into two sections by a shingle ridge, resulting in roughly separated sections of lower marsh (LM) as the area closest to the sea, the shingle ridge (SR) and the upper marsh (UM) from the shingle ridge to the terrestrial boundary of the saltmarsh environment. Please, also note that I have allocated some plots close to both the SR and one of the saltmarshes as provisional ecotone plots (LM/SR and SR/UM). This allocation might make sense, but should be looked at carefully, i.e. you might treat this as a rough, first classification. For the ponds, you will see that some ponds’ information is actually written in red – these ponds are unrestored (UR) – or at least have been unrestored up to the point when the survey was taken. A lot of the ponds are furthermore located at Manor Farm (MF) and in the surrounding landscape, while I also included information from ponds that are located at a little distance to Manor Farm, and where I have included both data for an unmanaged state prior to the restoration event, as well as information post-restoration. This set of ponds is marked in green. You will also see that there are some orange markings in some of the spreadsheets where we did not have records for the respective parameters.

•   Open and explore one of the two Environmental Parameters (…env…) worksheets (for now, it does not matter which one, since I will ask you to repeat this with the other environmental worksheet once you have finished your exercises for “Chapter 1” with the first set of data):

a)  Which parameters are ECOLOGICALLY MEANINGFUL? This question relates to the ecological understanding you might already have of the respective ecosystems we are working with here. If you are not sure, you can keep all variables here, as well as under points b) and c) below. We can discuss these questions in more detail during our synchronous group meetings next week …

-    Delete all variables which seem irrelevant from an ecological perspective. Take notes on what you have deleted and why.

b)  Which parameters show very LITTLE VARIATION along the gradient? Do you want to delete them, or is it worthwhile keeping them nonetheless? Take account of the scales, e.g. log scale for pH values.

-    Delete variables which you believe might not provide relevant information. Again, please take notes on what you have deleted and why …

c)  Which parameters are very RARE (or commonly below detection limit)?

-    Delete any rare parameter(s) that you think are dispensable, again taking notes on what you have deleted and why …

d)  Which parameters show MISSING VALUES? Be careful: you can ONLY use complete data-sets for multivariate analysis, which means you might have to delete some plots where your data-set is INCOMPLETE. Please, NEVER just replace missing data with zero!

Please note down the variable(s) and plot number(s) of the respective case(s) for later consideration in the analysis:
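
If you would like to cross-check step d) outside Excel, the following is a minimal Python sketch (Python is not part of this practical; the plot labels, parameter names and values are invented for illustration). It lists the plots with gaps per parameter and the plots that are complete for all parameters, treating an empty cell (None) as missing while keeping a measured zero as a real value.

```python
# Hypothetical example data: one list of values per parameter, one entry per plot.
# None marks a missing measurement; 0.0 is a genuine measured value, not a gap.
plots = ["P1", "P2", "P3", "P4"]
env = {
    "pH": [7.1, 6.8, None, 7.4],
    "Conductivity": [410.0, 395.0, 520.0, 0.0],
}


def find_missing(plots, env):
    """Return {parameter: [plot labels without a value]} for parameters with gaps."""
    gaps = {}
    for name, values in env.items():
        missing_plots = [plot for plot, v in zip(plots, values) if v is None]
        if missing_plots:
            gaps[name] = missing_plots
    return gaps


def complete_plots(plots, env):
    """Return the plots that have a value for every parameter."""
    return [
        plot
        for i, plot in enumerate(plots)
        if all(values[i] is not None for values in env.values())
    ]


missing = find_missing(plots, env)
complete = complete_plots(plots, env)
print("Parameters with gaps:", missing)
print("Plots usable for multivariate analysis:", complete)
```

Note that the zero conductivity in P4 is kept as a measurement, in line with the rule never to treat missing data and zero as the same thing.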

e)  Are there any obvious OUTLIERS in the dataset?

To investigate this question, create FOUR NEW ROWS in your Excel worksheet entitled:

- MEAN: containing the mean (“=AVERAGE(…)”) for each remaining environmental parameter

- 4 STD: calculate 4* standard deviation (“=4*STDEV(…)”) for each parameter,

- MAX: one row containing the maximum value for each parameter (“=MAX(…)”), and

- MIN: one row containing the minimum value for each parameter (“=MIN(…)”)

Make sure that your new cells are formatted in Excel as numbers with TWO decimal places. The “(…)” above indicates the entirety of the fields containing values / measurements for an individual environmental parameter.

Are there any cases where the maximum or minimum is outside the range of the Mean ± 4 StD-units? (You may want to create two additional rows, Mean + 4STD and Mean - 4STD, to check this conveniently, using the formulae “=AVERAGE(…)+(4*STDEV(…))” and “=AVERAGE(…)-(4*STDEV(…))” to directly compare with your Max and Min values from above.) Are the respective values unreasonable, i.e. can these be clearly related to an error in the analysis?

Please note down the variable(s) and plot number(s) of the respective case(s) for later consideration in the analysis:
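
The Mean ± 4 StD check from step e) can also be sketched in Python, purely as an optional cross-check of your Excel rows. The data below are invented (30 hypothetical plots, one of them with a value that looks like a decimal-point slip); `stdev` is the sample standard deviation, which corresponds to Excel's STDEV.

```python
from statistics import mean, stdev


def flag_outliers(plots, values, k=4.0):
    """Return (plot, value) pairs lying outside mean +/- k sample standard deviations."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 in the denominator), like Excel's STDEV
    lower, upper = m - k * s, m + k * s
    return [(p, v) for p, v in zip(plots, values) if v < lower or v > upper]


# Hypothetical pH-like readings for 30 plots; the last one is deliberately suspicious.
plots = [f"P{i + 1}" for i in range(30)]
values = [7.0 + 0.1 * (i % 5) for i in range(29)] + [70.0]

flagged = flag_outliers(plots, values)
print(flagged)
```

One consequence of using the sample standard deviation is worth keeping in mind: with fewer than 18 values per parameter, no single value can lie more than 4 StD-units from the mean, so for small datasets this check will never flag anything and a look at the raw values remains necessary.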

ENSURE THAT YOU DELETE ALL NEW ROWS (or columns) THAT YOU HAVE JUST CREATED UNDER e), and SAVE THE FILE under the new name “PONDSTAT.xls” or “SALTMARSHSTAT.xls” before CLOSING it.

1.2 Pre-processing / data transformation

•   Open SPSS XX that you installed earlier on your computer, or locate SPSS on the desktop@UCL start menu (I believe the version installed on the desktop@UCL might still be an earlier one – I have adapted this practical for a stand-alone version of SPSS v. 26, so there might be slight differences as to where you find different options – please use the search function if you encounter any problems) and open the program.

•   From the start menu, select Open another file; change the “files of type” settings to Excel Data ..., navigate to your working directory (e.g. OneDrive or wherever you stored the Excel file) and open the file “PONDSTAT.xls” or “SALTMARSHSTAT.xls”. Select your worksheet with the Environmental Parameters you have worked on. Tick the box “Read variable names …” and leave other options as they are. Press “OK”.

•   Check for normal distribution in each environmental parameter, and mark in the table below which parameter seems problematic:

-   In SPSS, choose Analyze – Descriptive Statistics – Frequencies, select all environmental parameters as variables, make sure that all boxes under “Statistics” are NOT ticked, and at the “Charts” menu, select “Histograms” and “Show normal curve …”. Select “Continue” and “OK”. Wait, as the calculations can take a few moments... Now, scroll down the “Output” window of SPSS to locate the “Histogram” section. You will see a graph with “Frequency” and the respective parameter as axis labels, a black line showing the expected normal distribution for the data, and in blue the histogram, i.e. the real distribution of the data. Ideally, the blue bars should be closely aligned to the black line. The bar chart for pH from the saltmarsh data, for example, does not look too bad in this respect.

Take a good look at the histograms and the normal curves. Which variables seem problematic? Which transformations do you suggest? Take notes in the table on the next page.

-   In SPSS, now choose Analyze – Descriptive Statistics – Q-Q Plots. Select all environmental parameters as variables. “Test distribution” should be set to “normal”; leave other options unchanged. Press OK. The resulting Q-Q graphs (don't worry about the de-trended graphs!) should ideally show all points located very closely along the straight line.

Which parameters show a good fit, and which ones might need to be transformed? Again, note your conclusions in the table on the next page.

-    Close the output box (without saving) and rerun the analysis, but this time ticking the “Natural log transform” box.

What has changed? Do the problematic parameters show a better fit, now?

-    Now, run the Kolmogorov-Smirnov Test to evaluate which distributions deviate significantly from the normal distribution.

In SPSS, select Analyze – Nonparametric Tests – Legacy Dialogs – 1-Sample K-S, select all parameters, make sure “Normal” is selected (you are testing for normal distribution) and click “OK”.

Env. Parameter | Histogram | Q-Q graph | K-S Test

The important figure in the resulting test table is the number under “Asymp. Sig. …”. REMEMBER THAT A SIGNIFICANT RESULT (<0.05) shows that your parameter DOES NOT FOLLOW A NORMAL DISTRIBUTION!!! (i.e. your null hypothesis is that your parameters are normally distributed).

Note down again in the table above which parameters are significantly different from a normal distribution.
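
To illustrate what the K-S test does and how to read its significance value, here is a minimal Python sketch. It is an assumption-laden illustration, not a reproduction of the SPSS procedure: it compares the data with a normal distribution that uses the sample's own mean and standard deviation, and it uses the asymptotic Kolmogorov series for the two-sided p-value. SPSS may calculate its significance value differently, so the figures need not match exactly. Both example datasets are invented.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normal(values):
    """Return (D, p): the one-sample K-S statistic against a normal distribution
    fitted with the sample mean and SD, and an asymptotic two-sided p-value."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # Largest gap between the empirical step function and the fitted normal curve
        d = max(d, (i + 1) / n - f, f - i / n)
    z = math.sqrt(n) * d
    # Asymptotic Kolmogorov distribution: p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical data: 30 evenly spaced normal quantiles, and a strongly right-skewed series.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)]
skewed = [float(2 ** i) for i in range(20)]

d_normal, p_normal = ks_normal(near_normal)
d_skewed, p_skewed = ks_normal(skewed)
print(f"near normal: D = {d_normal:.3f}, p = {p_normal:.3f}")
print(f"skewed:      D = {d_skewed:.3f}, p = {p_skewed:.3f}")
```

The reading is the same as in SPSS: a p-value below 0.05 (the skewed series here) means the parameter deviates significantly from a normal distribution, whereas a large p-value gives no evidence against normality.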

Now close SPSS for the time being (without saving) and go back to your original Excel spreadsheet.

•   Transform all environmental parameters which you believe should be transformed (obviously based on the table on the last page) using log (base 10). Remember to use “=log(…+1)” where you have 0 values, as log(0) is not defined. If you have negative numbers, add a value consistently to all data points so that the minimum value you get prior to transformation is “1”.
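
The transformation rules above can be sketched in Python as follows, as an optional illustration only (the function name and example values are invented). The shift is applied to every value of a parameter, so that a parameter with zeros gets +1 and a parameter with negative values is shifted until its minimum is 1.

```python
import math


def log10_transform(values):
    """Apply log10 to one parameter, shifting all values first where needed."""
    lowest = min(values)
    if lowest <= 0:
        # Zeros or negative values: shift every value so the minimum becomes 1.
        # For a minimum of exactly 0 this is the "+1" rule from the text.
        shift = 1 - lowest
    else:
        shift = 0
    return [math.log10(v + shift) for v in values]


with_zero = log10_transform([0, 9, 99])        # shifted by +1 before taking the log
with_negative = log10_transform([-4, 5, 95])   # shifted by +5 so the minimum is 1
all_positive = log10_transform([10, 100])      # no shift needed
print(with_zero, with_negative, all_positive)
```

Whichever tool you use, apply the same shift to all plots of a parameter, and note down which shift you used so that the transformed values can be interpreted later.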

•   Save the file, and re-run the Kolmogorov-Smirnov Test in SPSS with the transformed data.

Do all transformed parameters show a normal distribution? What can you do where this is not the case?

•   The next step is to THINK about the data with regard to degrees of freedom in your variables: Is there any set of variables which is inter-correlated – i.e. are there cases where one of the parameters results automatically from the value(s) of (an)other parameter(s)? If you have such a group of parameters, you NEED TO REDUCE this group by one parameter!!! (Normally, this will be the most “problematic” parameter within the group, e.g. the one not following a normal distribution...)

Which parameters should be deleted - if any ???

•   The next and final step in your data preparation is the analysis of correlations between different parameters. The basic idea is that you do not really want to include parameters which basically tell you (near-) identical information, but more importantly, a lot of the analyses we will be running with the data require the environmental parameters to be mathematically INDEPENDENT!

•   Reduce and clean the “Environmental variables” table in your “PONDSTAT.xls” or “SALTMARSHSTAT.xls” file so that it only includes the remaining original variables which did not need to be either omitted or transformed, and the transformed versions of the remaining variables (apart obviously from the ones deleted for other reasons).

•   Save the file and open it again in SPSS. Here, run a correlation analysis:

SPSS – Analyze – Correlate – Bivariate. Select all environmental parameters, and run the analysis for both Pearson and Spearman Correlation Coefficients by ticking the respective boxes (two-tailed, since you do not know in which direction correlations are to be expected). PLEASE, BE CAREFUL: YOU MIGHT INITIALLY NEED TO DELETE SOME OF THE PLOTS IF NOT ALL PARAMETERS WERE MEASURED EVERYWHERE IN SOME DATASETS (for future reference…). COPY YOUR ENVIRONMENTAL PARAMETERS INTO A SEPARATE SPREADSHEET TO DO THIS. IDEALLY, IF TWO PARAMETERS ARE STRONGLY CORRELATED, BUT ONLY ONE HAS DATA GAPS, SELECT THIS VARIABLE FOR DELETION, SO THAT YOU END UP WITH AN OPTIMIZED DATASET WITH FEW DATA GAPS.
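
To show what the two coefficients measure, here is a minimal Python sketch of the Pearson and Spearman correlation for every pair of parameters. It is an optional illustration with invented parameter names and values, it assumes complete data (no gaps) and at least some variation in every parameter, and it does not calculate the significance values that SPSS reports.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers.
    A parameter without any variation would cause a division by zero here."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical, complete data for five plots
env = {
    "Salinity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Conductivity": [1.0, 4.0, 9.0, 16.0, 25.0],
    "Elevation": [5.0, 3.0, 4.0, 1.0, 2.0],
}

results = {
    (a, b): (pearson(env[a], env[b]), spearman(env[a], env[b]))
    for a, b in combinations(env, 2)
}
for (a, b), (r, rho) in results.items():
    print(f"{a} vs {b}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

In this invented example, Salinity and Conductivity rise together but not in a straight line, so Spearman's coefficient is exactly 1 while Pearson's is slightly lower; this is the kind of difference between the two coefficients to look out for in your own output.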


