Data Analysis – Particulate Matter in Cities


PM2.5 readings are often included in air quality reports from environmental authorities and companies. PM2.5 refers to atmospheric particulate matter (PM) that have a diameter less than 2.5 micrometers. In other words, it’s used as a measure of pollution.

To verify PM2.5 Particulate Matter small particle pollution in China, measurements were taken hourly from 5 key cities over a period of several years (depending on city) resulting in a set of 52,854 records held on the UCI and Kaggle.
Our aim is to verify factors contributing to higher concentrations of this PM, using Linear Regression and PCA.

Naturally, we can only assess the environmental factors measured in this data, which does not include the amount of emissions from industry and transport etc. But we can assume those emmisions are generally consistant (although may well be subject to large increases or decreases).

Executive Summary

There is a big seasonal variation although overall, Beijing is most polluted, with Chengdu, Shenyang, Guangzhou and Shanghai in that order being less polluted. North West and South West winds are much more likely to reduce the PM2.5 level.

As Dewpoint increases, and to a lesser extent Pressure, so does the concentration of pollutant PM2.5.

The lower the precipitation, the higher the pollution.

The annual effect actually shows that PM2.5 is going down each subsequent year, by around 2.65 ug/m^3 per year.

Data Description

The time period for this data is between Jan 1st, 2010 to Dec 31st, 2015. Missing data are denoted as NA.

  • No: row number
  • year: year of data in this row
  • month: month of data in this row
  • day: day of data in this row
  • hour: hour of data in this row
  • season: season of data in this row
  • PM: PM2.5 concentration (ug/m^3)
  • DEWP: Dew Point (Celsius Degree) – the temperature at which condensation forms
  • TEMP: Temperature (Celsius Degree)
  • HUMI: Humidity (%)
  • PRES: Pressure (hPa)
  • cbwd: Combined wind direction
  • Iws: Cumulated wind speed (m/s)
  • precipitation: hourly precipitation (mm)
  • Iprec: Cumulated precipitation (mm)

Data Preparation

Records and measurements are fairly clean so we can read them and check structure.

# Read data from files
BeijingPM <- read.csv("Data/BeijingPM20100101_20151231.csv")
ChengduPM <- read.csv("Data/ChengduPM20100101_20151231.csv")
GuangzhouPM <- read.csv("Data/GuangzhouPM20100101_20151231.csv")
ShanghaiPM <- read.csv("Data/ShanghaiPM20100101_20151231.csv")
ShenyangPM <- read.csv("Data/ShenyangPM20100101_20151231.csv")
# Set core variables so we can merge datasets
sharedvars <- c('No', 'year', 'month', 'day', 'hour', 'season', 'PM_US.Post',
'DEWP', 'HUMI', 'PRES', 'TEMP', 'cbwd', 'Iws', 'precipitation', 'Iprec')
BeijingPM <- subset(BeijingPM, select = sharedvars)
ChengduPM <- subset(ChengduPM, select = sharedvars)
GuangzhouPM <- subset(GuangzhouPM, select = sharedvars)
ShanghaiPM <- subset(ShanghaiPM, select = sharedvars)
ShenyangPM <- subset(ShenyangPM, select = sharedvars)
# Add the City before merging
BeijingPM$City <- as.factor('Beijing')
ChengduPM$City <- as.factor('Chengdu')
GuangzhouPM$City <- as.factor('Guangzhou')
ShanghaiPM$City <- as.factor('Shanghai')
ShenyangPM$City <- as.factor('Shenyang')
# Merge them
PM25 <- rbind(BeijingPM,ChengduPM)
PM25 <- rbind(PM25,GuangzhouPM)
PM25 <- rbind(PM25,ShanghaiPM)
PM25 <- rbind(PM25,ShenyangPM)
# Check dataframe structure

## ‘data.frame’: 262920 obs. of 16 variables: ## $ No : int 1 2 3 4 5 6 7 8 9 10 … ## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 … ## $ month : int 1 1 1 1 1 1 1 1 1 1 … ## $ day : int 1 1 1 1 1 1 1 1 1 1 … ## $ hour : int 0 1 2 3 4 5 6 7 8 9 … ## $ season : int 4 4 4 4 4 4 4 4 4 4 … ## $ PM_US.Post : int NA NA NA NA NA NA NA NA NA NA … ## $ DEWP : num -21 -21 -21 -21 -20 -19 -19 -19 -19 -20 … ## $ HUMI : num 43 47 43 55 51 47 44 44 44 37 … ## $ PRES : num 1021 1020 1019 1019 1018 … ## $ TEMP : num -11 -12 -11 -14 -12 -10 -9 -9 -9 -8 … ## $ cbwd : Factor w/ 5 levels “cv”,”NE”,”NW”,..: 3 3 3 3 3 3 3 3 3 3 … ## $ Iws : num 1.79 4.92 6.71 9.84 12.97 … ## $ precipitation: num 0 0 0 0 0 0 0 0 0 0 … ## $ Iprec : num 0 0 0 0 0 0 0 0 0 0 … ## $ City : Factor w/ 5 levels “Beijing”,”Chengdu”,..: 1 1 1 1 1 1 1 1 1 1 …

Data Cleaning and Verification

We should check for outliers.

head(PM25[order(PM25$DEWP), ])
head(PM25[order(-PM25$DEWP), ])

DEWP and HUMI (Dewpoint and Humidity) have -9999 values. These outliers are limited to 4 records, so flag them as NA and then remove all NA records.

PM25$DEWP[PM25$DEWP == -9999] &lt;- NA
PM25 &lt;- na.omit(PM25)

Check summary

print(‘Mean and SD of PM2.5 concentration:’)
c(mean(PM25$PM_US.Post), sd(PM25$PM_US.Post))

Now we have reasonable looking quantiles for all measurements, although Precipitation (both hourly and cumulative) and cumulative Wind Speed are very skewed towards higher values. This also applies to our dependent variable PM2.5 (PM_US.Post) although this is more graduated.

Correlations and Preliminaries

Let’s check the measurements by pairs for correlations. # function required for diag panel histograms – from help(pairs) panel.hist <- function(x, …) { usr <- par(“usr”); on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) h <- hist(x, plot = FALSE) breaks <- h$breaks; nB <- length(breaks) y <- h$counts; y <- y/max(y) rect(breaks[-nB], 0, breaks[-1], y, col = “cyan”, …) } PM25.pairs <- subset(PM25, select = c(PM_US.Post,DEWP,HUMI,PRES,TEMP,Iws,precipitation,Iprec) ) pairs(PM25.pairs, panel=panel.smooth, diag.panel=panel.hist)

We can see that PM2.5 concentration increases steadily with both Dewpoint and Humidity, indicating positive correlation. Pressure and Temperature show peaked correlations so their lower and higher measurements reflect corresponding decrease and increase in PM2.5. Precipitation shows the less there is, the higher the PM2.5, especially when there is no precipitation (ie no rain). Similar applies to the cumulative Wind Speed and cumulative Precipitation. The latter is collinear to the hourly Wind Speed so will be removed for this analysis.

We can use a generalised additive model to apply non-parametric smoothers which reveal very narrow confidence intervals and slight curvatures in Dewpoint and Humidity. par(mfrow=c(2,2)) model.gam <- gam(PM_US.Post ~ s(DEWP)+s(HUMI)+s(PRES)+s(TEMP), data=PM25) plot(model.gam) # We can also use a tree model which shows the longer the branches, the greater the deviance explained by the model (the PM2.5 levels are shown at end leaf nodes). model.tree <- tree(PM_US.Post ~ year+day+hour+DEWP+HUMI+PRES+TEMP+cbwd+precipitation+City, data=PM25.sub) par(mfrow=c(1,1)) plot(model.tree) text(model.tree) # This classification regression tree shows that City incurs the greatest variance, followed by Dewpoint and then Year.


We can assume that the observations are all independant, especially becasue they are counts per ward and not individual personal cases.

Linear Regression

Linear Regression can now be run, keeping out the Row Number which would cause perfect separation. model <- lm(PM_US.Post ~ year+day+hour+DEWP+HUMI+PRES+TEMP+cbwd+precipitation+City, data=PM25) summary(model) CV(model)

It looks like the measurements are all contribute a significant effect to PM2.5 concentration.

However, the Multiple R-Squared is low, showing that only 21.7% of PM2.5 variation is explained by this model.

Thinking about the dataset, we know it’s effectively a time series spread over a period of years, yet at this stage we are not including any seasonal modeling of variations which may themselves cause the measurement levels to fluctuate. So unless we wish to perform seasonal adjustment (using ARIMA or other modelling techniques) it may be best to focus on just one month, say July. PM25.sub <- PM25[PM25$month==’7′, ] model <- lm(PM_US.Post ~ year+day+hour+DEWP+HUMI+PRES+TEMP+cbwd+precipitation+City, data=PM25.sub) summary(model) CV(model)

Much better – we have almost doubled the Multiple R-Squared explanation of variation to 44% and also enabled the model to focus on more exclusive significance rather than just considering all measurements as previously. We also have a much lower AIC score of 91,200 instead of previous 1,327,000, signifying a better fit of model by penalised log-likelihood.

This all makes sense, because we are now examining dependencies within July across any of 5 cities, and within one month, there will be much less seasonal fluctuation.

So now Temperature is not so significant, nor the SE Windspeed. The Hour of the day is also less significant than before. But as Dewpoint increases, and to a lesser extent Pressure, so does the concentration of pollutant PM2.5.

Also, the lower the precipitation, the higher the pollution.

We can see the cumulative Wind Speed directions have different influence – North West and South West winds are much more likely to reduce the PM2.5 level.

Also note that compared to the base city of Beijing, the other 4 cities have much lower concentrations of PM2.5.

We still have the year significant, so can see annual effect (within any July) – it actually shows that PM2.5 is going down each subsequent year, by around 2.65 ug/m^3.

Let’s check our data assumptions of normality so we can be more confident of these observations. par(mfrow=c(2,2)) plot(model) par(mfrow=c(1,1))

This is not great – the errors on the Q-Q Plot are not completely normal and the error residuals increase against the fitted values, which shows that the variance is not constant.

It looks like this is due to the PM2.5 variable of interest not having a normal distribution, which we could rectify by tranforming it with a logarithm (after verifying it has no negative values, as shown in the quantile summary above). If applied, we check for distribution of this log of the PM2.5. pairs(~log(PM_US.Post) +DEWP+HUMI+PRES+TEMP, data=PM25, panel=panel.smooth, diag.panel=panel.hist)

This has indeed normalised the distribution of the PM2.5. Now let’s check the residual errors again. model <- lm(log(PM_US.Post) ~ year+day+hour+DEWP+HUMI+PRES+TEMP+cbwd+precipitation+City, data=PM25.sub) par(mfrow=c(2,2)) plot(model) par(mfrow=c(1,1))

These are more normally distributed errors, so we will check results of our linear regression on this model, remembering that the PM2.5 is now represented as a log. summary(model)

The R-Squared explanation of variability has gone down from 44% to 39% which is not ideal, although all the other predictors still have their similar weightings. As our original model has the higher R-Squared coverage, we will stick with it. However especially due to the residual error non normality it would be best to apply further transformations to the predictors in various combinations – although there are many combinations particularly with more than a few variables like we have here. Bootstrapping could be used as a non-parametric fit of predictors and their residuals.

Principal Component Analysis

To reduce the number of variables down into a smaller number of components which still represent the variability and correlations between the original variables, we can use Principal Component Analysis (PCA). This is very similar and often confused with Exploratory Factor Analysis (EFA).

First we need a numerical matrix of our dataset, then we can create a correlation matrix where we can see the weight of positive or negative correspondences between each of the variables. PM25.sub2 <- subset(PM25.sub, select=-c(No, cbwd, City, year, month, day, hour, season)) print(pca.cor <- cor(PM25.sub2))

It’s worth viewing the covariance matrix too. print(pca.cov <- cov(PM25.sub2))

Neither are needed directly for the function, but helps us understand the loadings calculated through eigenvalue vectors which are used to create the principal components. pca <- princomp(PM25.sub2) pca$loadings

It’s helpful to view a Screeplot to see the proportions of components which describe the data – it’s usually best (depending on number of original variables) to have 3 or 4 components which describe most of it. It should be a gradual decrease in each component, like a scree face up to a mountain – and not a cliff with everything in the first component (which indicates perfect separation, perhaps a row identifier left in the dataset). screeplot(pca)

Our screeplot looks fine, and we can double check the coverage. summary(pca)

It’s showing that Components 1 to 3 (ie cumulatively) describe 97.8% of the dataset, which is good. We could then use these components to fit models to our data, including linear regression. This makes computation much more efficient and can be easier to handle with less variables, although it can make observation and explanations more difficult unless the components clearly represent specific aspects of the dataset cases.


Our basic model explained 44% of the PM2.5 concentration variability, although this was reduced to 39% when the outcome PM2.5 was tranformed into a log, which reduced the residual errors. We accept that further transformations can be made across the combinations to keep the residual errors low whilst improving the overall fit.

We did see already that there was a big seasonal variation although overall, Beijing was most polluted, with Chengdu, Shenyang, Guangzhou and Shanghai in that order being less polluted. North West and South West winds are much more likely to reduce the PM2.5 level.

As Dewpoint increases, and to a lesser extent Pressure, so does the concentration of pollutant PM2.5.

The lower the precipitation, the higher the pollution.

The annual effect actually shows that PM2.5 is going down each subsequent year, by around 2.65 ug/m^3 per year.

We also used PCA to reduce the number of variables into components which could also be used to model, taking into account the residual errors as above.


This website stores cookies on your computer. Cookie Policy