Modeling Solar Radiation in Hanford, California


Abstract

This project focuses on solar radiation in Hanford, CA. Using the Box-Jenkins method, the time-series relationship between time and solar radiation is analyzed, and a time-series model is built to forecast solar radiation in Hanford, CA.

Introduction

Solar energy is a renewable energy source: it refers to the sun's thermal radiation, most familiar to us as sunlight, and in modern times it is widely used for power generation. Since life first appeared on Earth, organisms have survived mainly on the thermal radiation provided by the sun. As non-renewable sources such as fossil fuels decline, solar energy has become an important part of the energy used by human beings and continues to be developed.

The dataset is from the Earth System Research Laboratory, Global Monitoring Laboratory. It contains the solar radiation data of Hanford, CA from February 1, 2002 to the present.

Source:
Hanford, California, United States (HNX) Continuous in-situ measurements of solar radiation. Earth System Research Laboratory, Global Monitoring Laboratory
URL: https://gml.noaa.gov/aftp/data/radiation/solrad/hnx/

Variables:

| var name | var type | var description | var used |
|----------|----------|-----------------|----------|
| year | integer | year, i.e., 2002 | x |
| jday | integer | Julian day (1 through 365 [or 366]) | |
| month | integer | number of the month (1-12) | x |
| day | integer | day of the month (1-31) | |
| hour | integer | hour of the day (0-23) | |
| min | integer | minute of the hour (0-59) | |
| dt | real | decimal time (hour.decimalminutes), e.g., 23.5 = 2330 | |
| zen | real | solar zenith angle (degrees) | |
| dw_psp | real | downwelling global solar (Watts m^-2) | x |
| direct | real | direct solar (Watts m^-2) | |
| diffuse | real | downwelling diffuse solar (Watts m^-2) | |
| uvb | real | global UVB (milliWatts m^-2) | |
| uvb_temp | real | UVB temperature (C) – 25 deg. C is normal | |
| qc_dwpsp | integer | quality control parameter for downwelling global solar (0=good) | |
| qc_direct | integer | quality control parameter for direct solar (0=good) | |
| qc_diffuse | integer | quality control parameter for diffuse solar (0=good) | |
| qc_uvb | integer | quality control parameter for UVB irradiance (0=good) | |
| qc_uvb_temp | integer | quality control parameter for the UVB instrument temperature (0=good) | |
| std_dw_psp | real | standard deviation of the 1-sec. samples for global solar (dw_psp) | |
| std_direct | real | standard deviation of the 1-sec. samples for direct solar | |
| std_diffuse | real | standard deviation of the 1-sec. samples for diffuse solar | |
| std_uvb | real | standard deviation of the 1-sec. samples for uvb | |

This project focuses on only three variables (year, month, dw_psp), so the raw data are preprocessed into monthly data. The data are then split into two sets, a training set (2003.01-2020.12) and a testing set (2021.01-2021.12), for model validation.

This project strictly follows the Box-Jenkins modelling method.
During model identification, the data are first differenced an appropriate number of times, guided by the Augmented Dickey-Fuller test. By plotting the ACF and PACF of the differenced data, about 72 candidate models are nominated.
During model estimation, the 72 nominated models are compared using the AICc criterion, and only the best three models are carried forward to diagnostics.
During model diagnostics, the data are fit with the three best nominated models. To choose the best model for forecasting, the Ljung-Box and Shapiro-Wilk tests are performed, and the ACF, PACF, histogram, and Q-Q plot of the residuals are examined to check their independence and normality. During model forecasting, the training data are fit with the best model to predict the solar radiation values in 2021. To validate the model, the predicted values and the true values are plotted together, along with the confidence interval of the predictions.

Read Raw Data & Preprocessing

setwd("hnx_solar")
# read_list <- list.files(pattern = "\\.dat")

# merge <- read.table(file = "hnx03001.dat", skip = 2, header = FALSE)

# colnames(merge) = c("year","jday","month","day","hour","min",
#                     "dt","zen","dw_psp","qc_dwpsp","direct","qc_direct",
#                     "diffuse","qc_diffuse","uvb","qc_uvb",
#                     "uvb_temp","qc_uvb_temp","std_dw_psp",
#                     "std_direct","std_diffuse","std_uvb")

# merge[merge == -9999.9] <- NA

# merge <- data.frame(merge["year"], merge["month"], merge["day"], merge["dw_psp"])

# for (file in read_list[-which(read_list == "hnx03001.dat")]) {
#     print(paste("processing:", file))
#     new = read.table(file, skip = 2)
#     colnames(new) = c("year","jday","month","day","hour","min",
#                       "dt","zen","dw_psp","qc_dwpsp","direct","qc_direct",
#                       "diffuse","qc_diffuse","uvb","qc_uvb",
#                       "uvb_temp","qc_uvb_temp","std_dw_psp",
#                       "std_direct","std_diffuse","std_uvb")

#     new[new == -9999.9] <- NA
#     new <- data.frame(new["year"], new["month"], new["day"], new["dw_psp"])

#     merge = rbind(merge, new)
# }

# data = merge %>% group_by(year, month) %>% 
#                 summarize(monthly_dw_psp = mean(dw_psp, na.rm = TRUE))

# write.csv(data, file = "hnx_solar_monthly.csv", col.names = TRUE)

# read from merged monthly data
data = read.csv("hnx_solar_monthly.csv")$monthly_dw_psp
data.train = data[1:216]
data.test = data[217:228]

Since reading a large number of raw files takes a long time, the data have been merged into a single CSV file containing only the needed variables for future use.

The raw files also contain bad readings flagged as -9999.9, so those values are replaced with NA while merging. Since only monthly data are of interest, the bad readings in the downwelling global solar variable are ignored and each monthly mean is computed from the remaining valid values.
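As a small illustration of this cleaning step (the three numbers below are made up, not taken from the dataset), a flagged reading becomes NA and is then skipped by the monthly mean:

# illustration only: a -9999.9 flag becomes NA, and mean(..., na.rm = TRUE) skips it
x <- c(310.2, -9999.9, 295.7)
x[x == -9999.9] <- NA
mean(x, na.rm = TRUE)   # mean of the two valid readings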

Finally, the data are split into a training set (2003.01-2020.12) for model fitting and a test set (2021.01-2021.12) for validation.

Model Identification:

Analyzing the Original Data

To begin the analysis, we first plot the original time series.

# plot the original data
ts.plot(data.train, ylab = "Monthly Mean Downwelling Global Solar (Watts m^-2)", 
                 xlab = "Months(start on 2003.01)")
title("Monthly Mean Sollar Radiation in Hanford, CA")

A clear seasonal pattern is visible in the plot.
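One quick way to confirm the annual cycle (an extra check, not one of the original chunks) is to give the training series an explicit 12-month frequency and look at a seasonal subseries plot:

# view the training series with an explicit 12-month frequency
train.ts = ts(data.train, start = c(2003, 1), frequency = 12)
monthplot(train.ts, ylab = "Monthly Mean dw_psp", main = "Seasonal Subseries by Calendar Month")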

Next, we plot the ACF and PACF plot.

par(mfrow=c(1,2))
acf(data.train, main = "ACF Plot")
pacf(data.train, main = "PACF Plot")

We can see that the ACF keeps oscillating while the PACF dies out.
Since the ACF does not decay, the original data is not suitable for modelling directly.
Therefore, we need to difference the original data.

Dickey-Fuller Test

To determine the appropriate number of times to difference the original data, we perform the Augmented Dickey-Fuller test.

adf.test(data)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag   ADF p.value
## [1,]   0 -1.67  0.0926
## [2,]   1 -4.37  0.0100
## [3,]   2 -3.78  0.0100
## [4,]   3 -2.17  0.0306
## [5,]   4 -1.24  0.2363
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0  -4.27    0.01
## [2,]   1 -14.45    0.01
## [3,]   2 -20.05    0.01
## [4,]   3 -20.58    0.01
## [5,]   4 -16.97    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0  -4.25    0.01
## [2,]   1 -14.42    0.01
## [3,]   2 -20.04    0.01
## [4,]   3 -20.57    0.01
## [5,]   4 -17.04    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

From the test results, the p-values indicate that differencing the data once or twice is appropriate.
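As a quick cross-check (an extra step, not one of the original chunks), the ADF test can also be re-run on the differenced training data to confirm that a single difference already gives a stationary series:

# re-run the Augmented Dickey-Fuller test on the first-differenced training data
adf.test(diff(data.train))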

Take First Difference

We first try taking one difference of the original data.

# take first difference
data.diff = diff(data.train)
ts.plot(data.diff)
title("TS Plot After First Difference")

Next, we plot the ACF and PACF plot of the first difference data.

par(mfrow=c(1,2))
acf(data.diff, main = "ACF Plot After First Difference")
pacf(data.diff, main = "PACF Plot After First Difference")

The first-differenced data still show a seasonal pattern: the ACF still oscillates and the PACF still dies out.
Therefore, we will try taking another difference of the data.

Take Second Difference

We now take a second difference, i.e., we difference the first-differenced data once more.

# take second difference
data.diff2 = diff(data.diff)
ts.plot(data.diff2)
title("TS Plot After Second Difference")

Next, we plot the ACF and PACF plot of the second difference data.

par(mfrow=c(1,2))
acf(data.diff2, main = "ACF Plot After Second Difference")
pacf(data.diff2, main = "PACF Plot After Second Difference")

From the time series plot, the twice-differenced data look closer to white noise, and the PACF still shows a dying-out pattern; both are good for modelling.
However, the ACF plot still shows an oscillating pattern, which is not good for modelling.
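Because the remaining oscillation in the ACF follows the annual cycle, a seasonal difference at lag 12 is the natural next step; this corresponds to the D = 1, S = 12 used in the models below. A quick sketch (not one of the original chunks):

# seasonal (lag-12) difference on top of the first difference, i.e. d = 1, D = 1, S = 12
data.sdiff = diff(data.diff, lag = 12)
par(mfrow=c(1,2))
acf(data.sdiff, main = "ACF After First + Seasonal Difference")
pacf(data.sdiff, main = "PACF After First + Seasonal Difference")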

Model Nominees

Based on the analysis above, a SARIMA model is appropriate for the data because of its seasonal pattern.

Now, let’s decide the parameters for the SARIMA model.
Since the data show seasonality with a period of one year, S should be 12.
Since q and Q cannot be determined from the oscillating ACF plots, we will try a range of values (1-6) during model estimation.
Since all PACF plots show an exponential-decay pattern, P should be 1.

Since we took both a first and a second difference, d can be either 1 or 2.
For d = 1:
Since the PACF plot shows significant spikes up to lag 7, p should be 7.
For d = 2:
Since the PACF plot shows significant spikes up to lag 10, p should be 10.


Therefore, the nominees are:
For d = 1, p = 7:
\[SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=1)_{S=12}\] \[SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=3)_{S=12}\] \[\vdots\] \[SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=6)_{S=12}\]

\[SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=1)_{S=12}\] \[SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=3)_{S=12}\] \[\vdots\] \[SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=6)_{S=12}\] \[SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=1)_{S=12}\] \[\vdots\] \[\vdots\] \[SARIMA(p=7,d=1,q=6)(P=1,D=1,Q=6)_{S=12}\]

For d = 2, p = 10: \[SARIMA(p=10,d=2,q=1)(P=1,D=1,Q=1)_{S=12}\] \[SARIMA(p=10,d=2,q=1)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=10,d=2,q=1)(P=1,D=1,Q=3)_{S=12}\] \[\vdots\] \[SARIMA(p=10,d=2,q=1)(P=1,D=1,Q=6)_{S=12}\]

\[SARIMA(p=10,d=2,q=2)(P=1,D=1,Q=1)_{S=12}\] \[SARIMA(p=10,d=2,q=2)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=10,d=2,q=2)(P=1,D=1,Q=3)_{S=12}\] \[\vdots\] \[SARIMA(p=10,d=2,q=2)(P=1,D=1,Q=6)_{S=12}\] \[SARIMA(p=10,d=2,q=3)(P=1,D=1,Q=1)_{S=12}\] \[\vdots\] \[\vdots\] \[SARIMA(p=10,d=2,q=6)(P=1,D=1,Q=6)_{S=12}\]
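The candidate orders above can also be enumerated programmatically; a minimal sketch (the candidates data frame is only illustrative and is not used by the estimation loops below):

# enumerate all nominated (p,d,q)(P,D,Q)_S combinations: 2 x 6 x 6 = 72 candidates
candidates = rbind(
    expand.grid(p = 7,  d = 1, q = 1:6, P = 1, D = 1, Q = 1:6, S = 12),
    expand.grid(p = 10, d = 2, q = 1:6, P = 1, D = 1, Q = 1:6, S = 12))
nrow(candidates)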

Model Estimation:

Now, for model estimation, we use the AICc value as a criterion to check which model fits best.
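For reference, AICc adds a small-sample correction to AIC,

\[AICc = AIC + \frac{2k(k+1)}{n-k-1},\]

where \(k\) is the number of estimated parameters and \(n\) is the effective sample size. Note that sarima() reports AIC, AICc, and BIC on a per-observation scale (the values near 8.5 shown later match the arima aic of about 1734 divided by the 203 observations left after d = 1, D = 1, S = 12 differencing), so only relative comparisons between models matter.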

# aiccs <- matrix(NA, nr =6, nc = 6)
# dimnames(aiccs) = list(q = c(1:6), Q = c(1:6))

For d = 1, p = 7:

setwd("hnx_solar")

# Model Estimation for d=1, p=7
# for (q in 1:6) {
#     for (Q in 1:6) {
#         try({
#             aiccs[q,Q] = sarima(xdata = data.train, 
#                             p = 7, d = 1, q = q, 
#                             P = 1, D = 1, Q = Q, S = 12)$AICc
#             print(paste("d1p7: {", " q: ", q, "Q: ", Q, "}"))
#         })
#     }
# }
# write.csv(aiccs, "d1p7_aicc.csv", col.names = TRUE, row.names = TRUE)

read.csv("d1p7_aicc.csv", header = TRUE)

For d = 2, p = 10:

setwd("hnx_solar")

# Model Estimation for d=2, p=10
# for (q in c(1:6)) {
#     for (Q in c(1:6)) {
#         try({
#             aiccs[q,Q] = sarima(xdata = data.train, 
#                             p = 10, d = 2, q = q, 
#                             P = 1, D = 1, Q = Q, S = 12)$AICc
#             print(paste("d2p10: {", " q: ", q, "Q: ", Q, "}"))
#         })
#     }
# }
# write.csv(aiccs, "d2p10_aicc.csv", col.names = TRUE, row.names = TRUE)

read.csv("d2p10_aicc.csv", header = TRUE)

Based on the AICc values, we find that the following three models fit the data best.

\[SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=2)_{S=12}\] \[SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=2)_{S=12}\]

Model Diagnostics:

\(SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=2)_{S=12}\):

# fit the first model
fit.1 = sarima(data.train, p=7, d=1, q=1, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.1
## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##         ar1     ar2    ar3     ar4     ar5     ar6     ar7     ma1   sar1
##       0.232  -0.042  0.015  -0.032  -0.066  -0.068  -0.134  -0.862  0.729
## s.e.  0.089   0.081  0.077   0.077   0.080   0.083   0.081   0.067  0.107
##         sma1   sma2
##       -1.922  1.000
## s.e.   0.298  0.309
## 
## sigma^2 estimated as 209:  log likelihood = -855,  aic = 1734
## 
## $degrees_of_freedom
## [1] 192
## 
## $ttable
##      Estimate     SE  t.value p.value
## ar1    0.2317 0.0888   2.6081  0.0098
## ar2   -0.0425 0.0814  -0.5217  0.6025
## ar3    0.0155 0.0772   0.2007  0.8411
## ar4   -0.0317 0.0768  -0.4135  0.6797
## ar5   -0.0664 0.0805  -0.8244  0.4107
## ar6   -0.0679 0.0834  -0.8141  0.4166
## ar7   -0.1345 0.0814  -1.6523  0.1001
## ma1   -0.8622 0.0674 -12.7836  0.0000
## sar1   0.7294 0.1065   6.8473  0.0000
## sma1  -1.9217 0.2984  -6.4390  0.0000
## sma2   1.0000 0.3094   3.2320  0.0014
## 
## $AIC
## [1] 8.542
## 
## $AICc
## [1] 8.548
## 
## $BIC
## [1] 8.738
# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.1$fit), type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  resid(fit.1$fit)
## X-squared = 0.032, df = 1, p-value = 0.9
shapiro.test(resid(fit.1$fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit.1$fit)
## W = 0.98, p-value = 0.01
ts.plot(resid(fit.1$fit), main = "Fitted Residuals")

par(mfrow=c(2,2))
acf(resid(fit.1$fit), main="ACF")
pacf(resid(fit.1$fit), main="PACF")
hist(resid(fit.1$fit), main="Histogram")
qqnorm(resid(fit.1$fit))
qqline(resid(fit.1$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)

We can see that for the \(SARIMA(p=7,d=1,q=1)(P=1,D=1,Q=2)_{S=12}\) model, the residuals pass the Ljung-Box independence test (p-value = 0.9), and the Shapiro-Wilk normality test p-value (0.01) is borderline but acceptable at the 1% level. Therefore, we keep this model as a strong backup.
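Note that Box.test() above uses its default lag = 1. As an additional, stricter check (not part of the original chunks), the Ljung-Box test can be run over more lags while discounting the degrees of freedom for the estimated coefficients; lag = 24 is an arbitrary choice here:

# Ljung-Box over 24 lags, adjusting df for the 11 estimated coefficients of fit.1
Box.test(resid(fit.1$fit), lag = 24, type = "Ljung-Box", fitdf = 11)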

\(SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=2)_{S=12}\):

# fit the second model
fit.2 = sarima(data.train, p=7, d=1, q=2, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.2
## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##         ar1     ar2    ar3     ar4     ar5     ar6     ar7     ma1    ma2
##       0.731  -0.227  0.000  -0.074  -0.078  -0.055  -0.131  -1.362  0.497
## s.e.  0.215   0.123  0.093   0.093   0.093   0.092   0.094   0.203  0.194
##        sar1    sma1   sma2
##       0.679  -1.914  0.999
## s.e.  0.114   0.411  0.427
## 
## sigma^2 estimated as 206:  log likelihood = -853.9,  aic = 1734
## 
## $degrees_of_freedom
## [1] 191
## 
## $ttable
##      Estimate     SE t.value p.value
## ar1    0.7308 0.2145  3.4062  0.0008
## ar2   -0.2269 0.1230 -1.8447  0.0666
## ar3    0.0002 0.0929  0.0020  0.9984
## ar4   -0.0744 0.0935 -0.7963  0.4268
## ar5   -0.0781 0.0933 -0.8371  0.4036
## ar6   -0.0546 0.0925 -0.5908  0.5553
## ar7   -0.1308 0.0939 -1.3931  0.1652
## ma1   -1.3619 0.2027 -6.7177  0.0000
## ma2    0.4975 0.1938  2.5669  0.0110
## sar1   0.6792 0.1141  5.9524  0.0000
## sma1  -1.9144 0.4106 -4.6630  0.0000
## sma2   0.9986 0.4273  2.3370  0.0205
## 
## $AIC
## [1] 8.54
## 
## $AICc
## [1] 8.549
## 
## $BIC
## [1] 8.753
# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.2$fit), type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  resid(fit.2$fit)
## X-squared = 0.0083, df = 1, p-value = 0.9
shapiro.test(resid(fit.2$fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit.2$fit)
## W = 0.99, p-value = 0.03
ts.plot(resid(fit.2$fit), main = "Fitted Residuals")

par(mfrow=c(2,2))
acf(resid(fit.2$fit), main="ACF")
pacf(resid(fit.2$fit), main="PACF")
hist(resid(fit.2$fit), main="Histogram")
qqnorm(resid(fit.2$fit))
qqline(resid(fit.2$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)

We can see that for the \(SARIMA(p=7,d=1,q=2)(P=1,D=1,Q=2)_{S=12}\) model, as for the previous model, the residuals pass the Ljung-Box independence test (p-value = 0.9), and the Shapiro-Wilk p-value (0.03) is borderline but acceptable. Therefore, we keep this model as a strong backup as well.

\(SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=2)_{S=12}\):

# fit the third model
fit.3 = sarima(data.train, p=7, d=1, q=3, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.3
## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##         ar1     ar2    ar3     ar4     ar5     ar6     ar7     ma1    ma2
##       1.747  -1.270  0.296  -0.085  -0.003  -0.022  -0.014  -2.388  2.194
## s.e.  0.194   0.238  0.182   0.170   0.171   0.150   0.090   0.184  0.307
##          ma3   sar1    sma1   sma2
##       -0.758  0.636  -1.911  1.000
## s.e.   0.149  0.105   0.243  0.253
## 
## sigma^2 estimated as 202:  log likelihood = -852.2,  aic = 1732
## 
## $degrees_of_freedom
## [1] 190
## 
## $ttable
##      Estimate     SE  t.value p.value
## ar1    1.7469 0.1943   8.9912  0.0000
## ar2   -1.2701 0.2378  -5.3405  0.0000
## ar3    0.2965 0.1816   1.6325  0.1042
## ar4   -0.0852 0.1696  -0.5025  0.6159
## ar5   -0.0030 0.1709  -0.0178  0.9858
## ar6   -0.0222 0.1501  -0.1479  0.8826
## ar7   -0.0137 0.0904  -0.1516  0.8796
## ma1   -2.3879 0.1844 -12.9481  0.0000
## ma2    2.1938 0.3066   7.1550  0.0000
## ma3   -0.7582 0.1490  -5.0886  0.0000
## sar1   0.6364 0.1048   6.0696  0.0000
## sma1  -1.9111 0.2428  -7.8705  0.0000
## sma2   0.9999 0.2527   3.9563  0.0001
## 
## $AIC
## [1] 8.534
## 
## $AICc
## [1] 8.543
## 
## $BIC
## [1] 8.762
# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.3$fit), type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  resid(fit.3$fit)
## X-squared = 0.00014, df = 1, p-value = 1
shapiro.test(resid(fit.3$fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit.3$fit)
## W = 0.99, p-value = 0.03
ts.plot(resid(fit.3$fit), main = "Fitted Residuals")

par(mfrow=c(2,2))
acf(resid(fit.3$fit), main="ACF")
pacf(resid(fit.3$fit), main="PACF")
hist(resid(fit.3$fit), main="Histogram")
qqnorm(resid(fit.3$fit))
qqline(resid(fit.3$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)

We can see that for the \(SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=2)_{S=12}\) model, as for the previous models, the residuals pass the Ljung-Box independence test (p-value close to 1), and the Shapiro-Wilk p-value (0.03) is borderline but acceptable. Therefore, we keep this model as a backup as well.

The Model for Forecasting:

Since all three models pass the independence and normality checks, we use the model with the best test statistics. Therefore, we use \(SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=2)_{S=12}\) for forecasting.
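For completeness, the AICc values of the three diagnosed models can be compared side by side (this simply collects the values already reported above):

# compare the three fitted models' AICc values directly
c(fit.1 = fit.1$AICc, fit.2 = fit.2$AICc, fit.3 = fit.3$AICc)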

Model Forecast:

# model forecast
data.pred = predict(fit.3$fit, n.ahead=12)
data.frame(pred = data.pred$pred, true = data.test)
# plot the forecasted values
ts.plot(data.train, ylab = "Monthly Mean Downwelling Global Solar (Watts m^-2)",
                    xlab = "Months(start on 2003.01)",xlim=c(192,228))
points(ts(data.test,start=c(217,1)), cex=0.8, pch=1, col="red")
lines(ts(data.test,start=c(217,1)), cex=0.8, pch=1, col="red")
points(data.pred$pred, cex=0.8, col="blue")
lines(data.pred$pred, cex=0.8, col="blue")
lines(data.pred$pred+1.96*data.pred$se, lty=2, col="blue")
lines(data.pred$pred-1.96*data.pred$se, lty=2, col="blue")

legend(191,350,legend=c("Predicted Value", "True Value"),col=c("blue","red"),lty=1,cex=0.8)
title("Monthly Mean Sollar Radiation in Hanford, CA")

Conclusion

We use the year 2021 as our test set and predict it with our model. We can see that the \(SARIMA(p=7,d=1,q=3)(P=1,D=1,Q=2)_{S=12}\) model fits very well: all the true values lie within the prediction confidence interval. Therefore, this model is a valid model for predicting the monthly mean solar radiation in Hanford, CA.

References

Hanford, California, United States (HNX) Continuous in-situ measurements of solar radiation. Earth System Research Laboratory, Global Monitoring Laboratory. URL: https://gml.noaa.gov/aftp/data/radiation/solrad/hnx/
Rob J Hyndman, Box-Jenkins modelling. URL: http://robjhyndman.com/papers/BoxJenkins.pdf

Appendix

# knit options
knitr::opts_chunk$set(fig.width=8, fig.height=6, message = F, warning = F)
options(digits = 4)

# packages
library(dplyr)
library(aTSA)
library(astsa)
setwd("hnx_solar")
# read_list <- list.files(pattern = "\\.dat")

# merge <- read.table(file = "hnx03001.dat", skip = 2, header = FALSE)

# colnames(merge) = c("year","jday","month","day","hour","min",
#                     "dt","zen","dw_psp","qc_dwpsp","direct","qc_direct",
#                     "diffuse","qc_diffuse","uvb","qc_uvb",
#                     "uvb_temp","qc_uvb_temp","std_dw_psp",
#                     "std_direct","std_diffuse","std_uvb")

# merge[merge == -9999.9] <- NA

# merge <- data.frame(merge["year"], merge["month"], merge["day"], merge["dw_psp"])

# for (file in read_list[-which(read_list == "hnx03001.dat")]) {
#     print(paste("processing:", file))
#     new = read.table(file, skip = 2)
#     colnames(new) = c("year","jday","month","day","hour","min",
#                       "dt","zen","dw_psp","qc_dwpsp","direct","qc_direct",
#                       "diffuse","qc_diffuse","uvb","qc_uvb",
#                       "uvb_temp","qc_uvb_temp","std_dw_psp",
#                       "std_direct","std_diffuse","std_uvb")

#     new[new == -9999.9] <- NA
#     new <- data.frame(new["year"], new["month"], new["day"], new["dw_psp"])

#     merge = rbind(merge, new)
# }

# data = merge %>% group_by(year, month) %>% 
#                 summarize(monthly_dw_psp = mean(dw_psp, na.rm = TRUE))

# write.csv(data, file = "hnx_solar_monthly.csv", col.names = TRUE)

# read from merged monthly data
data = read.csv("hnx_solar_monthly.csv")$monthly_dw_psp
data.train = data[1:216]
data.test = data[217:228]
# plot the original data
ts.plot(data.train, ylab = "Monthly Mean Downwelling Global Solar (Watts m^-2)", 
                 xlab = "Months(start on 2003.01)")
title("Monthly Mean Sollar Radiation in Hanford, CA")
par(mfrow=c(1,2))
acf(data.train, main = "ACF Plot")
pacf(data.train, main = "PACF Plot")
adf.test(data)
# take first difference
data.diff = diff(data.train)
ts.plot(data.diff)
title("TS Plot After First Difference")
par(mfrow=c(1,2))
acf(data.diff, main = "ACF Plot After First Difference")
pacf(data.diff, main = "PACF Plot After First Difference")
# take second difference
data.diff2 = diff(data.diff)
ts.plot(data.diff2)
title("TS Plot After Second Difference")
par(mfrow=c(1,2))
acf(data.diff2, main = "ACF Plot After Second Difference")
pacf(data.diff2, main = "PACF Plot After Second Difference")
# aiccs <- matrix(NA, nr =6, nc = 6)
# dimnames(aiccs) = list(q = c(1:6), Q = c(1:6))
setwd("hnx_solar")

# Model Estimation for d=1, p=7
# for (q in 1:6) {
#     for (Q in 1:6) {
#         try({
#             aiccs[q,Q] = sarima(xdata = data.train, 
#                             p = 7, d = 1, q = q, 
#                             P = 1, D = 1, Q = Q, S = 12)$AICc
#             print(paste("d1p7: {", " q: ", q, "Q: ", Q, "}"))
#         })
#     }
# }
# write.csv(aiccs, "d1p7_aicc.csv", col.names = TRUE, row.names = TRUE)

read.csv("d1p7_aicc.csv", header = TRUE)
setwd("hnx_solar")

# Model Estimation for d=2, p=10
# for (q in c(1:6)) {
#     for (Q in c(1:6)) {
#         try({
#             aiccs[q,Q] = sarima(xdata = data.train, 
#                             p = 10, d = 2, q = q, 
#                             P = 1, D = 1, Q = Q, S = 12)$AICc
#             print(paste("d2p10: {", " q: ", q, "Q: ", Q, "}"))
#         })
#     }
# }
# write.csv(aiccs, "d2p10_aicc.csv", col.names = TRUE, row.names = TRUE)

read.csv("d2p10_aicc.csv", header = TRUE)
# fit the first model
fit.1 = sarima(data.train, p=7, d=1, q=1, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.1

# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.1$fit), type="Ljung-Box")
shapiro.test(resid(fit.1$fit))

ts.plot(resid(fit.1$fit), main = "Fitted Residuals")
par(mfrow=c(2,2))
acf(resid(fit.1$fit), main="ACF")
pacf(resid(fit.1$fit), main="PACF")
hist(resid(fit.1$fit), main="Histogram")
qqnorm(resid(fit.1$fit))
qqline(resid(fit.1$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)
# fit the second model
fit.2 = sarima(data.train, p=7, d=1, q=2, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.2

# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.2$fit), type="Ljung-Box")
shapiro.test(resid(fit.2$fit))

ts.plot(resid(fit.2$fit), main = "Fitted Residuals")
par(mfrow=c(2,2))
acf(resid(fit.2$fit), main="ACF")
pacf(resid(fit.2$fit), main="PACF")
hist(resid(fit.2$fit), main="Histogram")
qqnorm(resid(fit.2$fit))
qqline(resid(fit.2$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)
# fit the third model
fit.3 = sarima(data.train, p=7, d=1, q=3, P=1, D=1, Q=2, S=12, detail=FALSE)
fit.3

# Ljung-Box and Shapiro-Wilk test
Box.test(resid(fit.3$fit), type="Ljung-Box")
shapiro.test(resid(fit.3$fit))

ts.plot(resid(fit.3$fit), main = "Fitted Residuals")
par(mfrow=c(2,2))
acf(resid(fit.3$fit), main="ACF")
pacf(resid(fit.3$fit), main="PACF")
hist(resid(fit.3$fit), main="Histogram")
qqnorm(resid(fit.3$fit))
qqline(resid(fit.3$fit), col="blue")
title("Fitted Residuals Diagnostics", outer=TRUE)
# model forecast
data.pred = predict(fit.3$fit, n.ahead=12)
data.frame(pred = data.pred$pred, true = data.test)

# plot the forecasted values
ts.plot(data.train, ylab = "Monthly Mean Downwelling Global Solar (Watts m^-2)",
                    xlab = "Months(start on 2003.01)",xlim=c(192,228))
points(ts(data.test,start=c(217,1)), cex=0.8, pch=1, col="red")
lines(ts(data.test,start=c(217,1)), cex=0.8, pch=1, col="red")
points(data.pred$pred, cex=0.8, col="blue")
lines(data.pred$pred, cex=0.8, col="blue")
lines(data.pred$pred+1.96*data.pred$se, lty=2, col="blue")
lines(data.pred$pred-1.96*data.pred$se, lty=2, col="blue")

legend(191,350,legend=c("Predicted Value", "True Value"),col=c("blue","red"),lty=1,cex=0.8)
title("Monthly Mean Sollar Radiation in Hanford, CA")