Declaration of original work
Abstract
1 Introduction
2 Literature Review
3 Experimental Design
3.1 Dataset
3.2 K-Fold Cross Validation
3.3 Measuring Model Accuracy
4 Methods
4.1 LASSO
4.1.1 FISTA
4.2 Random Forest
4.3 Gradient Boosting
4.4 Extreme Gradient Boosting
4.5 LightGBM
4.5.1 Gradient-based One-sided Sampling
4.5.2 Exclusive Feature Bundling
4.6 Neural Networks
4.6.1 Adam Algorithm
4.6.2 Backpropagation
This issue is critical in our case, as several of the machine learning algorithms we use have a large number of hyper-parameters whose combinations would all need to be tested. For instance, Extreme Gradient Boosting has more than 10 hyper-parameters to tune, which makes it infeasible to run K-Fold Cross Validation over every combination. Therefore, we only tuned the parameters expected to have the greatest impact on model performance, leaving the others at the default values set by each package.
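As an illustration of this strategy, the sketch below (not the exact grid or values used in our experiments) tunes only a handful of influential hyper-parameters of Extreme Gradient Boosting with 5-fold cross validation, leaving everything else at its package default; it assumes the XGBoost scikit-learn wrapper and synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBRegressor

# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

# Tune only a few influential hyper-parameters; the rest keep their defaults.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```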
3.3 Measuring Model Accuracy
In this project, we used several machine learning algorithms with different sets of parameters to build a price predictor. In order to choose the best candidate for this task, we need metrics that assess how those models perform. Performance is quantified by how close the predicted value for a given observation is to the true value of that observation. For a regression problem, the most commonly used metric is the mean squared error (MSE) (Hastie 2017, p. 29), which is given by
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2 \tag{3.1}$$
where $\hat{f}(x_i)$ is the prediction produced by $\hat{f}$ for the $i$-th observation. The MSE will be small if a model generates accurate predictions and large otherwise. In general, the MSE is computed to optimise a model on the training dataset and is then used to evaluate that model's performance on the testing dataset. The MSE as defined above is not bounded to any fixed range. The smallest possible MSE is 0, achieved only by a model with perfect predictions, which is nearly impossible in practice. Therefore, by choosing the smallest MSE among our models we still do not know whether that model could serve as a practical tool for price suggestion. This is where the $R^2$ statistic comes in as an alternative measure.
The $R^2$ statistic shows the fraction of variance in the target that can be predicted using the features (James H. Stock 2020, p. 153). This metric always takes a value between 0 and 1: an $R^2$ near 0 indicates a model with poor accuracy, while an $R^2$ close to 1 indicates a model that predicts the target well.
The formula for this metric is given by

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{3.2}$$
where $\bar{y}$ is the mean of the target that we try to predict. Dividing both the numerator and the denominator by $n$,

$$R^2 = 1 - \frac{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

which, with regard to (3.1), can be written as

$$R^2 = 1 - \frac{\text{MSE of the model}}{\text{MSE of the mean of the data}} \tag{3.3}$$
As the MSE shrinks toward 0, the $R^2$ grows toward 1, so the $R^2$ can be interpreted as a rescaling of the MSE. This is why we chose the $R^2$ as the main metric for model selection: its bounded, intuitive scale is easier to interpret.
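To make the relationship between the two metrics concrete, the following sketch computes both quantities with NumPy on made-up numbers; scikit-learn's mean_squared_error and r2_score would give the same values.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, equation (3.1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """R^2 written as a rescaling of the MSE, equation (3.3)."""
    y_true = np.asarray(y_true, float)
    baseline = np.full_like(y_true, y_true.mean())   # the "predict the mean" model
    return 1.0 - mse(y_true, y_pred) / mse(y_true, baseline)

y_true = np.array([3.0, 2.5, 4.0, 5.5])    # illustrative values only
y_pred = np.array([2.8, 2.7, 4.2, 5.0])
print(round(mse(y_true, y_pred), 4), round(r_squared(y_true, y_pred), 4))
```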
Chapter 4
Methods
4.1 LASSO
Least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) is a regression analysis technique introduced to improve prediction accuracy and to perform feature selection for regression models. LASSO seeks the solution of the following problem:

$$\arg\min_{w \in \mathbb{R}^d} \left\{ \|Y - Xw\|_2^2 + \alpha \|w\|_1 \right\} \tag{4.1}$$

where $X \in \mathbb{R}^{N \times d}$ is the data matrix with $N$ records and $d$ features, $Y \in \mathbb{R}^N$ is the target vector, $w \in \mathbb{R}^d$ is the vector of weight parameters, and the subscripts 1 and 2 indicate the $\ell_1$ and $\ell_2$ norms respectively (Appendix A). The problem above is non-differentiable, so we cannot apply Gradient Descent, the common algorithm for regression models, to compute the LASSO solution. However, various methods have been developed to solve the LASSO problem, including Coordinate Descent (Tibshirani et al. 2010) and Least Angle Regression (Efron et al. 2004), which are implemented in popular machine learning packages such as Scikit-learn¹. Instead of using those pre-written implementations, we attempted to solve the LASSO problem ourselves with an alternative algorithm, FISTA (Beck & Teboulle 2009), implemented using the NumPy package (Oliphant 2006).

¹ LASSO User Guide in the Scikit-learn documentation
4.1.1 FISTA
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is an iterative algorithm based on proximal operators for solving non-differentiable convex optimisation problems. The general optimisation problem is

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \{ f(x) + g(x) \} \tag{4.2}$$

where:

• $g : \mathbb{R}^n \to \mathbb{R}$ is a continuous convex function, which is possibly non-smooth, i.e. non-differentiable;

• $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth convex function with Lipschitz continuous gradient $\nabla f$.

  – Lipschitz constant: a constant $L(f)$ such that $\|\nabla f(x) - \nabla f(y)\| \le L(f)\|x - y\|$ for all $x, y \in \mathbb{R}^n$; $L(f)$ is then called a Lipschitz constant of $\nabla f$.
FISTA can be applied to many problems of the form (4.2), and LASSO is among the best known. Hence, we can apply this algorithm to the following loss function of LASSO:

$$\text{Loss} = \min_{w} \left\{ \frac{1}{2}\|Xw - Y\|_2^2 + \alpha\|w\|_1 \right\}$$

This is a slightly modified version of (4.1): a factor of $\frac{1}{2}$ is added to the first term for mathematical convenience. We then cast the loss function into the form

$$\text{Loss} = \min_{w} \{ f(w) + g(w) \}$$

For this problem, our task is to find $\nabla f(w)$ and a Lipschitz constant $L(f)$ before running the algorithm. First, we compute the gradient.
We have $f(w) = \frac{1}{2}\|Xw - Y\|_2^2$. Applying the chain rule (Appendix B), the gradient with respect to the weights is

$$\nabla f(w) = X^T(Xw - Y)$$

Next we find the Lipschitz constant by bounding $\|\nabla f(a) - \nabla f(b)\|$. Expanding the argument and factorising the common term $X^TX$,

$$\|X^T(Xa - Y) - X^T(Xb - Y)\| = \|X^TX(a - b)\|$$

Applying the norm inequality $\|A(a - b)\| \le \|A\|\,\|a - b\|$ (Benning 2019), we obtain the Lipschitz constant $L(f) = \|X^TX\|$.
FISTA is a refined version of the Iterative Shrinkage-Thresholding Algorithm (ISTA); both methods seek the minimiser of the following proximal problem (Beck & Teboulle 2009):

$$P_L(v) = \arg\min_{w} \left\{ g(w) + \frac{L}{2}\left\| w - \Bigl(v - \frac{1}{L}\nabla f(v)\Bigr) \right\|^2 \right\}$$
For our problem, we substitute $g(w) = \lambda\|w\|_1$ (with $\lambda$ playing the role of $\alpha$ in (4.1)) and set $z = v - \frac{1}{L}\nabla f(v)$. The problem then becomes

$$P_L(z) = \arg\min_{w} \left\{ \lambda\|w\|_1 + \frac{L}{2}\|w - z\|^2 \right\}$$

With a few steps of calculus, we obtain the element-wise soft-thresholding result

$$S_{\lambda/L}(z) = \begin{cases} z - \frac{\lambda}{L}, & z > \frac{\lambda}{L} \\ 0, & |z| \le \frac{\lambda}{L} \\ z + \frac{\lambda}{L}, & z < -\frac{\lambda}{L} \end{cases} \tag{4.3}$$
The formula (4.3) can be written compactly as

$$S_{\tau}(z) = \operatorname{sign}(z)\,(|z| - \tau)_+, \qquad \tau = \frac{\lambda}{L},$$

where the "sign" function is defined as

$$\operatorname{sign}(x) = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$

and $(a)_+$ is defined as

$$(a)_+ = \begin{cases} 0, & a < 0 \\ a, & a \ge 0 \end{cases}$$
As a consequence, we now have all the ingredients needed to run FISTA. Since the Lipschitz constant is easy to compute in this case, we follow the constant-step-size version of the algorithm given in the original paper (Beck & Teboulle 2009). The algorithm proceeds as below.
Algorithm 1 FISTA with constant step size
Input: $L = L(f)$, a Lipschitz constant of $\nabla f$; $\lambda$: regularisation parameter
Initialise: $w_0 \in \mathbb{R}^n$, $v_1 = w_0$, $t_1 = 1$
Iterate:
for $k = 1, \ldots, K - 1$ do
    compute $z_k = v_k - \frac{1}{L}\nabla f(v_k)$
    compute $w_k = S_{\lambda/L}(z_k)$
    compute $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
    compute $v_{k+1} = w_k + \frac{t_k - 1}{t_{k+1}}(w_k - w_{k-1})$
end for
return $w_K$
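The sketch below is a minimal NumPy implementation of Algorithm 1, assuming the loss $\frac{1}{2}\|Xw - Y\|^2 + \lambda\|w\|_1$ above; the synthetic data, the value of $\lambda$ and the fixed iteration count are illustrative choices rather than the configuration used in our experiments.

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * max(|z| - tau, 0), equation (4.3), element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    """Algorithm 1 with constant step size for 0.5*||Xw - Y||^2 + lam*||w||_1."""
    L = np.linalg.norm(X.T @ X, 2)                 # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    v, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ v - Y)                   # grad f(v)
        w_next = soft_threshold(v - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                      # sparse ground truth
Y = X @ w_true + 0.01 * rng.standard_normal(100)
print(np.round(fista_lasso(X, Y, lam=1.0), 3))
```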
4.2 Random Forest

Random Forest (Breiman 2001) is an ensemble method that uses the bagging technique. Ensemble learning constructs a prediction model by combining multiple machine learning algorithms in order to achieve better predictive power than any of those algorithms used alone (Trevor Hastie 2009, p. 605). Bagging (Bootstrap Aggregating) averages the results of the models in the ensemble equally. The technique trains each model in the ensemble on a bootstrap sample, a subset drawn at random from the training dataset (Trevor Hastie 2009, p. 282), thereby reducing variance. The Random Forest algorithm consists of a collection of decision trees. Its output is the class with the most votes for a classification problem, or the mean prediction of the individual trees for regression.

Figure 4.1: Random Forest structure (Source)

The Random Forest Regression used in our study operates through the following algorithm (Trevor Hastie 2009, p. 588):
Algorithm 2 Random Forest for Regression
Iterate:
for $b = 1, \ldots, B$ do
    1. Draw a bootstrap sample $Z$ of size $N$ from the training data.
    2. Grow a decision tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{\min}$ is reached:
        i. Select $m$ variables at random from the $p$ variables.
        ii. Pick the best variable/split-point among the $m$ using the Mean Squared Error.
        iii. Split the node into two daughter nodes.
end for
Output the ensemble of trees $\{T_b\}_{b=1}^{B}$
return The prediction at a new point $x$: $\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$
For details of how a single decision tree is constructed, we refer the reader to the original paper (Breiman et al. 1984). Note that the Random Forest algorithm considers only a limited number of randomly selected features, fewer than the total, when deciding how to split a node. This prevents the ensemble from over-relying on any individual feature and ensures a fair use of all features, making the model more robust.
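For reference, a minimal scikit-learn sketch of a Random Forest regressor is given below; the hyper-parameter values (the number of trees $B$ and the number of variables $m$ considered at each split) are illustrative, not those used in our experiments, and the data are synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=1000, n_features=20, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators plays the role of B (number of trees); max_features is m,
# the number of variables considered at each split.
forest = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print(round(r2_score(y_te, forest.predict(X_te)), 3))
```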
4.3 Gradient Boosting

Gradient Boosting is another form of ensemble method, one that applies the boosting technique. Boosting combines the predictions of many weak learners to form a powerful "committee". Unlike Random Forest, which builds a forest of decision trees simultaneously, Gradient Boosting generates trees sequentially, each of which aims to correct the errors made by the previous trees in the series.
The Gradient Boosting Regression algorithm is as follows (Friedman 2002):
Algorithm 3 Gradient Boosting Regression
Initialise: $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$, learning rate $\nu$
for $m = 1, \ldots, M$ do
    1. For $i = 1, \ldots, N$ compute
       $$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$
    2. Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \ldots, J_m$.
    3. For $j = 1, \ldots, J_m$ compute
       $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$$
    4. Update $f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$
end for
return $\hat{f}(x) = f_M(x)$
Algorithm 3 is determined by the choice of the loss function $L(y, f(x))$. In this study, our loss criterion is the least-squares loss $L(y_i, f(x_i)) = \frac{1}{2}(y_i - f(x_i))^2$. Optimising this function gives the first model, a single terminal-node tree that predicts the mean of the target $y$ in the training set. The negative gradient of the loss function computed in each iteration is called the pseudo-residual $r$. The succeeding trees are built following (Breiman et al. 1984), so each tree corrects the mistakes of the previous trees. The corrections are scaled by the learning rate $\nu$ to avoid high variance and increase robustness.
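A minimal from-scratch sketch of Algorithm 3 for the least-squares loss is shown below, using a scikit-learn regression tree as the weak learner. In the least-squares case the per-leaf constants $\gamma_{jm}$ of step 3 coincide with the leaf values of a tree fitted to the residuals with the MSE criterion, so the sketch folds that step into the fitted tree; the depth, learning rate and synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_ls_boost(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Least-squares boosting: each tree is fitted to the pseudo-residuals,
    which for the loss 0.5*(y - f)^2 are simply y - f."""
    f0 = y.mean()                                  # f_0: a single-leaf tree, the target mean
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        r = y - f                                  # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += nu * tree.predict(X)                  # correction scaled by the learning rate
        trees.append(tree)
    return f0, trees

def predict_ls_boost(f0, trees, X, nu=0.1):
    return f0 + nu * sum(t.predict(X) for t in trees)

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
f0, trees = fit_ls_boost(X, y)
print(round(float(np.mean((y - predict_ls_boost(f0, trees, X)) ** 2)), 3))
```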
4.4 Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) (Chen & Guestrin 2016) is another boosting method applied in this study. XGBoost is a variant of Gradient Boosting whose algorithm differs in how the embedded trees are structured, for example in the splitting method.
For a given dataset with $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$ $(|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$, the objective at iteration $t$ is to minimise

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t-1)} + w_j\bigr) + \Omega(w)$$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of data points assigned to the $j$-th leaf, $q$ is the tree structure that maps an example to its corresponding leaf index (so $j = q(x_i)$ above), and $w$ is the vector of leaf scores. Here $\hat{y}_i^{(t-1)}$ is the prediction for the $i$-th instance at iteration $t - 1$ and $l$ is the loss function to be chosen; we chose the common Mean Squared Error as the loss function in this case.
In XGBoost, a second-order Taylor approximation of the loss function is used for computational efficiency:

$$l\bigl(y_i, \hat{y}_i^{(t-1)} + w_j\bigr) \approx l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i w_j + \frac{1}{2} h_i w_j^2$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ are the first- and second-order derivatives of the loss function. The letters $g$ and $h$ reflect the fact that the first derivative of a function is often called the gradient and the second derivative the Hessian.
Removing the constant $l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ from the approximation, the objective function becomes:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \Bigl( g_i w_j + \frac{1}{2} h_i w_j^2 \Bigr) + \Omega(w)$$
We then set the regularisation term as follows:

$$\Omega(w) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Here $\gamma$ is the pruning parameter and $\lambda$ is the $\ell_2$ regularisation parameter. Hence, our objective function is given as:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \Bigl( g_i w_j + \frac{1}{2} h_i w_j^2 \Bigr) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \Bigl[ \Bigl( \sum_{i \in I_j} g_i \Bigr) w_j + \frac{1}{2} \Bigl( \sum_{i \in I_j} h_i + \lambda \Bigr) w_j^2 \Bigr] + \gamma T$$
We can compress the whole expression into a simpler form by letting $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$:

$$\mathcal{L}^{(t)} = \sum_{j=1}^{T} \Bigl[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \Bigr] + \gamma T \tag{4.4}$$
We then optimise equation (4.4) using the first-order condition with respect to $w_j$:

$$w_j = -\frac{G_j}{H_j + \lambda}$$

Hence, the corresponding objective function is:

$$\mathcal{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{4.5}$$
Equation (4.5) can be used as a metric to measure the quality of a tree structure $q$. Since it is impossible to enumerate all possible tree structures, a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Starting from a leaf node, we split it into two nodes, letting $I_L$ and $I_R$ be the instance sets of the left and right nodes after the split, so that $I = I_L \cup I_R$. The loss reduction after the split is then:

$$\mathcal{L}_{\text{split}} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_j^2}{H_j + \lambda} \right] - \gamma \tag{4.6}$$

where $G_j$ and $H_j$ refer to the original (unsplit) leaf.
Equation (4.6) can be decomposed into four parts: the score of the new left leaf, the score of the new right leaf, the score of the original leaf, and the regularisation on the additional leaf. We can observe that if the bracketed gain is smaller than $\gamma$, adding that branch does not improve the result, so the branch is removed; this is how pruning works in tree-based methods.
In order to find the best split candidate, we need a split-finding algorithm. In this work, we applied the so-called exact greedy algorithm, defined as "a split finding algorithm [that] enumerates over all the possible splits on all the features" (Chen & Guestrin 2016):
Algorithm 4 Exact Greedy Algorithm
Input: $I$, instance set of the current node
Input: $d$, feature dimension
Initialise: score $= 0$, $G = \sum_{i \in I} g_i$, $H = \sum_{i \in I} h_i$
Iterate:
for $k = 1, \ldots, m$ do
    $G_L = 0$, $H_L = 0$
    for $j$ in sorted($I$, by $x_{jk}$) do
        $G_L = G_L + g_j$, $H_L = H_L + h_j$
        $G_R = G - G_L$, $H_R = H - H_L$
        score $= \max\left( \text{score},\ \dfrac{G_L^2}{H_L + \lambda} + \dfrac{G_R^2}{H_R + \lambda} - \dfrac{G^2}{H + \lambda} \right)$
    end for
end for
return Split with max score
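The sketch below mimics the inner loop of Algorithm 4 for a single feature, scoring every candidate split with the gain of equation (4.6); the squared-error gradients and Hessians ($g = \hat{y} - y$, $h = 1$), the values of $\lambda$ and $\gamma$, and the toy data are assumptions made for illustration.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search over one feature, scoring each candidate
    split with the gain of equation (4.6)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_threshold = -np.inf, None
    for j in range(len(x) - 1):
        GL += g[j]
        HL += h[j]
        if x[j] == x[j + 1]:
            continue                               # cannot split between equal values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x[j] + x[j + 1])
    return best_threshold, best_gain

# Toy data with squared-error loss: g_i = yhat_i - y_i, h_i = 1.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = (x > 0).astype(float)
pred = np.full_like(y, y.mean())
print(best_split(x, pred - y, np.ones_like(y)))
```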
The algorithm is computationally demanding, as it requires sorting the data by feature values and then scanning them to compute the structure score of every possible split in order to find the optimal one. We refer to the original paper (Chen & Guestrin 2016) for additional information on other splitting algorithms and computations.
4.5 LightGBM

Figure 4.2: How LightGBM and other boosting algorithms work (Source)

LightGBM (Ke et al. 2017) is another variant of the Gradient Boosting method that mainly focuses on speeding up training. Instead of building trees horizontally, level-wise, LightGBM grows trees vertically, adding a new tree leaf at each iteration. Figure 4.2 illustrates the implementations of LightGBM and other Gradient Boosting techniques.
The computational efficiency of LightGBM comes from combining two techniques, Gradient-based One-sided Sampling (GOSS) and Exclusive Feature Bundling (EFB), which are described in the two sections below.
4.5.1 Gradient-based One-sided Sampling
This technique focuses the training effort on the instances with large gradients, which generate large errors. After each iteration, the instances with large gradients are kept, while a proportion of the instances with small gradients are dropped at random. The authors claim that this treatment yields a more accurate estimation than the traditional approach of sampling instances uniformly at random. The algorithm for this technique is given as:
Algorithm 5 Gradient-based One-sided Sampling
Input: $I$: training data, $d$: number of iterations
Input: $a$: sampling ratio of large-gradient data
Input: $b$: sampling ratio of small-gradient data
Input: loss: loss function, $L$: weak learner
Initialise: models $= \{\}$, fact $= \frac{1-a}{b}$, topN $= a \times \text{len}(I)$, randN $= b \times \text{len}(I)$
for $i = 1, \ldots, d$ do
    preds = models.predict($I$)
    g = loss($I$, preds), w = $\{1, 1, \ldots\}$
    sorted = GetSortedIndices(abs(g))
    topSet = sorted[1:topN]
    randSet = RandomPick(sorted[topN:len($I$)], randN)
    usedSet = topSet + randSet
    w[randSet] $\times=$ fact    ▷ assign weight fact to the small-gradient data
    newModel = $L$($I$[usedSet], $-$g[usedSet], w[usedSet])
    models.append(newModel)
end for
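A small NumPy sketch of the sampling step is given below; the sampling ratios $a$ and $b$ are illustrative, and the weak-learner fitting step of Algorithm 5 is omitted.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, seed=0):
    """Keep the top a*n instances by |gradient|, randomly keep b*n of the
    remainder, and up-weight the small-gradient ones by (1 - a) / b."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(-np.abs(grad))              # indices by |gradient|, descending
    top_idx = order[:top_n]
    rand_idx = rng.choice(order[top_n:], size=rand_n, replace=False)
    used = np.concatenate([top_idx, rand_idx])
    weights = np.ones(n)
    weights[rand_idx] *= (1.0 - a) / b             # compensate for the dropped instances
    return used, weights[used]

grad = np.random.default_rng(1).standard_normal(1000)
idx, w = goss_sample(grad)
print(len(idx), w[:3], w[-3:])
```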
4.5.2 Exclusive Feature Bundling
In practical applications, high-dimensional data are usually sparse: many features are (almost) mutually exclusive in the sparse feature space, meaning they rarely take nonzero values at the same time. EFB exploits this by bundling such exclusive features into a single feature. However, there are two issues when implementing this technique: the first is deciding which features should be bundled, and the second is how to bundle them.
Algorithm 6 Greedy Bundling
Input: $F$: features, $K$: max conflict count
Construct graph $G$
searchOrder = $G$.sortByDegree()
bundles = {}, bundlesConflict = {}
for $i$ in searchOrder do
    needNew = True
    for $j = 1, \ldots, \text{len(bundles)}$ do
        cnt = ConflictCnt(bundles[$j$], $F$[$i$])
        if cnt + bundlesConflict[$i$] $\le K$ then
            bundles[$j$].add($F$[$i$]), needNew = False
            break
        end if
    end for
    if needNew then
        Add $F$[$i$] as a new bundle to bundles
    end if
end for
return bundles
Algorithm 6 determines which features are to be bundled. The graph $G$ in this algorithm is constructed as follows: we take features as vertices and add an edge between every two features that are not mutually exclusive. The term "conflict" here refers to features that are not 100% mutually exclusive; a small number of conflicts can be tolerated while bundling in order to enhance computational efficiency.
For the second issue, Algorithm 7 shows how the features in a bundle should be merged.
We refer to the original paper for more details on the theoretical analysis of the algorithms shown in this chapter and for experimental results on LightGBM.
In this chapter, we only show the key points in which LightGBM differs from the other Gradient Boosting techniques. The construction of a decision tree in each iteration is similar to the algorithm shown in Section 4.3.
Algorithm 7 Merge Exclusive Feature
Input: numData: number of data points
Input: $F$: one bundle of exclusive features
binRanges = {0}, totalBin = 0
for $f$ in $F$ do
    totalBin += $f$.numBin
    binRanges.append(totalBin)
end for
newBin = new Bin(numData)
for $i = 1, \ldots, \text{numData}$ do
    newBin[$i$] = 0
    for $j = 1, \ldots, \text{len}(F)$ do
        if $F$[$j$].bin[$i$] $\neq$ 0 then
            newBin[$i$] = $F$[$j$].bin[$i$] + binRanges[$j$]
        end if
    end for
end for
return newBin, binRanges
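For completeness, a minimal usage sketch of the LightGBM package closes this section; the hyper-parameter values (including num_leaves, which controls the leaf-wise growth described above) are illustrative and the data synthetic, not the configuration used in our experiments.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor

# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=2000, n_features=30, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# num_leaves bounds the leaf-wise tree growth; values here are illustrative.
model = LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
model.fit(X_tr, y_tr)
print(round(r2_score(y_te, model.predict(X_te)), 3))
```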
4.6 Neural Networks
Figure 4.3: An example of Neural Networks architecture (Source: VIASAT)
The term Neural Network is biologically inspired by how humans process information in their brains. Figure 4.3 displays the architecture of a Neural Network. An artificial Neural Network consists of nodes, also called neurons, and activation functions. Each node has a collection of weights and a bias, which are learned during the training process. An activation function produces an output for a node depending on its input values. Every Neural Network includes three kinds of layers: the input layer receives information from the records of a dataset, the output layer produces the network's prediction, and hidden layers connect the input and the output with one another. Mathematically, a Neural Network model is defined as follows (Benning 2020, p. 28):

$$f_w(x) = \varphi_L\bigl(\varphi_{L-1}\bigl(\ldots \varphi_1\bigl(\varphi_0(x, w_1, b_1), w_2, b_2\bigr)\ldots, w_{L-1}, b_{L-1}\bigr), w_L, b_L\bigr) \tag{4.7}$$

Equation (4.7) shows how a Neural Network of $L$ layers is represented mathematically, where $\{\varphi_l\}_{l=1}^{L}$ is a set of $L$ activation functions with weights $w = \{w_l\}_{l=1}^{L}$ and biases $b = \{b_l\}_{l=1}^{L}$.
Typically, an activation function takes the form of an affine-linear transformation, defined as

$$\varphi(x, W, b) = W^T x + b \tag{4.8}$$

where $x \in \mathbb{R}^n$ is the vector of inputs, $W \in \mathbb{R}^{n \times m}$ is a weight matrix and $b \in \mathbb{R}^m$ is a bias vector. In this way the activation function maps $n$ inputs onto $m$ outputs. In this study, we only used this type of activation function in the output layer. For the input and hidden layers, the Rectified Linear Unit (ReLU) is chosen:

$$\varphi(x, W, b) = \max(0, W^T x + b) \tag{4.9}$$

This function combines the affine-linear transformation with the rectifier, $\varphi(x) = \max(0, x)$. An advantage of ReLU is that it is computationally efficient due to its simplicity, which has helped make it one of the most popular activation functions for Neural Networks. However, ReLU is not flawless: it suffers from a problem known as 'Dying ReLU'.
function is not flawless as there appears a problem called ’Dying ReLU’.
When the function produces an output of zero or a negative value, the gradient
of the function becomes zero. Thus, the process of backpropagation, which is
mention in the subsection below, can not perform in that neuron, making it turn
off. As we train a Network that appears to have those neurons, we may end up
having a large part of the network doing nothing. A solution for the Dying ReLU is
that we can choose to use the so-called Leaky ReLU, where we adjust the function
so its gradient has a small slope for negative values. Nevertheless, we ignored this
issue and only used ReLU for the hidden layers in this study. We would like to try
the Leaky ReLU in our future work for further experiments with Neural Networks.
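A small NumPy sketch of the rectifier and the Leaky ReLU variant mentioned above is given below; the slope of 0.01 for negative inputs is a common but arbitrary choice.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: keeps a small slope for negative inputs so the gradient
    never becomes exactly zero (one remedy for the 'Dying ReLU' problem)."""
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z))
```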
In order to produce accurate predictions, the network output needs to be matched closely to the actual values through a loss function. Combined with the choice of loss function, training a Neural Network can be written as the problem

$$\arg\min_{w, b} \left\{ \frac{1}{s} \sum_{i=1}^{s} l_i\bigl(f_w(x_i), y_i\bigr) \right\} \tag{4.10}$$
where $i \in \{1, \ldots, s\}$ and $\{l_i\}_{i=1}^{s}$ is a family of loss functions. For the regression problem of this study, we choose the least-squares loss $l_i\bigl(f_w(x_i), y_i\bigr) = \frac{1}{2}\bigl(f_w(x_i) - y_i\bigr)^2$, so that problem (4.10) becomes

$$\arg\min_{w, b} \left\{ \frac{1}{2s} \sum_{i=1}^{s} \bigl(f_w(x_i) - y_i\bigr)^2 \right\}$$

Since we have a differentiable neural network with differentiable activation functions, we can choose an algorithm to train the model. Our choice for this task is Adam.
4.6.1 Adam Algorithm
Adaptive Moment Estimation (Adam) is an extension of Stochastic Gradient De-
scent (SGD) (Kingma & Ba 2014). While SGD keeps only one learning for all
weights and does not adjust the rate during training, Adam provides a learning
rate for each weight and separtely adjust the rate through the training process.
The authors also claims that Adam inherits the advantages of Adaptive Gradient
Algorithm (Adagrad) and Root Mean Square Propagation (RMSProp). The ben-
efit of the former is to improve performance with sparse gradients and the benefit
of the latter is to improve the performance on online and non-stationary problems.
As a result, Adam is insisted to be effective to solve practical problems related to
the use of Neural Network (Kingma & Ba 2014).
The algorithm of Adam is as follow regard to the original paper (Kingma &
Ba 2014):
Algorithm 8 Adam
Specify: $f(\theta)$: stochastic objective function with parameters $\theta$
Specify: step size $\alpha$; decay rates $\beta_1, \beta_2 \in [0, 1)$; small constant $\epsilon$
Initialise: $\theta_0$, $m_0 = 0$, $v_0 = 0$
for $t = 1, \ldots, T$ do
    compute $g_t = \nabla_\theta f_t(\theta_{t-1})$
    compute $g_t^2 = g_t \odot g_t$    ($\odot$: Hadamard product (Appendix A))
    compute $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
    compute $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
    compute $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$
    compute $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$
    compute $\theta_t = \theta_{t-1} - \alpha \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
end for
return $\theta_T$
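The following NumPy sketch implements Algorithm 8 on a deterministic toy objective, using the default constants suggested in the original paper ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); in practice the gradient would come from backpropagation over mini-batches rather than from a closed-form expression.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """A minimal sketch of Algorithm 8; grad_fn returns the gradient at theta."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)                       # first-moment estimate
    v = np.zeros_like(theta)                       # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g      # element-wise (Hadamard) square
        m_hat = m / (1.0 - beta1**t)               # bias correction
        v_hat = v / (1.0 - beta2**t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy example: minimise f(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3).
print(np.round(adam(lambda th: 2.0 * (th - 3.0), np.zeros(3), alpha=0.1), 3))
```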
In order to implement Algorithm 8, we need to compute the gradients with
respect to the weights in different layers. The process of this computation is called
backpropagation.
4.6.2 Backpropagation
Backward propagation of errors (Backpropagation) is the practice of readjusting
the weights and biases of a Neural Network based on the error rate obtained in the
previous epoch of the training process. The t