Disquisitiones arithmeticæ: predicting the prices for breakfasts and beds

Declaration of original work

Abstract

1 Introduction

2 Literature Review

3 Experimental Design
  3.1 Dataset
  3.2 K-Fold Cross Validation
  3.3 Measuring Model Accuracy

4 Methods
  4.1 LASSO
    4.1.1 FISTA
  4.2 Random Forest
  4.3 Gradient Boosting
  4.4 Extreme Gradient Boosting
  4.5 LightGBM
    4.5.1 Gradient-based One-sided Sampling
    4.5.2 Exclusive Feature Bundling
  4.6 Neural Networks
    4.6.1 Adam Algorithm
    4.6.2 Backpropagation

 

This issue is critical in our case because some machine learning algorithms have a large number of hyper-parameters, and many combinations of them would need to be tested. For instance, Extreme Gradient Boosting has more than 10 hyper-parameters to tune, so running K-Fold Cross Validation over every combination is infeasible. Therefore, we only tuned the parameters expected to have the greatest impact on model performance and left the others at the default values set by the corresponding package.

3.3 Measuring Model Accuracy

In this project, we used several machine learning algorithms with different sets of parameters to build a price predictor. In order to choose the best candidate for this task, we need metrics that assess how those models perform. Performance is quantified by how close the predicted value for a given observation is to the true value of that observation. For a regression problem, the most commonly used metric is the mean squared error (MSE) (Hastie 2017, p. 29), which is given by

    MSE = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2    (3.1)

where \hat{f}(x_i) is the prediction produced by \hat{f} for the i-th observation. The MSE is small if the model generates precise predictions and large otherwise. In general, the MSE is computed on the training dataset to optimise a model, and the model's performance is then evaluated on the testing dataset.

The MSE defined above is not bounded to any fixed range. The smallest possible MSE is 0, the result of a model with perfect predictions, which is nearly impossible in practice. Therefore, picking the model with the smallest MSE does not by itself tell us whether that model could serve as a practical price-suggestion tool. This is where the R^2 statistic comes in as an alternative measure.

The R^2 statistic is the fraction of the variance in the target that can be predicted from the features (James H. Stock 2020, p. 153). This metric always takes a value between 0 and 1: an R^2 near 0 indicates a model with poor accuracy, while an R^2 close to 1 indicates a model that predicts the target well. The metric is given by

    R^2 = 1 - \frac{\sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (3.2)

where \bar{y} is the mean of the target we are trying to predict. Dividing both the numerator and the denominator by n and using (3.1), the formula can be rewritten as

    R^2 = 1 - \frac{\text{MSE of the model}}{\text{MSE of predicting the mean of the data}}    (3.3)

As the MSE shrinks towards 0, the R^2 grows towards 1, so the R^2 can be interpreted as a rescaling of the MSE. This is why we chose the R^2 as the main metric for model selection: its bounded scale is more intuitive to interpret.
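As a quick illustration of the two metrics, here is a minimal NumPy sketch; the array values are made up for the example and are not taken from our dataset.

```python
import numpy as np

# Hypothetical nightly prices and model predictions (illustrative values only).
y_true = np.array([50.0, 80.0, 120.0, 65.0, 95.0])
y_pred = np.array([55.0, 75.0, 110.0, 70.0, 100.0])

mse = np.mean((y_true - y_pred) ** 2)                  # equation (3.1)
mse_baseline = np.mean((y_true - y_true.mean()) ** 2)  # MSE of predicting the mean
r2 = 1.0 - mse / mse_baseline                          # equation (3.3)

print(f"MSE = {mse:.2f}, R^2 = {r2:.3f}")
```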
Chapter 4

Methods

4.1 LASSO

The least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) is a regression analysis technique introduced to improve prediction accuracy and to perform feature selection for regression models. LASSO seeks the solution of the following problem,

    \arg\min_{w \in \mathbb{R}^d} \bigl\{ \|Y - Xw\|_2^2 + \alpha \|w\|_1 \bigr\}    (4.1)

where X \in \mathbb{R}^{N \times d} is the data matrix with N records and d features, Y \in \mathbb{R}^N is the target vector, w \in \mathbb{R}^d is the vector of weight parameters, and the subscripts 1 and 2 indicate the l1 and l2 norms respectively (Appendix A). The problem above is non-differentiable, so we cannot apply Gradient Descent, the usual algorithm for regression models, to compute the LASSO. However, there are several methods for computing the LASSO solution, including Coordinate Descent (Tibshirani et al. 2010) and Least Angle Regression (Efron et al. 2004), which are implemented in popular machine learning packages such as Scikit-learn¹. Instead of using those two from a pre-written package, we implemented an alternative algorithm ourselves, the FISTA algorithm (Beck & Teboulle 2009), using the NumPy package (Oliphant 2006).

¹ LASSO User Guide in the Scikit-learn documentation.

4.1.1 FISTA

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is an iterative algorithm based on proximal operators for solving non-differentiable convex optimisation problems. The general optimisation problem is

    x^{*} = \arg\min_{x \in \mathbb{R}^d} \{ f(x) + g(x) \}    (4.2)

where:

• g : \mathbb{R}^n \to \mathbb{R} is a continuous convex function, possibly non-smooth, i.e. non-differentiable;
• f : \mathbb{R}^n \to \mathbb{R} is a smooth convex function with Lipschitz continuous gradient, i.e. there exists a Lipschitz constant L(f) such that \|\nabla f(x) - \nabla f(y)\| \le L(f)\|x - y\| for all x, y \in \mathbb{R}^n.

FISTA can be applied to many problems of the form (4.2), and the LASSO is among the best known. Hence, we can apply this algorithm to the following LASSO loss function,

    \min_{w} \Bigl\{ \frac{1}{2} \|Xw - Y\|_2^2 + \alpha \|w\|_1 \Bigr\}

This is a slightly modified version of (4.1), where a factor of 1/2 is added to the first term for mathematical convenience. We then put the loss into the form

    \min_{w} \{ f(w) + g(w) \}

with f(w) = \frac{1}{2}\|Xw - Y\|_2^2 and g(w) = \alpha\|w\|_1. To run the algorithm we need two ingredients: \nabla f(w) and the Lipschitz constant L(f).

First, applying the chain rule (Appendix B), the gradient of f with respect to the weights is

    \nabla f(w) = X^T (Xw - Y)

Next, we find the Lipschitz constant from \|\nabla f(a) - \nabla f(b)\|. Expanding the argument gives \|X^T(Xa - Y) - X^T(Xb - Y)\|, and factorising the common term X^T X gives \|X^T X (a - b)\|. Applying the norm inequality \|A(a - b)\| \le \|A\|\|a - b\| (Benning 2019), we obtain the Lipschitz constant L = \|X^T X\|.

FISTA is a refined version of the Iterative Shrinkage-Thresholding Algorithm (ISTA); both methods seek the solution of the proximal problem (Beck & Teboulle 2009)

    p_L(v) = \arg\min_{w} \Bigl\{ g(w) + \frac{L}{2} \Bigl\| w - \Bigl( v - \frac{1}{L} \nabla f(v) \Bigr) \Bigr\|^2 \Bigr\}

For our problem, we substitute g(w) = \lambda\|w\|_1 (writing \lambda for the regularisation parameter \alpha) and set z = v - \frac{1}{L}\nabla f(v). The problem then becomes

    p_L(z) = \arg\min_{w} \Bigl\{ \lambda \|w\|_1 + \frac{L}{2} \|w - z\|^2 \Bigr\}

A few steps of calculus give the soft-thresholding solution

    S_{\lambda/L}(z) = \begin{cases} z - \lambda/L, & z > \lambda/L \\ 0, & |z| \le \lambda/L \\ z + \lambda/L, & z < -\lambda/L \end{cases}    (4.3)

Formula (4.3) can also be written as S_{\tau}(z) = \mathrm{sign}(z)(|z| - \tau)_+, where the sign function is defined as \mathrm{sign}(x) = +1 for x > 0, 0 for x = 0 and -1 for x < 0, and (a)_+ = \max(a, 0).

We now have all the ingredients needed to run FISTA. Since the Lipschitz constant is easy to compute in this case, we follow the constant-step-size version of the algorithm from the original paper (Beck & Teboulle 2009). The algorithm proceeds as below; a short NumPy sketch follows the listing.

Algorithm 1 FISTA with constant step size
Input: L = L(f), a Lipschitz constant of \nabla f; \lambda: regularisation parameter
Initialise: w_0 \in \mathbb{R}^n, v_1 = w_0, t_1 = 1
for k = 1, ..., K - 1 do
    z_k = v_k - \frac{1}{L} \nabla f(v_k)
    w_k = S_{\lambda/L}(z_k)
    t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}
    v_{k+1} = w_k + \frac{t_k - 1}{t_{k+1}} (w_k - w_{k-1})
end for
return w_K
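For concreteness, the following is a minimal NumPy sketch of Algorithm 1 under the least-squares-plus-l1 objective above; it is an illustrative reimplementation, not the exact code used in our experiments.

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding operator S_tau(z) = sign(z) * max(|z| - tau, 0), equation (4.3)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    """FISTA with constant step size for min_w 0.5*||Xw - Y||^2 + lam*||w||_1."""
    L = np.linalg.norm(X.T @ X, 2)           # Lipschitz constant ||X^T X|| (spectral norm)
    w_prev = np.zeros(X.shape[1])
    v = w_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ v - Y)             # gradient of f at v
        w = soft_threshold(v - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w + ((t - 1.0) / t_next) * (w - w_prev)
        w_prev, t = w, t_next
    return w_prev

# Usage on synthetic data (values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([2.0, -3.0] + [0.0] * 8)   # sparse ground truth
Y = X @ true_w + 0.1 * rng.normal(size=100)
print(fista_lasso(X, Y, lam=1.0).round(2))
```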
4.2 Random Forest

Random Forest (Breiman 2001) is an ensemble method that uses the bagging technique. The idea of ensemble learning is to construct a prediction model by combining multiple machine learning algorithms in order to achieve better predictive power than any of those algorithms used alone (Trevor Hastie 2009, p. 605). Bagging (or Bootstrap Aggregating) averages the results of the models in the ensemble equally. The technique trains each model in the ensemble on a bootstrap sample, a subset drawn at random from the training dataset (Trevor Hastie 2009, p. 282), thereby reducing the variance. The Random Forest algorithm consists of a collection of decision trees. Its output is the class with the most votes for a classification problem, or the mean prediction of the individual trees for regression.

Figure 4.1: Random Forest structure (Source)

The Random Forest regression in our study operates through the following algorithm (Trevor Hastie 2009, p. 588):

Algorithm 2 Random Forest for Regression
for b = 1, ..., B do
    1. Draw a bootstrap sample Z of size N from the training data.
    2. Grow a decision tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
        i. Select m variables at random from the p variables.
        ii. Pick the best variable/split-point among the m using the mean squared error.
        iii. Split the node into two daughter nodes.
end for
Output the ensemble of trees \{T_b\}_{b=1}^{B}
return The prediction at a new point x: \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

The construction of a decision tree is described in detail in the original paper (Breiman et al. 1984). Note that the Random Forest algorithm uses only a limited number of features, fewer than the total, selected at random when choosing the split at each node. This prevents the ensemble from over-relying on any individual feature and gives all features a fair chance of being used, making the model more robust.

4.3 Gradient Boosting

Gradient Boosting is another ensemble method, one that applies the boosting technique. Boosting combines the predictions of many weak learners to produce a powerful "committee". Unlike Random Forest, which builds a forest of decision trees simultaneously, Gradient Boosting generates trees sequentially, each one built to reduce the errors made by the previous trees in the series. The Gradient Boosting regression algorithm is as follows (Friedman 2002):

Algorithm 3 Gradient Boosting Regression
Initialise: f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma); learning rate \nu
for m = 1, ..., M do
    1. For i = 1, ..., N compute
        r_{im} = - \Bigl[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \Bigr]_{f = f_{m-1}}
    2. Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm}, j = 1, ..., J_m.
    3. For j = 1, ..., J_m compute
        \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)
    4. Update f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})
end for
return \hat{f}(x) = f_M(x)

Algorithm 3 is determined by the choice of the loss function L(y, f(x)). In this study, our loss criterion is least squares, L(y, f(x)) = \frac{1}{2}(y_i - f(x_i))^2. Optimising this function gives the first model f_0, a single terminal-node tree predicting the mean of the target y in the training set. The negative gradient of the loss function computed at each iteration is called the pseudo-residual r. The succeeding trees are built following (Breiman et al. 1984), so each tree corrects the mistakes of the previous trees. The corrections are scaled by the learning rate \nu to avoid the problem of high variance, increasing the robustness.
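To make Algorithm 3 concrete for the least-squares loss, here is a minimal sketch using scikit-learn's DecisionTreeRegressor as the base learner. It is an illustration of the boosting loop (the leaf-wise \gamma step is absorbed into the fitted tree), not the implementation used in our experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Least-squares gradient boosting (Algorithm 3): f_0 is the mean of y,
    and each subsequent tree is fit to the pseudo-residuals r = y - f_{m-1}(x)."""
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # update scaled by the learning rate nu
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```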
4.4 Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) (Chen & Guestrin 2016) is another boosting method applied in this study. XGBoost is a variant of Gradient Boosting with a different algorithm for how the embedded trees are structured, for example in the splitting method. For a given dataset with n examples and m features, D = \{(x_i, y_i)\} (|D| = n, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}), the objective at iteration t is to minimise

    \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t-1)} + w_{q(x_i)}\bigr) + \Omega(w)

where I_j = \{i \mid q(x_i) = j\} is the set of data points assigned to the j-th leaf, q is the structure of the tree, mapping an example to the corresponding leaf index, and w is the vector of leaf scores. Here \hat{y}_i^{(t-1)} is the prediction for the i-th instance after t - 1 iterations and l is the loss function to be chosen; we chose the common mean squared error as the loss function in this case.

In XGBoost, a second-order Taylor approximation of the loss function is used for computational efficiency:

    l\bigl(y_i, \hat{y}_i^{(t-1)} + w_j\bigr) \approx l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i w_j + \frac{1}{2} h_i w_j^2

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) and h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) are the first- and second-order gradients of the loss function. The symbols g and h reflect the fact that the first derivative of a function is usually called the gradient and the second derivative the Hessian. Removing the constant l(y_i, \hat{y}_i^{(t-1)}) from the approximation, the objective function becomes

    \mathcal{L}^{(t)} = \sum_{i=1}^{n} \Bigl( g_i w_j + \frac{1}{2} h_i w_j^2 \Bigr) + \Omega(w)

We then set the regularisation term as

    \Omega(w) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

where \gamma is the pruning parameter, \lambda is the l2 regularisation parameter and T is the number of leaves. Hence our objective function is

    \mathcal{L}^{(t)} = \sum_{i=1}^{n} \Bigl( g_i w_j + \frac{1}{2} h_i w_j^2 \Bigr) + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
                      = \sum_{j=1}^{T} \Bigl[ \Bigl( \sum_{i \in I_j} g_i \Bigr) w_j + \frac{1}{2} \Bigl( \sum_{i \in I_j} h_i + \lambda \Bigr) w_j^2 \Bigr] + \gamma T

We can compress the equation into a simpler form by letting G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i:

    \mathcal{L}^{(t)} = \sum_{j=1}^{T} \Bigl[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \Bigr] + \gamma T    (4.4)

We then optimise equation (4.4) using the first-order condition with respect to w_j:

    w_j^{*} = - \frac{G_j}{H_j + \lambda}

Hence the corresponding optimal objective is

    \mathcal{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T    (4.5)

Equation (4.5) can be used as a metric to measure the quality of a tree structure q. Since it is impossible in practice to enumerate all possible tree structures, a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Starting from a single node, we split it into two nodes and let I_L and I_R be the instance sets of the left and right nodes after the split, so that I = I_L \cup I_R. The loss reduction after the split is

    \mathcal{L}_{split} = \frac{1}{2} \Bigl[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \Bigr] - \gamma    (4.6)

where G = G_L + G_R and H = H_L + H_R. Equation (4.6) can be decomposed into four parts: the score on the new left leaf, the score on the new right leaf, the score on the original leaf, and the regularisation on the additional leaf. If the bracketed term is smaller than \gamma, adding that branch does not improve the objective, so the branch is removed; this is how the pruning technique in tree-based methods works.
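As an illustration of equation (4.6), the following hypothetical helper computes the gain of a candidate split from the gradient and Hessian sums; the function name and arguments are ours, not part of the XGBoost API, and the example numbers are made up.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Loss reduction of a split, equation (4.6): positive values mean the
    split improves the objective; non-positive values would be pruned away."""
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma

# With a squared-error loss, g_i is proportional to the residual and h_i is a
# constant, so G and H are residual sums and (scaled) instance counts per child.
print(split_gain(G_L=-12.0, H_L=10.0, G_R=9.0, H_R=8.0, lam=1.0, gamma=0.5))
```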
In order to find the best split candidate, we need an algorithm to perform the search. In this work, we applied the so-called exact greedy algorithm, defined as "a split finding algorithm [that] enumerates over all the possible splits on all the features" (Chen & Guestrin 2016):

Algorithm 4 Exact Greedy Algorithm
Input: I, instance set of the current node
Input: d, feature dimension
Initialise: score = 0; G = \sum_{i \in I} g_i, H = \sum_{i \in I} h_i
for k = 1, ..., m do
    G_L = 0, H_L = 0
    for j in sorted(I, by x_{jk}) do
        G_L = G_L + g_j, H_L = H_L + h_j
        G_R = G - G_L, H_R = H - H_L
        score = max\Bigl( score, \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \Bigr)
    end for
end for
return Split with maximum score

The algorithm is computationally demanding: the data must be sorted by feature value and then scanned to compute the structure score of every possible split in order to find the optimal one. We refer to the original paper (Chen & Guestrin 2016) for additional information on other splitting algorithms and computations.

4.5 LightGBM

Figure 4.2: How LightGBM and other boosting algorithms work (Source)

LightGBM (Ke et al. 2017) is another variant of the Gradient Boosting method that mainly focuses on speeding up training. Instead of building trees horizontally, level-wise, LightGBM grows trees vertically, adding a new tree leaf at each iteration. Figure 4.2 illustrates how LightGBM differs from the other Gradient Boosting implementations. The computational efficiency of LightGBM comes from combining two techniques, Gradient-based One-sided Sampling (GOSS) and Exclusive Feature Bundling (EFB), which are described in the two sections below.

4.5.1 Gradient-based One-sided Sampling

This technique concentrates the training effort on the instances with large gradients, i.e. those that generate large errors. After each iteration, the instances with large gradients are kept, while a proportion of the instances with small gradients is dropped at random. The authors claim that this treatment provides a more accurate estimation than the traditional method in which instances are sampled uniformly at random. The algorithm for this technique is given below; a short Python sketch follows the listing.

Algorithm 5 Gradient-based One-sided Sampling
Input: I: training data, d: number of iterations
Input: a: sampling ratio of large-gradient data
Input: b: sampling ratio of small-gradient data
Input: loss: loss function, L: weak learner
models = {}, fact = (1 - a)/b
topN = a × len(I), randN = b × len(I)
for i = 1 to d do
    preds = models.predict(I)
    g = loss(I, preds), w = {1, 1, ...}
    sorted = GetSortedIndices(abs(g))
    topSet = sorted[1:topN]
    randSet = RandomPick(sorted[topN:len(I)], randN)
    usedSet = topSet + randSet
    w[randSet] ×= fact    ▷ Assign weight fact to the small-gradient data
    newModel = L(I[usedSet], -g[usedSet], w[usedSet])
    models.append(newModel)
end for
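The following is a minimal NumPy sketch of the sampling step in Algorithm 5 (selecting instances and reweighting the small-gradient sample). It is illustrative only and is not how the lightgbm library implements GOSS internally.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Return selected indices and per-instance weights for one GOSS step.
    Keeps the top a-fraction by |gradient| and a random b-fraction of the rest,
    up-weighting the random part by (1 - a) / b to keep the estimate unbiased."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)
    order = np.argsort(-np.abs(gradients))      # sort by decreasing |gradient|
    top_idx = order[:top_n]
    rand_idx = rng.choice(order[top_n:], size=rand_n, replace=False)
    used = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(used))
    weights[top_n:] *= (1.0 - a) / b            # compensate for the dropped instances
    return used, weights

# Illustrative usage with random gradients.
g = np.random.default_rng(0).normal(size=1000)
idx, w = goss_sample(g, a=0.2, b=0.1)
print(len(idx), w[:5])
```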
4.5.2 Exclusive Feature Bundling

In practical applications, high-dimensional data are usually sparse: many features are (almost) exclusive in the sparse feature space, meaning that they rarely take non-zero values at the same time. EFB exploits this by bundling such exclusive features into a single feature. There are, however, two issues when implementing the technique: the first is to choose which features should be bundled, and the second is how to merge them.

Algorithm 6 Greedy Bundling
Input: F: features, K: maximum conflict count
Construct graph G
searchOrder = G.sortByDegree()
bundles = {}, bundlesConflict = {}
for i in searchOrder do
    needNew = True
    for j = 1 to len(bundles) do
        cnt = ConflictCnt(bundles[j], F[i])
        if cnt + bundlesConflict[i] ≤ K then
            bundles[j].add(F[i]), needNew = False
            break
        end if
    end for
    if needNew then
        Add F[i] as a new bundle to bundles
    end if
end for
return bundles

Algorithm 6 determines which features should be bundled. The graph G in this algorithm is constructed as follows: features are taken as vertices, and an edge is added between every two features that are not mutually exclusive. The term "conflict" here refers to features that are not 100% mutually exclusive; a small number of such conflicts can be tolerated when bundling features in order to enhance computational efficiency. For the second issue, Algorithm 7 shows how a bundle of features is merged. We refer to the original paper for the theoretical analysis of the algorithms shown in this chapter and for experimental results on LightGBM; here we only highlight the points where LightGBM differs from the other Gradient Boosting techniques. The construction of the decision tree at each iteration is similar to the algorithm shown in Section 4.3. A brief usage sketch of the lightgbm package follows Algorithm 7.

Algorithm 7 Merge Exclusive Features
Input: numData: number of data points
Input: F: one bundle of exclusive features
binRanges = {0}, totalBin = 0
for f in F do
    totalBin += f.numBin
    binRanges.append(totalBin)
end for
newBin = new Bin(numData)
for i = 1 to numData do
    newBin[i] = 0
    for j = 1 to len(F) do
        if F[j].bin[i] ≠ 0 then
            newBin[i] = F[j].bin[i] + binRanges[j]
        end if
    end for
end for
return newBin, binRanges
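In our experiments we did not re-implement GOSS or EFB ourselves; they are built into the lightgbm package. A minimal usage sketch with the scikit-learn-style API might look as follows; the hyper-parameter values are placeholders, not our tuned settings, and the synthetic data merely stand in for the listing features and prices.

```python
import numpy as np
import lightgbm as lgb

# Illustrative synthetic data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 20 + X[:, 1] * 5 + rng.normal(scale=2.0, size=500)

model = lgb.LGBMRegressor(
    n_estimators=500,    # number of boosting iterations
    num_leaves=31,       # leaf-wise growth: maximum leaves per tree
    learning_rate=0.05,
)
model.fit(X, y)
print(model.predict(X[:5]))
```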
4.6 Neural Networks

Figure 4.3: An example of Neural Networks architecture (Source: VIASAT)

The term Neural Network is biologically inspired by the way humans process information in their brains. Figure 4.3 displays the architecture of a Neural Network. An artificial Neural Network consists of nodes, also called neurons, and activation functions. Each node has a collection of weights and a bias, which are computed through the learning process. An activation function produces an output for a node depending on its input values. Every Neural Network includes three kinds of layers: the input layer receives information from the records of a dataset, the output layer produces the network prediction, and the hidden layers connect the input and the output with one another. Mathematically, a Neural Network model is defined as follows (Benning 2020, p. 28):

    f_w(x) = \varphi_L(\varphi_{L-1}(\ldots \varphi_1(\varphi_0(x, w_1, b_1), w_2, b_2) \ldots, w_{L-1}, b_{L-1}), w_L, b_L)    (4.7)

Equation (4.7) shows how a Neural Network of L layers is represented mathematically. Here \{\varphi_l\}_{l=1}^{L} is a set of L activation functions with weights w = \{w_l\}_{l=1}^{L} and biases b = \{b_l\}_{l=1}^{L}. Typically, an activation function takes the form of an affine-linear transformation, defined as

    \varphi(x, W, b) = W^T x + b    (4.8)

where x \in \mathbb{R}^n is the vector of n inputs, W \in \mathbb{R}^{n \times m} is a weight matrix and b \in \mathbb{R}^m is a bias vector. In this way, the activation function maps n inputs onto m outputs. In this study, we used this type of activation function only in the output layer. For the input and hidden layers, the Rectified Linear Unit (ReLU) was chosen. The function is the following:

    \varphi(x, W, b) = \max(0, W^T x + b)    (4.9)

This function is a combination of the affine-linear transformation and the rectifier, \varphi(x) = \max(0, x). An advantage of ReLU is that it is computationally efficient due to its simplicity, which has helped make it one of the most popular activation functions for Neural Networks. However, the function is not flawless: there is a problem known as the "Dying ReLU". When the input to the rectifier is zero or negative, the output and the gradient of the function are zero, so backpropagation (described in the subsection below) cannot update that neuron and it effectively turns off. If we train a network in which this happens to many neurons, we may end up with a large part of the network doing nothing. A solution to the Dying ReLU is the so-called Leaky ReLU, in which the function is adjusted so that its gradient has a small slope for negative values. Nevertheless, we set this issue aside and used only ReLU for the hidden layers in this study; we would like to try the Leaky ReLU in future work for further experiments with Neural Networks.

In order to produce accurate predictions, the output of the network needs to be compared with the actual values through a loss function. Combined with the choice of loss function, training a Neural Network can be stated as the problem

    \arg\min_{w, b} \Bigl\{ \frac{1}{s} \sum_{i=1}^{s} l_i\bigl(f_w(x_i), y_i\bigr) \Bigr\}    (4.10)

where i \in \{1, ..., s\} and \{l_i\}_{i=1}^{s} is a family of loss functions. For the regression problem of this study, we choose least squares, l_i = \frac{1}{2}(f_w(x_i) - y_i)^2, so that problem (4.10) becomes

    \arg\min_{w, b} \Bigl\{ \frac{1}{2s} \sum_{i=1}^{s} \bigl(f_w(x_i) - y_i\bigr)^2 \Bigr\}

Since we have a differentiable neural network with differentiable activation functions, we can choose an algorithm to train the model. Our choice for this task is Adam.

4.6.1 Adam Algorithm

Adaptive Moment Estimation (Adam) is an extension of Stochastic Gradient Descent (SGD) (Kingma & Ba 2014). While SGD keeps a single learning rate for all weights and does not adjust the rate during training, Adam provides a learning rate for each weight and adapts it separately throughout the training process. The authors also claim that Adam inherits the advantages of the Adaptive Gradient Algorithm (Adagrad) and Root Mean Square Propagation (RMSProp): the former improves performance on problems with sparse gradients, and the latter improves performance on online and non-stationary problems. As a result, Adam is argued to be effective for the practical problems in which Neural Networks are used (Kingma & Ba 2014). The algorithm, following the original paper (Kingma & Ba 2014), is as follows:

Algorithm 8 Adam
Specify: f(\theta): stochastic objective function with parameters \theta
Specify: \alpha, \beta_1, \beta_2 \in [0, 1), \epsilon
Initialise: \theta_0, m_0 = 0, v_0 = 0
for t = 1, ..., T do
    g_t = \nabla_{\theta} f_t(\theta_{t-1})
    g_t^2 = g_t \odot g_t    (\odot: Hadamard product, Appendix A)
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \hat{m}_t = m_t / (1 - \beta_1^t)
    \hat{v}_t = v_t / (1 - \beta_2^t)
    \theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
end for
return \theta_T

In order to implement Algorithm 8, we need the gradients of the loss with respect to the weights in the different layers. The process of computing these gradients is called backpropagation.
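As an illustration of Algorithm 8, here is a minimal NumPy sketch of Adam applied to a simple quadratic objective; it is a toy example rather than the training loop used for our networks.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam (Algorithm 8): per-parameter adaptive steps from bias-corrected
    first- and second-moment estimates of the gradient."""
    theta = theta0.astype(float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g          # element-wise (Hadamard) square
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective: minimise ||theta - c||^2, whose gradient is 2*(theta - c).
c = np.array([3.0, -1.0])
print(adam(lambda th: 2 * (th - c), theta0=np.zeros(2), alpha=0.1))
```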
4.6.2 Backpropagation

Backward propagation of errors (backpropagation) is the procedure of readjusting the weights and biases of a Neural Network based on the error obtained in the previous epoch of the training process.
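To make the idea concrete, below is a minimal NumPy sketch of backpropagation for a one-hidden-layer ReLU network trained with the least-squares loss of Section 4.6. It is a simplified illustration on a toy problem, using plain gradient descent, and is not the architecture or code used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: one target from three features (illustrative values only).
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# One hidden layer with ReLU, affine-linear output layer (Section 4.6).
W1, b1 = rng.normal(scale=0.1, size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)
lr = 0.01

for epoch in range(500):
    # Forward pass.
    z1 = X @ W1 + b1
    a1 = np.maximum(0.0, z1)            # ReLU, equation (4.9)
    pred = (a1 @ W2 + b2).ravel()

    # Backward pass: propagate the least-squares error layer by layer.
    err = (pred - y) / len(y)           # derivative of the averaged squared error
    dW2 = a1.T @ err[:, None]
    db2 = err.sum(keepdims=True)
    da1 = err[:, None] @ W2.T
    dz1 = da1 * (z1 > 0)                # ReLU gradient is zero for negative inputs
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient step (Adam from Section 4.6.1 could be used here instead).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", np.mean((pred - y) ** 2))
```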
