CS229 Lecture Notes
Andrew Ng (updates by Tengyu Ma)

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    2400                    369
    1416                    232
    3000                    540
    ...                     ...

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use x(i) to denote the "input" variables (living area in this example), also called input features, and y(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x(i), y(i)) is called a training example, and the dataset that we'll be using to learn, a list of n training examples {(x(i), y(i)); i = 1, ..., n}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore:

    training set → learning algorithm → h
    x → h → predicted y (predicted price of house)

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Part I: Linear Regression

To make the housing example more interesting, suppose we also know the number of bedrooms of each house, so that each input x(i) is a two-dimensional vector: x1(i) is the living area and x2(i) is the number of bedrooms. To perform supervised learning, we must decide how to represent the hypothesis h. As an initial choice, let's approximate y as a linear function of x:

    hθ(x) = θ0 + θ1 x1 + θ2 x2

Here, the θj's are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. With the convention x0 = 1 (the intercept term), we can write this more compactly as hθ(x) = θᵀx.

Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the hθ(x(i))'s are to the corresponding y(i)'s. We define the cost function:

    J(θ) = (1/2) Σᵢ (hθ(x(i)) − y(i))²

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.

1. LMS algorithm

We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let's consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:

    θj := θj − α ∂J(θ)/∂θj

(This update is simultaneously performed for all values of j = 0, ..., d.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.

(We use the notation "a := b" to denote an operation, in a computer program, in which we set the value of a variable a to be equal to the value of b. In other words, this operation overwrites a with the value of b. In contrast, we write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.)

To work out the partial derivative term on the right hand side, let's first consider the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We then have

    ∂J(θ)/∂θj = (hθ(x) − y) xj,

which for a single training example gives the update rule:

    θj := θj + α (y(i) − hθ(x(i))) xj(i)

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. It has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y(i) − hθ(x(i))); thus, if we encounter a training example on which our prediction nearly matches the actual value of y(i), then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction hθ(x(i)) has a large error (i.e., if it is very far from y(i)).

There are two ways to modify this method for a training set of more than one example. The first replaces it with the following algorithm:

    Repeat until convergence {
        θj := θj + α Σᵢ (y(i) − hθ(x(i))) xj(i)    (for every j)
    }

By grouping the updates of the coordinates into an update of the vector θ, we can rewrite this update in a slightly more succinct way: θ := θ + α Σᵢ (y(i) − hθ(x(i))) x(i). The reader can easily verify that the quantity in the summation is just ∂J(θ)/∂θj (for the original definition of J), so this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima, so gradient descent always converges (assuming the learning rate α is not too large) to the global minimum; indeed, J is a convex quadratic function.

Running batch gradient descent to fit θ on our housing dataset gives a straight-line fit to the data. If the number of bedrooms were included as one of the input features as well, we get θ0 = 89.60, θ1 = 0.1392, θ2 = −8.738.
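The following is a minimal NumPy sketch of batch gradient descent with the LMS update above. The scaling of the areas to thousands of square feet (so one learning rate works for both coordinates), the learning rate α = 0.01, and the iteration count are illustrative implementation choices, not prescriptions from the notes.

    import numpy as np

    # Toy version of the housing data from the notes:
    # intercept term x0 = 1, living area in thousands of ft^2 -> price ($1000s).
    X = np.array([[1, 2.104], [1, 1.600], [1, 2.400], [1, 1.416], [1, 3.000]])
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

    theta = np.zeros(2)   # initial guess
    alpha = 0.01          # learning rate

    for _ in range(10000):
        # Batch LMS update: theta := theta + alpha * sum_i (y(i) - h(x(i))) x(i),
        # i.e. gradient descent on J(theta) = (1/2) sum_i (h(x(i)) - y(i))^2.
        theta += alpha * X.T @ (y - X @ theta)

    print(theta)   # prediction is theta[0] + theta[1] * (area / 1000)

Because J is a convex quadratic, this loop converges to the unique global minimizer for any sufficiently small α.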
The above results were obtained with batch gradient descent. There is an alternative to batch gradient descent that also works very well. Consider the following algorithm:

    Loop {
        for i = 1 to n {
            θj := θj + α (y(i) − hθ(x(i))) xj(i)    (for every j)
        }
    }

In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if n is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.) By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around the minimum. For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
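A sketch of the stochastic version on the same toy data; visiting the examples in a random order each pass and the fixed epoch count are implementation choices, not part of the algorithm as stated above.

    import numpy as np

    X = np.array([[1, 2.104], [1, 1.600], [1, 2.400], [1, 1.416], [1, 3.000]])
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

    rng = np.random.default_rng(0)
    theta = np.zeros(2)
    alpha = 0.01

    for _ in range(2000):                    # repeatedly run through the training set
        for i in rng.permutation(len(y)):    # one training example at a time
            # LMS rule for a single example:
            # theta_j := theta_j + alpha * (y(i) - theta^T x(i)) * x(i)_j
            theta += alpha * (y[i] - X[i] @ theta) * X[i]

    print(theta)   # oscillates near, rather than converging exactly to, the minimizer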
2. The normal equations

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θj's, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.

For a function f : R^(n×d) → R mapping from n-by-d matrices to the real numbers, we define the derivative of f with respect to A to be the matrix ∇_A f(A) whose (i, j)-element is ∂f/∂A_ij. Thus, the gradient ∇_A f(A) is itself an n-by-d matrix. Here, A_ij denotes the (i, j) entry of the matrix A.

Given a training set, define the design matrix X to be the n-by-d matrix (actually n-by-(d+1), if we include the intercept term) that contains the training examples' input values in its rows, and let ~y be the n-dimensional vector containing all the target values y(i) from the training set. Given X (the design matrix, which contains all the x(i)'s) and θ, we can write

    J(θ) = (1/2)(Xθ − ~y)ᵀ(Xθ − ~y).

To minimize J, we set its derivatives to zero. Using the facts ∇ₓ bᵀx = b and ∇ₓ xᵀAx = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of "Linear Algebra Review and Reference"), we obtain the normal equations:

    XᵀX θ = Xᵀ~y

Thus, the value of θ that minimizes J(θ) is given in closed form by

    θ = (XᵀX)⁻¹ Xᵀ~y.

Note that in this step we are implicitly assuming that XᵀX is an invertible matrix; this can be checked before calculating the inverse. If the number of linearly independent examples is fewer than the number of features, or if the features are not linearly independent, then XᵀX will not be invertible. Even in such cases, it is possible to "fix" the situation with additional techniques, which we skip here for the sake of simplicity.
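As a sketch, the closed form can be evaluated directly in NumPy. Using np.linalg.solve applies the inverse without forming it explicitly, which is the numerically preferred route; the data below reuses the housing numbers from the notes.

    import numpy as np

    # Design matrix X (one training example per row, intercept column included)
    # and target vector y.
    X = np.array([[1, 2104.0], [1, 1600.0], [1, 2400.0], [1, 1416.0], [1, 3000.0]])
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

    # theta = (X^T X)^{-1} X^T y, solved as the linear system (X^T X) theta = X^T y.
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta)

    # If X^T X were singular (e.g., linearly dependent features), np.linalg.pinv
    # would give a minimum-norm solution instead of raising an error.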
3. Probabilistic interpretation

When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function J, be a reasonable choice? In this section, we will give a set of probabilistic assumptions, under which least-squares regression is derived as a very natural algorithm.

Let us assume that the target variables and the inputs are related via the equation

    y(i) = θᵀx(i) + ǫ(i),

where ǫ(i) is an error term that captures either unmodeled effects (such as features we'd left out of the regression) or random noise. Let us further assume that the ǫ(i)'s are distributed IID according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance σ². We can write this assumption as "ǫ(i) ∼ N(0, σ²)," i.e., the density of ǫ(i) is given by

    p(ǫ(i)) = (1/√(2πσ²)) exp(−(ǫ(i))²/(2σ²)).

This implies the distribution of y(i) as y(i) | x(i); θ ∼ N(θᵀx(i), σ²). Note that we should not condition on θ ("p(y(i)|x(i), θ)"), since θ is not a random variable; the parameters are not random variables, normally distributed or otherwise.

Given X (the design matrix, which contains all the x(i)'s) and θ, what is the distribution of the y(i)'s? The probability of the data is given by p(~y | X; θ). This quantity is typically viewed as a function of ~y (and perhaps X), for a fixed value of θ. When we wish to explicitly view this as a function of θ, we will instead call it the likelihood function:

    L(θ) = L(θ; X, ~y) = p(~y | X; θ).

Note that by the independence assumption on the ǫ(i)'s (and hence also the y(i)'s given the x(i)'s), this can also be written as the product over i of p(y(i) | x(i); θ).

Now, given this probabilistic model relating the y(i)'s and the x(i)'s, what is a reasonable way of choosing our best guess of the parameters θ? The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible; i.e., we should choose θ to maximize L(θ). Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ). In particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ):

    ℓ(θ) = log L(θ) = n log(1/√(2πσ²)) − (1/σ²) · (1/2) Σᵢ (y(i) − θᵀx(i))².

Hence, maximizing ℓ(θ) gives the same answer as minimizing (1/2) Σᵢ (y(i) − θᵀx(i))², which we recognize to be J(θ), our original least-squares cost function. To summarize: under the preceding probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ. Note also that the final choice of θ did not depend on σ².
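As a quick numerical sanity check of this equivalence (a sketch, not part of the notes' derivation): with σ² = 1, the gradient of ℓ(θ) is Xᵀ(~y − Xθ), so gradient ascent on the log likelihood lands on the same θ as the normal equations.

    import numpy as np

    X = np.array([[1, 2.104], [1, 1.600], [1, 2.400], [1, 1.416], [1, 3.000]])
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

    # Gradient ascent on l(theta): grad l = (1/sigma^2) X^T (y - X theta),
    # the same direction as the negative gradient of J(theta). Here sigma^2 = 1.
    theta = np.zeros(2)
    for _ in range(20000):
        theta += 0.01 * X.T @ (y - X @ theta)

    # Maximizing the likelihood recovers the least-squares solution:
    assert np.allclose(theta, np.linalg.solve(X.T @ X, X.T @ y))
    print(theta)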
4. Locally weighted linear regression

As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. (You will get to explore some of the properties of the LWR algorithm yourself in the homework; see also the extra credit problem on Q3 of problem set 1.)

In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would:

    1. Fit θ to minimize Σᵢ (y(i) − θᵀx(i))².
    2. Output θᵀx.

In contrast, the locally weighted linear regression algorithm does the following:

    1. Fit θ to minimize Σᵢ w(i) (y(i) − θᵀx(i))².
    2. Output θᵀx.

Here, the w(i)'s are non-negative valued weights. Intuitively, if w(i) is large for a particular value of i, then in picking θ, we'll try hard to make (y(i) − θᵀx(i))² small. If w(i) is small, then the (y(i) − θᵀx(i))² error term will be pretty much ignored in the fit.

A fairly standard choice for the weights is

    w(i) = exp(−(x(i) − x)² / (2τ²)).

If x is vector-valued, this is generalized to w(i) = exp(−(x(i) − x)ᵀ(x(i) − x)/(2τ²)), or w(i) = exp(−(x(i) − x)ᵀΣ⁻¹(x(i) − x)/2), for an appropriate choice of τ or Σ. Note that the weights depend on the particular point x at which we're trying to make a prediction. Moreover, if |x(i) − x| is small, then w(i) is close to 1; and if |x(i) − x| is large, then w(i) is small. Hence, θ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x. The parameter τ controls how quickly the weight of a training example falls off with distance from the query point, and is called the bandwidth parameter; this is also something that you'll get to experiment with in your homework. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)'s do not directly have anything to do with Gaussians, and in particular the w(i)'s are not random variables, normally distributed or otherwise.)

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θj's), which are fit to the data. Once we've fit the θj's and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
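A minimal sketch of LWR at a single query point. The closed-form weighted fit θ = (XᵀWX)⁻¹XᵀW~y used here is the standard weighted least-squares solution (it is derived in the problem set rather than in these notes), and τ = 0.8 on the scaled data is an arbitrary bandwidth choice.

    import numpy as np

    def lwr_predict(x_query, X, y, tau=0.8):
        """Locally weighted linear regression: re-fit theta at query time,
        weighting each example by w(i) = exp(-(x(i) - x)^2 / (2 tau^2))."""
        # Weights fall off with distance from the query point; tau is the bandwidth.
        w = np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2 * tau ** 2))
        W = np.diag(w)
        # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y.
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta

    X = np.array([[1, 2.104], [1, 1.600], [1, 2.400], [1, 1.416], [1, 3.000]])
    y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
    print(lwr_predict(np.array([1, 2.0]), X, y))   # estimate near area 2000 ft^2

Because θ is re-fit for every query, the whole training set must be kept around, which is exactly the non-parametric behavior discussed above.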
Part II: Classification and logistic regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols "−" and "+". Given x(i), the corresponding y(i) is also called the label for the training example.

5. Logistic regression

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form of our hypothesis:

    hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)),

where g(z) = 1/(1 + e^(−z)) is called the logistic function or the sigmoid function. A useful property of the logistic function is the form of its derivative: g′(z) = g(z)(1 − g(z)).

So, given the logistic regression model, how do we fit θ for it? Following how we saw least-squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood. Let us assume that

    P(y = 1 | x; θ) = hθ(x)
    P(y = 0 | x; θ) = 1 − hθ(x).

Note that this can be written more compactly as p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y). Assuming that the n training examples were generated independently, we can then write down the likelihood of the parameters as L(θ) = p(~y | X; θ) = Πᵢ p(y(i) | x(i); θ). As before, it will be easier to maximize the log likelihood ℓ(θ) = log L(θ).

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by θ := θ + α∇θℓ(θ). (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Starting with one training example (x, y), taking derivatives, and using the fact that g′(z) = g(z)(1 − g(z)), we obtain the stochastic gradient ascent rule:

    θj := θj + α (y(i) − hθ(x(i))) xj(i)

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i). Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models.
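A minimal sketch of batch gradient ascent for logistic regression; the tiny dataset, learning rate, and iteration count below are made up for illustration.

    import numpy as np

    def g(z):
        """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_fit(X, y, alpha=0.1, iters=5000):
        """Batch gradient *ascent* on the log likelihood l(theta):
        theta := theta + alpha * X^T (y - g(X theta))."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta += alpha * X.T @ (y - g(X @ theta))
        return theta

    # One feature plus intercept; the classes overlap so the maximizer is finite.
    X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
    theta = logistic_fit(X, y)
    print(g(X @ theta))   # predicted P(y = 1 | x; theta) for each example

Note how the update line is textually the same as the LMS sketch earlier, with g(Xθ) in place of Xθ: the identical-looking rule discussed above, applied to a different hypothesis.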
6. Digression: the perceptron learning algorithm

We now digress to talk briefly about an algorithm that's of some historical interest, and that we will also return to later when we talk about learning theory. Consider modifying the logistic regression method to "force" it to output values that are either 0 or 1 exactly. To do so, it seems natural to change the definition of g to be the threshold function:

    g(z) = 1 if z ≥ 0
    g(z) = 0 if z < 0

If we then let hθ(x) = g(θᵀx) as before but using this modified definition of g, and if we use the update rule

    θj := θj + α (y(i) − hθ(x(i))) xj(i),

then we have the perceptron learning algorithm.
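A sketch of the perceptron on a small separable toy set; the data is invented for illustration. With the hard threshold output, the magnitude of α only rescales θ (when θ starts at zero), so α = 1 is customary.

    import numpy as np

    def perceptron_fit(X, y, alpha=1.0, epochs=10):
        """Perceptron learning algorithm: the same update shape as LMS and
        logistic regression, but with the hard threshold g(z) = 1{z >= 0}."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):
                h = 1.0 if X[i] @ theta >= 0 else 0.0   # forced 0/1 output
                theta += alpha * (y[i] - h) * X[i]      # no update when h == y(i)
        return theta

    X = np.array([[1, -2.0], [1, -1.0], [1, 0.5], [1, 1.5]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(perceptron_fit(X, y))   # a separating theta, e.g. boundary at x = 0.5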
7. Another algorithm for maximizing ℓ(θ)

Returning to logistic regression with g(z) being the sigmoid function, let's now talk about a different algorithm for maximizing ℓ(θ). To get us started, consider Newton's method for finding a zero of a function. Specifically, suppose we have some function f : R → R, and we wish to find a value of θ so that f(θ) = 0. Newton's method performs the update

    θ := θ − f(θ)/f′(θ).

This method has a natural interpretation: it approximates f by the linear function that is tangent to f at the current guess θ, solves for where that linear function equals zero, and lets the next guess of θ be the point where it crosses zero. In the worked example in the original figures, one iteration of Newton's method takes the initial guess to about 2.8; one more iteration updates θ to about 1.8; and after a few more iterations, we rapidly approach θ = 1.3, the zero of f.

Newton's method gives a way of getting to f(θ) = 0. What if we want to use it to maximize some function ℓ? The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero. So, by letting f(θ) = ℓ′(θ), we can use the same algorithm to maximize ℓ, and we obtain the update rule:

    θ := θ − ℓ′(θ)/ℓ″(θ).

(Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?)

Lastly, in our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method to this setting. The generalization of Newton's method to this multidimensional setting (also called the Newton-Raphson method) is given by

    θ := θ − H⁻¹ ∇θℓ(θ).

Here, ∇θℓ(θ) is, as usual, the vector of partial derivatives of ℓ(θ) with respect to the θj's; and H is a d-by-d matrix (actually (d+1)-by-(d+1), assuming that we include the intercept term) called the Hessian, whose entries are given by Hij = ∂²ℓ(θ)/∂θi∂θj.

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting a d-by-d Hessian; but so long as d is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.
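A sketch of Newton's method applied to the logistic log likelihood, using the standard Hessian H = −Xᵀ diag(hθ(x)(1 − hθ(x))) X for this model; the toy data is the same illustrative set as before.

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_newton(X, y, iters=10):
        """Newton's method for maximizing the logistic log likelihood:
        theta := theta - H^{-1} grad l(theta)."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            h = g(X @ theta)
            grad = X.T @ (y - h)                  # gradient of l(theta)
            # Hessian of l(theta): -X^T diag(h(1-h)) X, computed by scaling
            # each column of X^T by its example's weight h(i)(1 - h(i)).
            H = -(X.T * (h * (1 - h))) @ X
            theta -= np.linalg.solve(H, grad)     # theta := theta - H^{-1} grad
        return theta

    X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
    print(logistic_newton(X, y))   # matches gradient ascent, in far fewer iterations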
Part III: Generalized Linear Models

So far, we've seen a regression example and a classification example. In the regression example, we had y | x; θ ∼ N(μ, σ²), and in the classification one, y | x; θ ∼ Bernoulli(φ), for some appropriate definitions of μ and φ as functions of x and θ. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems. (The presentation of the material in this section takes inspiration from Michael I. Jordan, Learning in graphical models (unpublished book draft), and also McCullagh and Nelder, Generalized Linear Models (2nd ed.).)

8. The exponential family

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

    p(y; η) = b(y) exp(ηᵀT(y) − a(η)).

Here, η is called the natural parameter (also called the canonical parameter) of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function. The quantity e^(−a(η)) essentially plays the role of a normalization constant, which makes sure the distribution p(y; η) sums/integrates over y to 1.

A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family.
We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean φ specifies a distribution over y ∈ {0, 1}, so that p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ. As we vary φ, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, the set obtained by varying φ, is in the exponential family; i.e., that there is a choice of T, a and b so that the exponential family form becomes exactly the class of Bernoulli distributions. Writing

    p(y; φ) = φ^y (1 − φ)^(1−y) = exp(y log(φ/(1−φ)) + log(1 − φ)),

we can read off the natural parameter η = log(φ/(1−φ)), together with T(y) = y, a(η) = −log(1 − φ) = log(1 + e^η), and b(y) = 1.

Similarly, recall that when deriving linear regression, the value of σ² had no effect on our final choice of θ; so for simplicity we can set σ² = 1, and the N(μ, 1) density can be written in exponential family form with η = μ, T(y) = y, a(η) = η²/2, and b(y) = (1/√(2π)) exp(−y²/2).

Given these building blocks, one can construct GLMs for these and many other distributions, and in doing so recover least-squares regression and logistic regression as special cases. This is the deeper reason, promised earlier, why the two methods end up with the same update rule.
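As a small check of the algebra above, the exponential-family form of the Bernoulli can be verified numerically:

    import numpy as np

    # Bernoulli(phi) in exponential family form p(y; eta) = b(y) exp(eta*T(y) - a(eta)),
    # with T(y) = y, b(y) = 1, eta = log(phi / (1 - phi)), a(eta) = log(1 + e^eta).
    phi = 0.3
    eta = np.log(phi / (1 - phi))     # natural parameter
    a = np.log(1 + np.exp(eta))       # log partition function

    for y in (0, 1):
        p_exp_family = np.exp(eta * y - a)
        p_bernoulli = phi ** y * (1 - phi) ** (1 - y)
        assert np.isclose(p_exp_family, p_bernoulli)
    print("Bernoulli matches its exponential-family form.")

Note that inverting η = log(φ/(1−φ)) gives φ = 1/(1 + e^(−η)): the mean of the Bernoulli, as a function of its natural parameter, is exactly the logistic function, which is why the sigmoid arises so naturally in logistic regression.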