**Math for Machine Learning**1


1. **Multivariate Derivatives** :silhouettes:

1.1. **Convexity**

1.1.1. **Derivative Condition** The Hessian Hf is called positive semi-definite if ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-acae100d-fc8e-493b-9291-afcb3cab77d4.png 134x32) holds. If Hf is positive semi-definite everywhere, then f is convex.

1.1.2. **A Warning** In more complex models, e.g. neural networks, the loss surface has many local minima and many saddle points, so these problems are not convex.

1.1.3. Opposite is Concave ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1772e010-d755-488e-b5c5-37686515ff9f.png 119x107)

1.1.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a24ee319-4fa9-4b7a-a0ef-dbe9f31d3b12.png 128x129) The function is convex if the line segment between any two points on its graph stays on or above the graph.

1.1.5. **Benefits** When a function is convex: * every local minimum is a global minimum (and it is **unique** if the function is strictly convex) * there are no strict local maxima * there are no saddle points * **Gradient descent** is guaranteed to find the global minimum with ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b326f4e0-4ce8-4eab-b582-a41fcdc3f086.png 20x29) small enough * Newton's Method generally works well
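As a rough illustration of the derivative condition, the sketch below checks positive semi-definiteness for a symmetric 2x2 Hessian. It assumes the usual definition (v^T Hf v >= 0 for every vector v), which for a symmetric 2x2 matrix is equivalent to a non-negative trace and a non-negative determinant. The function name and the example Hessians are illustrative choices, not part of the original map.

```python
def is_positive_semidefinite_2x2(h, tol=1e-12):
    """Return True if the symmetric 2x2 matrix h is positive semi-definite.

    For a symmetric 2x2 matrix, both eigenvalues are >= 0 exactly when
    the trace and the determinant are both >= 0.
    """
    (a, b), (c, d) = h
    if abs(b - c) > tol:
        raise ValueError("matrix must be symmetric")
    trace = a + d
    det = a * d - b * c
    return trace >= -tol and det >= -tol


# Hypothetical examples: constant Hessians of x^2 + y^2 and x^2 - y^2.
bowl_hessian = [[2.0, 0.0], [0.0, 2.0]]
saddle_hessian = [[2.0, 0.0], [0.0, -2.0]]

print(is_positive_semidefinite_2x2(bowl_hessian))    # convex case
print(is_positive_semidefinite_2x2(saddle_hessian))  # not convex
```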

1.2. **The Gradient**

1.2.1. **Definitions**

1.2.1.1. **Matrix derivative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-acdb3fb4-6aee-4aa9-b65b-1e8c8a1eb02b.png 150x69)

1.2.1.2. **Vector derivative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f08bf401-34a7-4428-ac05-a0b58c99dd3a.png 35x30) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2e4f1650-0561-456e-9219-9b30301904f1.png 114x30)

1.2.1.3. **The Gradient** collection of partial derivatives ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-0724b4f0-f3ec-41ee-9af0-6e0f8e102b06.png 150x47) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6fa44623-baeb-4fe1-9ccb-7466392a859c.png 65x46)

1.2.2. **Gradient Descent** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d60c02e7-bccb-4c18-bbcd-348047dbf20d.png 137x26) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a6fd63fe-b319-4eff-9c60-fd3b04f80a83.png 150x91)

1.2.3. **Level set** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-46a06e0a-2147-4a0b-96d0-e21ad5f2c2e0.png 140x150)

1.2.4. **Visualize** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-10a51bcf-dbd3-46b8-82e1-a17db9265916.png 150x125)

1.2.5. **Key Properties**

1.2.5.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3dbe8add-a456-40e6-b89c-b7c1fa83e4ff.png 21x22) points in the direction of **maximum increase**

1.2.5.2. -![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6feb620d-ede4-4163-862b-0ea20c1a86ee.png 21x22) points in the direction of **maximum decrease**

1.2.5.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-c7d4b6a1-fe6f-4723-a87e-cbff3a42c39c.png 52x46) at local max & min
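A minimal sketch of multivariate gradient descent, assuming the standard update "new point = old point - eta * gradient", which follows the direction of maximum decrease. The example function f(x, y) = x^2 + 3y^2, the starting point, and the step size are assumptions chosen for illustration.

```python
def grad_f(x, y):
    """Gradient of the example function f(x, y) = x**2 + 3*y**2."""
    return 2.0 * x, 6.0 * y


def gradient_descent_2d(grad, start, eta=0.1, steps=200):
    """Repeatedly step against the gradient, the direction of maximum decrease."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


minimum = gradient_descent_2d(grad_f, start=(4.0, -2.0))
print(minimum)  # close to (0, 0), where the gradient vanishes
```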

1.3. **Second Derivative**

1.3.1. **Hessian**

1.3.1.1. **2D intuition**

1.3.1.1.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-9125283b-447e-47ca-8025-1830d41d6acc.png 88x24) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4cd2a446-4d46-4077-82fa-0b4224cf9b4e.png 84x106) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-969428d0-2092-4b0f-aaae-281507412448.png 42x36)

1.3.1.1.2. **Critical points** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3dbe8add-a456-40e6-b89c-b7c1fa83e4ff.png 21x22) =0

1.3.1.1.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e4452973-0d41-494d-9d28-34b9f59b62ab.png 85x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-00f10beb-cf0c-4143-b8e6-0ff7a4a3898d.png 89x110) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-0cd0749f-0f07-4965-a6c6-a2a8ec2bbae2.png 51x36)

1.3.1.1.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-89dab9bd-745b-441a-9bc8-be12559cd26d.png 95x24) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2a2d6d79-ae71-4203-8430-4ddd1bfff260.png 93x114) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-55e62a18-7ba9-40ad-9c0a-cce1f6c5c423.png 60x36)

1.3.1.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b688a4ed-1871-421d-8855-3e17759ba6f3.png 105x74)

1.3.1.3. If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down
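To make the 2D intuition concrete, here is a hedged sketch that estimates a Hessian with central finite differences and classifies a critical point by the signs of its curvature. The three test functions (a bowl, a saddle, a dome) are assumed examples, since the formulas in the images above cannot be read from the text, and the helper names are hypothetical.

```python
def hessian_2d(f, x, y, h=1e-4):
    """Approximate the 2x2 Hessian of f at (x, y) with central differences."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return [[fxx, fxy], [fxy, fyy]]


def classify_critical_point(hess, tol=1e-6):
    """Second-derivative test for a 2x2 Hessian at a critical point."""
    (a, b), (_, d) = hess
    det = a * d - b * b
    if det > tol:
        return "local minimum" if a > 0 else "local maximum"
    if det < -tol:
        return "saddle point"
    return "inconclusive"


examples = {
    "bowl": lambda x, y: x**2 + y**2,
    "saddle": lambda x, y: x**2 - y**2,
    "dome": lambda x, y: -x**2 - y**2,
}
for name, func in examples.items():
    print(name, classify_critical_point(hessian_2d(func, 0.0, 0.0)))
```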

1.3.2. **Trace** sum of diagonal terms tr(Hf)

1.3.3. For ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-889fc2c8-fe35-4a5b-ac08-c076ca09d270.png 78x28), there is ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bbaa1cb8-093a-4924-b062-81b51ed68b72.png 23x26) many derivatives

1.3.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6851f3fd-89a7-412c-99cd-0234e3ca96cb.png 107x55)

1.4. If you have a function ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-889fc2c8-fe35-4a5b-ac08-c076ca09d270.png 78x28) whose input is an n-dimensional vector, then it is a function of many variables. You need to know how the function responds to changes in all of them. The majority of this is just bookkeeping, but it is terribly messy bookkeeping.

1.5. **Partial Derivatives** A partial derivative measures the rate of change of the function when one of the variables is subjected to a small change while the others are kept constant. **Example**: ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1fd6a19b-e26f-45d2-90f5-ed8b9358f86c.png 119x83)
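The definition can be checked numerically: nudge one variable, hold the others fixed, and measure the change. The sketch below uses a central difference; the function f(x, y) = x^2 * y + y^3 and the helper name are assumptions for illustration, not the example from the image.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative of f
    with respect to the variable at position `index`."""
    up = list(point)
    down = list(point)
    up[index] += h      # small change in one variable only
    down[index] -= h    # all other variables stay constant
    return (f(*up) - f(*down)) / (2 * h)


def f(x, y):
    return x**2 * y + y**3


# Analytically: df/dx = 2xy and df/dy = x^2 + 3y^2.
df_dx = partial_derivative(f, (1.0, 2.0), index=0)
df_dy = partial_derivative(f, (1.0, 2.0), index=1)
print(df_dx, df_dy)
```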

1.6. **Newton Method**

1.6.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4bb14722-f7fd-4382-8c44-f81635b1bf89.png 36x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f8363985-1416-4dfe-94fb-5f5c66dc10ce.png 150x22)

1.6.2. The exact computational complexity of inverting an *n x n* matrix is not known, but the best-known algorithms run in roughly *O(n^2.373)*. For high-dimensional data sets, anything beyond linear time in the number of dimensions is often impractical, so Newton's Method is reserved for problems with a few hundred dimensions at most.
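A small sketch of one multivariate Newton step in two dimensions, assuming the usual update "new point = old point - Hf^(-1) * gradient". Instead of forming the inverse, it solves the 2x2 linear system directly. The quadratic test function f(x, y) = x^2 + xy + 2y^2 - 3x and all helper names are hypothetical; for a quadratic, a single step lands on the minimum.

```python
def solve_2x2(m, rhs):
    """Solve m * s = rhs for a 2x2 matrix m using Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    r0, r1 = rhs
    return (d * r0 - b * r1) / det, (a * r1 - c * r0) / det


def newton_step(grad, hess, point):
    """One Newton update: subtract the solution of Hf * s = grad f."""
    sx, sy = solve_2x2(hess(*point), grad(*point))
    return point[0] - sx, point[1] - sy


def grad(x, y):
    return 2 * x + y - 3, x + 4 * y


def hess(x, y):
    return [[2.0, 1.0], [1.0, 4.0]]


new_point = newton_step(grad, hess, (5.0, 5.0))
print(new_point)  # approximately (12/7, -3/7)
```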

1.7. **Matrix Calculus** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-926be75a-5ea3-4ef0-b484-9b09bb39f651.png 89x81)

1.7.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3d516a44-b032-4d27-986f-cfdf59e1d5b2.png 69x46)

1.7.1.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3fc599e0-6754-4a3b-ba78-fd00ac10d498.png 56x30)

1.7.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d8a1ccab-ef13-4b9b-934d-d7edd83d73d9.png 78x46)

1.7.2.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-affc0ae4-4764-4014-a633-6febf7c00b79.png 70x22)

1.7.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-14e0fcad-3de4-4c4d-a07a-5d9b94449b95.png 65x46)

1.7.3.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-734b6525-4de8-4642-8803-5f415f296068.png 81x55)

1.7.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-69924399-1c5d-4217-a55f-041d183e7ebc.png 58x46)

1.7.4.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-734b6525-4de8-4642-8803-5f415f296068.png 81x55)

2. **Vectors** :arrow_right:

3. **Univariate Derivatives** :silhouette:

3.1. **Newton's method**

3.1.1. **Idea**

3.1.1.1. Minimizing f <---> f '(x)=0

3.1.1.1.1. Now look for an algorithm to find the zero of some function g(x)

3.1.1.1.2. Apply this algorithm to f '(x)

3.1.2. **Computing the Line**

3.1.2.1. Line through (x0, g(x0)) with slope g '(x0): y = g '(x0)(x - x0) + g(x0). Solve the equation y = 0.

3.1.2.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6ad06115-54b5-4a38-9423-8760bc0b773c.png 118x50)

3.1.3. **Relationship to Gradient Descent** Newton's method behaves like gradient descent with a learning rate that adapts to f(x) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-8920a196-9865-4006-a3b1-245cb4afe83c.png 51x48)

3.1.4. **Update Step for Zero Finding**

3.1.4.1. we want to find where **g(x)=0** and we start with some initial guess x0 and then iterate

3.1.4.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4109463b-fb42-42ae-987f-36e84e35f884.png 130x50)
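The zero-finding update can be sketched as follows, assuming the standard form x_next = x - g(x) / g'(x), which is what solving the tangent-line equation y = 0 gives. The example g(x) = x^2 - 2 (whose positive root is the square root of 2), the tolerance, and the function name are illustrative assumptions.

```python
def newton_root(g, g_prime, x0, tol=1e-12, max_iter=50):
    """Find x with g(x) = 0 by repeatedly following the tangent line to zero."""
    x = x0
    for _ in range(max_iter):
        slope = g_prime(x)
        if slope == 0:
            raise ZeroDivisionError("tangent line is horizontal; no update possible")
        x_next = x - g(x) / slope
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)  # approximately the square root of 2
```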

3.1.5. **Pictorially** g(x) x such that g(x)=0 ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-98745222-28b6-4771-b050-d8b2eb1ceed6.png 150x113)

3.1.6. **Update Step for Minimization**

3.1.6.1. To minimize **f**, we want to find where **f '(x)=0** and thus we may start with some initial guess x0 and then iterate Newton's Method on **f '** to get

3.1.6.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-137a3d56-ba2b-4e9d-ad35-8d1b53fe5148.png 122x50)
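For minimization, the same idea is applied to f ', giving (under the standard assumption) x_next = x - f '(x) / f ''(x). The sketch below uses f(x) = x - ln(x), which has its minimum at x = 1; the function, the starting point, and the helper name are hypothetical choices.

```python
def newton_minimize(f_prime, f_second, x0, tol=1e-12, max_iter=50):
    """Minimize f by running Newton's method on its derivative f'."""
    x = x0
    for _ in range(max_iter):
        curvature = f_second(x)
        if curvature == 0:
            raise ZeroDivisionError("second derivative is zero; step undefined")
        x_next = x - f_prime(x) / curvature
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


# f(x) = x - ln(x): f'(x) = 1 - 1/x and f''(x) = 1/x**2.
x_min = newton_minimize(lambda x: 1.0 - 1.0 / x, lambda x: 1.0 / x**2, x0=0.5)
print(x_min)  # approximately 1.0
```

Note that the step divides by f ''(x), so the effective step size shrinks where the curvature is large and grows where it is small.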

3.2. **Gradient Descent**

3.2.1. As simplistic as this is, **almost all** machine learning you have heard of uses **some version** of this in the **learning process**

3.2.2. **Goal**: Minimize f(x)

3.2.3. **Issues**:

3.2.3.1. * how to pick eta

3.2.3.2. * recall that an **improperly chosen** learning rate will cause the entire optimization procedure to either **fail** or **operate too slowly** to be of practical use. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-94d81415-80c6-43b6-8252-99648704e018.png 102x105) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-086e7c75-3a97-47df-ae67-05ed586a797c.png 103x105)

3.2.3.3. * Sometimes we can **circumvent** this issue.

3.2.4. **ALGORITHM**

3.2.4.1. **1.** Start with a guess x0

3.2.4.2. **2.** Iterate through ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bad60875-5b7e-4c05-bec0-67d23e0d26a6.png 127x29) where ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b326f4e0-4ce8-4eab-b582-a41fcdc3f086.png 21x30) is the learning rate

3.2.4.3. **3.** Stop after some condition is met * if the value of x doesn't change by more than 0.001 * a fixed number of steps * fancier things TBD
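The three steps above can be sketched as below, assuming the update in the image is the usual x_next = x - eta * f '(x). Both stopping rules from step 3 are included: a change smaller than 0.001 and a fixed cap on the number of steps. The example f(x) = (x - 3)^2 and the default values are assumptions for illustration.

```python
def gradient_descent(f_prime, x0, eta=0.1, tol=1e-3, max_steps=1000):
    """Return (x, steps_used) after running 1D gradient descent."""
    x = x0                                  # step 1: initial guess
    for step in range(1, max_steps + 1):
        x_next = x - eta * f_prime(x)       # step 2: move against the slope
        if abs(x_next - x) < tol:           # step 3a: x barely changed
            return x_next, step
        x = x_next
    return x, max_steps                     # step 3b: step budget used up


# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_final, steps_used = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final, steps_used)
```

With a much larger eta (for this function, anything above 1.0) the iterates overshoot and diverge, and with a tiny eta the loop needs many more steps, which is the learning-rate issue noted above.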

3.3. **Maximum Likelihood Estimation**

3.3.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-47756842-f681-48eb-bdff-099bf3c63a04.png 150x84)

3.3.2. Find **p** such that **P_p(D)**, the probability of the data D under parameter p, is maximized
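A hedged worked example: for a coin with unknown heads probability p, the likelihood of observing h heads and t tails is p^h * (1 - p)^t. The sketch maximizes the log-likelihood over a grid of candidate values of p. The coin-flip data and the grid search are assumptions chosen for illustration; the image above may use a different example.

```python
import math


def log_likelihood(p, heads, tails):
    """Log of p**heads * (1 - p)**tails for a coin with P(heads) = p."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def mle_grid_search(heads, tails, grid_size=1000):
    """Return the grid value of p in (0, 1) that maximizes the log-likelihood."""
    candidates = [k / grid_size for k in range(1, grid_size)]
    return max(candidates, key=lambda p: log_likelihood(p, heads, tails))


flips = "HHTHHTHHTH"  # hypothetical observed data D
heads = flips.count("H")
tails = flips.count("T")
p_hat = mle_grid_search(heads, tails)
print(p_hat, heads / len(flips))  # the maximizer matches the observed fraction of heads
```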

3.4. **Second Derivative**

3.4.1. **f''(x)** shows how the slope is changing

3.4.1.1. At a critical point (f '(x) = 0): **max** -> f '' < 0 **min** -> f '' > 0 **Can't tell** -> f '' = 0, proceed with higher derivatives

3.5. **Derivative**

3.5.1. Can be presented as: ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-cf493c11-d9bf-426c-bb3b-59e28698cde9.png 150x35)

3.5.2. **Interpretation**

3.5.2.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1be21ad6-d439-4af2-95b2-3e505684bea4.png 113x87)

3.5.2.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bfee6959-4d29-43bf-b73e-fbe765544e22.png 102x95) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-65184e01-0e67-4ac1-aa91-bc83ff410305.png 101x92)

3.5.2.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4947e399-c32f-4ab6-a6e9-66a2222fa29a.png 113x98)

3.5.3. let's approximate ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1d28e2c6-194d-4292-b30c-1d90bbcc142e.png 117x46)

3.5.4. better approximation ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-81731e0d-f6ab-4e31-a1a0-c45cd4621361.png 150x37) **f(x + ε) ≈ f(x) + f'(x)ε**
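The linear approximation can be checked numerically. The sketch below compares f(x + eps) with f(x) + f'(x) * eps for f = sin at x = 1; the choice of function and the helper names are illustrative assumptions. The error shrinks roughly like eps squared, which is why the approximation is good for small eps.

```python
import math


def linear_approximation(f, f_prime, x, eps):
    """First-order estimate of f(x + eps)."""
    return f(x) + f_prime(x) * eps


def approximation_error(f, f_prime, x, eps):
    """Absolute gap between the true value and the first-order estimate."""
    return abs(f(x + eps) - linear_approximation(f, f_prime, x, eps))


for eps in (1e-1, 1e-2, 1e-3):
    print(eps, approximation_error(math.sin, math.cos, 1.0, eps))
```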

3.5.5. **Rules**

3.5.5.1. **Chain Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-25259bc4-94a5-4864-89f1-b7b6f411004c.png 150x25)

3.5.5.1.1. **Alternative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-773992af-a800-4dfa-852c-b428b9644816.png 100x49)

3.5.5.2. **Product Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-51fefa56-ca87-4d1f-994b-1b4aa2148ab8.png 150x18)

3.5.5.3. **Sum Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a54bc0ae-c205-4ee8-b99b-789a5746be4e.png 150x24)

3.5.5.4. **Quotient Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d603949b-8f55-47f7-b465-66fc8cbbd7d8.png 150x34)

3.5.6. **Most usable**

3.5.6.1. http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html

3.5.6.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a718a9e9-08b7-42f6-812b-79b2ddd525cb.png 120x25)

3.5.6.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e88e2678-788f-4769-ba6b-96c914cf7c31.png 92x30)

3.5.6.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f900859c-cf26-4935-9dee-f7f73f2a1ac4.png 100x45)

4. **Matrices** :bookmark_tabs:

4.1. **A motivating example**

4.2. **Matrix multiplication and examples** ![noun_transform_476616](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-5140dde8-e85d-40da-bb7e-4389a79ffb86.png 77x77)

4.3. **Dot product and how to extract angles**

4.4. **Matrix product properties**

4.4.1. **Matrix Products**

4.4.1.1. **Distributivity** A(B+C) = AB + AC

4.4.1.2. **Associativity** A(BC)=(AB)C

4.4.1.3. **Non-commutativity** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-7f637e31-5d6e-4b6c-90fd-ae711a15d003.png 150x53) AB != BA in general

4.4.2. **The Identity Matrix** IA=A ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-dd1a7320-99ad-4293-8c4f-07de15b92153.png 150x37) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4e23b79f-3487-427c-9d6d-b2bd82a9ec78.png 130x73) All ones on the diagonal, zeros everywhere else

4.4.3. **Properties of the Hadamard Product**

4.4.3.1. **Distributivity** Ao(B+C) = AoB + AoC

4.4.3.2. **Associativity** Ao(BoC)=(AoB)oC

4.4.3.3. **Commutativity** AoB=BoA

4.5. **Hadamard product**

4.5.1. An (often less useful) method of multiplying matrices is element-wise **AoB**
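To contrast the two products, here is a short sketch with plain nested lists. It shows that the ordinary matrix product depends on the order of the factors while the element-wise (Hadamard) product does not. The matrices A and B and the function names are illustrative assumptions.

```python
def matmul(a, b):
    """Ordinary matrix product: rows of a combined with columns of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def hadamard(a, b):
    """Element-wise product of two matrices with the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(matmul(A, B), matmul(B, A))      # different: AB != BA
print(hadamard(A, B), hadamard(B, A))  # identical: AoB = BoA
```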

4.6. **Determinant computation**

4.6.1. **The Two-by-two** det(A)=ad-bc

4.6.2. **Larger Matrices** Cofactor expansion reduces an m x m determinant to m determinants of (m-1)x(m-1) matrices. A computer does it more simply, in O(m^3) time, using matrix factorizations
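One way to get the O(m^3) behaviour is Gaussian elimination with partial pivoting, which is the computation behind an LU factorization: reduce the matrix to upper-triangular form, multiply the pivots, and flip the sign for each row swap. The sketch below is a minimal pure-Python version under that assumption, not a replacement for a numerical library.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting, O(m^3)."""
    a = [[float(v) for v in row] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # no usable pivot: the matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # each row swap flips the sign
        det *= a[col][col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


print(determinant([[1, 2], [3, 4]]))                   # ad - bc = -2
print(determinant([[2, 0, 1], [1, 3, 2], [1, 1, 4]]))  # 18
```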

4.7. **Linear dependence**

4.7.1. det(A)=0 if and only if the columns of A are linearly dependent

4.7.2. **Definition** Vectors v1, ..., vk are linearly dependent (they lie in a lower-dimensional space) if there are coefficients a1, ..., ak, not all zero, such that a1\*v1+a2\*v2+...+ak\*vk=0

4.7.3. **Example** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2da29b28-dd84-4811-b8bb-24070f141d85.png 150x51) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-42ad9dd1-5490-4ab2-9c09-4c689756e96e.png 150x50) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-50d691d8-d256-4fa7-a001-9b599f607875.png 132x62) a1=1, a2=-2, a3=-1

4.8. **Geometry of matrix operations**

4.8.1. **Intuition from Two Dimensions**

4.8.1.1. Suppose A is a 2x2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a **stretching**, followed by a **skewing**, followed by a **rotation**

4.8.1.2. Any vector can be written as a sum of scalar multiples of two specific vectors ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-aa5176fb-454d-42e0-b2fc-d9a206466fd1.png 150x25) A applied to any vector ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-cfc5fa20-a978-48b8-abfa-3931fcdb5491.png 150x22)

4.8.2. **The Determinant** |det(A)| is the factor by which areas are multiplied. det(A) is negative if A flips the plane over

4.9. **Matrix invertibility**

4.9.1. **When can you invert?** A can be inverted if and only if det(A) != 0

4.9.2. **How to Compute the Inverse** A^(-1)*A=I ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1f5e6d91-4f62-4d80-9cac-09128507f6d8.png 150x42)
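For the two-by-two case, a common closed form is: swap a and d, negate b and c, and divide by det(A) = ad - bc. The sketch below assumes this formula (the image above may show the same or a different derivation) and uses exact fractions so that the check A^(-1)*A = I holds without rounding. The function names are hypothetical.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Inverse of [[a, b], [c, d]] as (1 / det) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is not invertible: det = 0")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
A_inv = inverse_2x2(A)
print(A_inv)
print(matmul_2x2(A_inv, A))  # the identity matrix
```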

5. **Probability** :game_die:

5.1. **Axioms of probability**

5.1.1. 1. The fraction of the times an event occurs is between 0 and 1.

5.1.2. 2. Something always happens.

5.1.3. 3. If two events can't happen at the same time (disjoint events), then the fraction of the time that _at least one_ of them occurs is the sum of the fraction of the time either one occurs separately.

5.2. **Terminology**

5.2.1. **Outcome** A single possibility from the experiment

5.2.2. **Sample Space** The set of all possible outcomes _Capital Omega_

5.2.3. **Event** Something you can observe with a yes/no answer _Capital E_

5.2.4. **Probability** Fraction of repeated experiments in which an event occurs _P{E} ∈ [0,1]_

5.3. **Visualizing Probability** using Venn diagram

5.3.1. **Inclusion/Exclusion** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e837d3e0-5b34-4f36-a10c-754c0bb9146d.png 150x17)

5.3.1.1. **Intersection** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-9a5a096a-9513-45ae-853c-5bd3e93cdbc0.png 53x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-5cdb3c35-a60a-43fd-8f56-c8c781b04eb0.png 73x53)

5.3.1.2. **Union** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2f080b42-1985-4232-9bd0-6c3ac92f5a5e.png 53x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-46a1d888-0c62-43a7-a545-06e3f4488c0e.png 73x53)

5.3.1.3. **Symmetric difference** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6275eec4-85dc-45d0-9883-cb7338ce1327.png 54x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a249b0c4-3c11-4e72-8418-c4ced49ff388.png 72x52)

5.3.1.4. **Relative complement** of A (left) in B (right) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-05fd132f-e057-4f10-b458-61537fbe6ef9.png 136x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d3287ec5-b393-439f-a004-2898cadcf490.png 78x57)

5.3.1.5. **Absolute complement** of A in U ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f225a65f-bff9-496e-933f-a54714c3127a.png 101x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e9d5313f-1035-4ef7-9588-2f8862e1e09e.png 91x66)

5.3.2. **General Picture** * Sample space <-> Region * Outcomes <-> Points * Events <-> Subregions * Disjoint events <-> Disjoint subregions * Probability <-> Area of subregion

5.4. **Conditional probability**

5.4.1. If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A

5.4.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-aad0631d-5583-4247-bd13-d9fb0191fd8a.png 150x41)

5.4.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f46e76b7-ecd0-45dc-8791-5a58a1403848.png 150x102)
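A small worked example of the area picture, assuming the standard formula P{A|B} = P{A n B} / P{B}. With equally likely outcomes, "area" is just a count of outcomes. The die events below are hypothetical examples.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Fraction of equally likely outcomes that belong to the event."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional_probability(a, b, sample_space):
    """P{A|B}: the share of B's area that is also occupied by A."""
    p_b = probability(b, sample_space)
    if p_b == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return probability(a & b, sample_space) / p_b


omega = set(range(1, 7))         # one roll of a fair die
even = {2, 4, 6}                 # event A
greater_than_three = {4, 5, 6}   # event B

p_a_given_b = conditional_probability(even, greater_than_three, omega)
print(probability(even, omega), p_a_given_b)
```

Knowing that the roll is greater than three raises the probability of an even roll from 1/2 to 2/3, because two of the three outcomes in B are even.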

5.5. **Intuition:** The probability of an event is the expected fraction of time that the outcome would occur with repeated experiments.

5.6. **Building machine learning models**

5.6.1. Maximum Likelihood Estimation

5.6.1.1. Given a probability model with some vector of parameters (Theta) and observed data **D**, the best fitting model is the one that maximizes the probability ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-affcf3cb-ad61-41bb-8c7c-8de459c5a1c4.png 82x46)

5.7. **Bayes’ rule**

5.7.1. can be leveraged to understand **competing hypotheses**

5.7.2. Odds are the ratio of two probabilities, e.g. 2/1

5.7.3. Posterior odds = (ratio of the probabilities of generating the data under each hypothesis) * prior odds

5.7.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-7d941247-87b2-4d0b-8e9d-63e821f25207.png 150x34)
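The odds form of Bayes' rule can be sketched with exact fractions. All numbers below are made up for illustration: prior odds of 1 to 99 for hypothesis H1 against H2, and data that is 18 times more likely under H1 (0.9 versus 0.05).

```python
from fractions import Fraction


def posterior_odds(prior_odds, p_data_given_h1, p_data_given_h2):
    """Posterior odds = likelihood ratio * prior odds."""
    likelihood_ratio = p_data_given_h1 / p_data_given_h2
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds in favour of H1 into the probability of H1."""
    return odds / (1 + odds)


prior = Fraction(1, 99)  # hypothetical prior odds for H1 against H2
post = posterior_odds(prior, Fraction(9, 10), Fraction(1, 20))
print(post, odds_to_probability(post))
```

Even with data that strongly favours H1, the posterior probability stays modest here because the prior odds were so low.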

5.8. **Independence**

5.8.1. Two events are **independent** if one event doesn't influence the other

5.8.2. A and B are independent if P{AnB}=P{A}*P{B}
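The product test can be verified by enumeration. The sketch below rolls two fair dice (36 equally likely outcomes) and checks P{AnB} = P{A}*P{B} exactly; the chosen events are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # all rolls of two fair dice


def prob(predicate):
    """Probability of the event described by predicate(outcome)."""
    return Fraction(sum(1 for o in OMEGA if predicate(o)), len(OMEGA))


def independent(pred_a, pred_b):
    """True when P{A n B} equals P{A} * P{B}."""
    p_both = prob(lambda o: pred_a(o) and pred_b(o))
    return p_both == prob(pred_a) * prob(pred_b)


def first_even(o):
    return o[0] % 2 == 0


def sum_is_7(o):
    return o[0] + o[1] == 7


def sum_is_8(o):
    return o[0] + o[1] == 8


print(independent(first_even, sum_is_7))  # independent
print(independent(first_even, sum_is_8))  # not independent
```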

5.9. **Chebyshev’s inequality**

5.9.1. For _any_ random variable X with finite variance (no other assumptions), at least 99% of the time ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-25816efe-42a4-4a34-9186-97c53bcd4eb0.png 150x18)

5.9.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-545e7155-d01f-4514-9102-77aa5f4f7fcc.png 150x34)
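An empirical check, assuming the inequality in its usual form: the fraction of values within k standard deviations of the mean is at least 1 - 1/k^2 (k = 10 gives the 99% figure). The exponential samples and the seed are arbitrary choices; any data set satisfies the bound when its own mean and population standard deviation are used.

```python
import random
import statistics


def fraction_within(samples, k):
    """Fraction of samples within k standard deviations of the sample mean."""
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    inside = sum(1 for x in samples if abs(x - mean) <= k * sigma)
    return inside / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.expovariate(1.0) for _ in range(10_000)]

for k in (2, 3, 10):
    bound = 1 - 1 / k**2
    print(k, fraction_within(samples, k), ">=", bound)
```

The observed fractions are usually far above the bound, which reflects how conservative the inequality is.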

5.10. **The Gaussian curve**

5.10.1. **Key Properties**

5.10.1.1. Central limit theorem

5.10.1.1.1. The central limit theorem states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately Gaussian and centered at the mean of the population.

5.10.1.2. Maximum entropy distribution

5.10.1.2.1. Amongst **all** continuous RVs with E[X]=0 and Var[X]=1, the entropy H(X) is maximized uniquely for X~N(0,1)

5.10.1.2.2. The Gaussian is the **most** random RV with fixed mean and variance

5.10.2. **General Gaussian Density**

5.10.2.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-34a45efb-e2b1-4238-9ca6-667bb760a735.png 129x51)

5.10.2.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3538803d-482b-480c-91bf-e984001413d0.png 150x112)

5.10.3. **Standard Gaussian (Normal Distribution) Density**

5.10.3.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2e469e96-9dad-4523-82c3-346b8f31fdf2.png 75x54)

5.10.3.2. E[X]=0 Var[X]=1

5.10.3.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b31f0a36-a204-4b3e-a955-b122da6c73a9.png 111x68)
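A minimal sketch of the density, assuming the standard formula exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)); with mu = 0 and sigma = 1 it reduces to the standard Gaussian. The parameter names mu and sigma are conventional but are not taken from the images.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


print(gaussian_pdf(0.0))                    # peak of the standard Gaussian
print(gaussian_pdf(5.0, mu=5.0, sigma=2.0)) # peak is lower when sigma is larger
```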

5.11. **Random variables**

5.11.1. A random variable is a function X that takes in an outcome and gives a number back

5.11.2. A discrete X takes at most countably many values, usually only a finite set of values

5.11.3. **Expected Value** mean ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-249c95e1-0bec-4e54-8088-9834018fc6bf.png 120x128)

5.11.4. **Variance** how far samples tend to spread from the mean ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-075ad656-7dc7-4300-acf4-6b1ab8b9ba9e.png 89x86)

5.11.5. **Standard Deviation** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3148c26a-bc95-4b46-846b-b525db5aff51.png 64x70)
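A worked example for a discrete random variable, assuming the standard definitions E[X] = sum of x * P{X = x}, Var[X] = E[(X - E[X])^2], and standard deviation = square root of the variance. The fair die is an illustrative choice.

```python
import math
from fractions import Fraction


def expected_value(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Average squared distance from the mean."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expected_value(die)   # 7/2
var = variance(die)          # 35/12
std = math.sqrt(var)         # square root of the variance, as a float
print(mean, var, std)
```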

5.12. **Entropy**

5.12.1. **Entropy** (*H*) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-42421fe4-e226-43f7-8c2e-43fcc9a8ecad.png 139x88)

5.12.2. **The Only Choice was the Units** First you need to choose the base for the logarithm; base 2 gives bits. To express entropy in another base, divide the entropy in bits by log2 of that base

5.12.3. Examples

5.12.3.1. **One coin** ***Entropy*** = one bit of randomness

5.12.3.1.1. H (1/2)

5.12.3.1.2. T (1/2)

5.12.3.2. **Two coins** ***Entropy*** = 2 bits of randomness

5.12.3.2.1. H

5.12.3.2.2. T

5.12.3.3. **A mixed case** ***Entropy*** = 1.5 bits of randomness = 1/2(1 bit) + 1/2(2 bits)

5.12.3.3.1. H (1/2)

5.12.3.3.2. T

5.12.4. **Examine the Trees** If we flip n coins, then each outcome has probability P = 1/2^n, so \# coin flips = -log2(P)
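The three examples can be reproduced with a short sketch, assuming the usual definition H = -sum of p * log2(p). For the mixed case, the distribution (1/2, 1/4, 1/4) is an assumption that is consistent with the stated 1.5 bits: one outcome needs one flip and the other two need two flips each.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits: the expected value of -log2(p)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


one_coin = [0.5, 0.5]
two_coins = [0.25, 0.25, 0.25, 0.25]
mixed_case = [0.5, 0.25, 0.25]  # assumed distribution for the mixed tree

print(entropy_bits(one_coin))    # 1 bit
print(entropy_bits(two_coins))   # 2 bits
print(entropy_bits(mixed_case))  # 1.5 bits

# Changing units: divide the entropy in bits by log2 of the new base.
entropy_nats = entropy_bits(one_coin) / math.log2(math.e)
print(entropy_nats)  # ln(2) nats
```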

5.13. **Continuous random variables**

5.13.1. For many applications ML works with **continuous random variables** (measurement with real numbers).

5.13.2. **Probability density function** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-8f52b038-736a-414e-9173-4166dff478f5.png 150x106)

5.13.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-af6a3858-40a8-4abb-bca7-f52119596669.png 87x150)
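For a continuous random variable, probabilities come from the area under the density. The sketch below integrates the standard Gaussian density over [-1, 1] with the trapezoid rule and compares the result with the closed form erf(1 / sqrt(2)). The choice of density, interval, and step count are assumptions for illustration.

```python
import math


def standard_normal_pdf(x):
    """Density of the standard Gaussian."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def probability_between(pdf, a, b, n=10_000):
    """Approximate P{a <= X <= b} as the area under pdf (trapezoid rule)."""
    h = (b - a) / n
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, n):
        total += pdf(a + i * h)
    return total * h


approx = probability_between(standard_normal_pdf, -1.0, 1.0)
exact = math.erf(1.0 / math.sqrt(2.0))
print(approx, exact)  # both close to 0.68
```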
