# **Math for Machine Learning**

## 1. **Multivariate Derivatives** :silhouettes:

### 1.1. **Convexity**

1.1.1. **Derivative Condition** The Hessian Hf is called positive semi-definite if ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-acae100d-fc8e-493b-9291-afcb3cab77d4.png 134x32) If Hf is positive semi-definite everywhere, then f is convex.

1.1.2. **A Warning** More complex models, e.g. neural networks, have many local minima and many saddle points, so their loss functions are not convex.

1.1.3. Opposite is Concave ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1772e010-d755-488e-b5c5-37686515ff9f.png 119x107)

1.1.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a24ee319-4fa9-4b7a-a0ef-dbe9f31d3b12.png 128x129) The function is convex if the line segment between any two points on its graph stays on or above the graph.
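The chord definition can be spot-checked numerically. The sketch below is only an illustration: the helper name `is_convex_on_samples`, the sample grid and the two test functions are assumptions, not part of the original notes. A sampled check can refute convexity but never prove it.

```python
def is_convex_on_samples(f, points, steps=11, tol=1e-9):
    """Test f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on all sampled pairs."""
    for x in points:
        for y in points:
            for i in range(steps):
                t = i / (steps - 1)
                on_graph = f(t * x + (1 - t) * y)      # value of f between x and y
                on_chord = t * f(x) + (1 - t) * f(y)   # value on the connecting line
                if on_graph > on_chord + tol:
                    return False                       # the graph rises above the chord
    return True

points = [i / 2 for i in range(-6, 7)]  # sample points in [-3, 3]
square_ok = is_convex_on_samples(lambda x: x * x, points)
cubic_ok = is_convex_on_samples(lambda x: x ** 3 - 3 * x, points)
print(square_ok, cubic_ok)
```

Here `x * x` passes every sampled chord test, while `x ** 3 - 3 * x` fails because it curves downward for negative x.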

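1.1.5. **Benefits** When a function is convex:

* every local minimum is a global minimum (and it is **unique** if the function is strictly convex)
* every critical point is a global minimum, so there are no local maxima or saddle points to get stuck in
* **Gradient descent** is guaranteed to approach the global minimum (when one exists) with ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b326f4e0-4ce8-4eab-b582-a41fcdc3f086.png 20x29) small enough
* Newton's Method is well behaved and typically converges quickly once it is close to the minimum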

1.2.1. **Definitions**

1.2.1.1. **Matrix derivative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-acdb3fb4-6aee-4aa9-b65b-1e8c8a1eb02b.png 150x69)

1.2.1.2. **Vector derivative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f08bf401-34a7-4428-ac05-a0b58c99dd3a.png 35x30) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2e4f1650-0561-456e-9219-9b30301904f1.png 114x30)

1.2.1.3. **The Gradient** collection of partial derivatives ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-0724b4f0-f3ec-41ee-9af0-6e0f8e102b06.png 150x47) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6fa44623-baeb-4fe1-9ccb-7466392a859c.png 65x46)

1.2.2. **Gradient Descent** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d60c02e7-bccb-4c18-bbcd-348047dbf20d.png 137x26) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a6fd63fe-b319-4eff-9c60-fd3b04f80a83.png 150x91)
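A minimal sketch of multivariate gradient descent, assuming the usual update "new x = x - eta * gradient". The function `gradient_descent`, the quadratic test function and the values of `eta` and `steps` are illustrative choices, not taken from the notes.

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]  # move in the direction of maximum decrease
    return x

# Assumed example: f(x, y) = (x - 1)^2 + 2 * (y + 2)^2, minimized at (1, -2)
def grad_f(v):
    x, y = v
    return [2 * (x - 1), 4 * (y + 2)]

minimum = gradient_descent(grad_f, [0.0, 0.0])
print(minimum)
```

Because this example is convex, a small enough `eta` brings the iterates close to the single minimum at (1, -2).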

1.2.4. **Visualize** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-10a51bcf-dbd3-46b8-82e1-a17db9265916.png 150x125)

1.2.5. **Key Properties**

1.2.5.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3dbe8add-a456-40e6-b89c-b7c1fa83e4ff.png 21x22) points in the direction of **maximum increase**

1.2.5.2. -![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6feb620d-ede4-4163-862b-0ea20c1a86ee.png 21x22) points in the direction of **maximum decrease**

1.2.5.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-c7d4b6a1-fe6f-4723-a87e-cbff3a42c39c.png 52x46) at local max & min

### 1.3. **Second Derivative**

1.3.1. **Hessian**

1.3.1.1. **2D intuition**

1.3.1.1.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-9125283b-447e-47ca-8025-1830d41d6acc.png 88x24) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4cd2a446-4d46-4077-82fa-0b4224cf9b4e.png 84x106) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-969428d0-2092-4b0f-aaae-281507412448.png 42x36)

1.3.1.1.2. **Critical points** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3dbe8add-a456-40e6-b89c-b7c1fa83e4ff.png 21x22) =0

1.3.1.1.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e4452973-0d41-494d-9d28-34b9f59b62ab.png 85x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-00f10beb-cf0c-4143-b8e6-0ff7a4a3898d.png 89x110) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-0cd0749f-0f07-4965-a6c6-a2a8ec2bbae2.png 51x36)

1.3.1.1.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-89dab9bd-745b-441a-9bc8-be12559cd26d.png 95x24) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2a2d6d79-ae71-4203-8430-4ddd1bfff260.png 93x114) Hf= ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-55e62a18-7ba9-40ad-9c0a-cce1f6c5c423.png 60x36)

1.3.1.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b688a4ed-1871-421d-8855-3e17759ba6f3.png 105x74)

1.3.1.3. If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down

1.3.2. **Trace** sum of diagonal terms tr(Hf)
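For a symmetric 2x2 Hessian, the determinant and trace are enough to classify a critical point. The sketch below assumes the standard second-derivative test; the function name and the three example Hessians (for x^2 + y^2, -x^2 - y^2 and x^2 - y^2) are illustrative assumptions.

```python
def classify_critical_point(hessian):
    """Classify a critical point from a symmetric 2x2 Hessian [[a, b], [b, d]]."""
    (a, b), (_, d) = hessian
    det = a * d - b * b
    trace = a + d
    if det < 0:
        return "saddle point"   # curves up in one direction, down in another
    if det > 0:
        # both curvatures share a sign; the trace tells us which one
        return "local minimum" if trace > 0 else "local maximum"
    return "inconclusive"       # det == 0: the test gives no answer

bowl = classify_critical_point([[2, 0], [0, 2]])      # f = x^2 + y^2
dome = classify_critical_point([[-2, 0], [0, -2]])    # f = -x^2 - y^2
saddle = classify_critical_point([[2, 0], [0, -2]])   # f = x^2 - y^2
print(bowl, dome, saddle)
```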

1.3.3. For ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-889fc2c8-fe35-4a5b-ac08-c076ca09d270.png 78x28), there is ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bbaa1cb8-093a-4924-b062-81b51ed68b72.png 23x26) many derivatives

1.3.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6851f3fd-89a7-412c-99cd-0234e3ca96cb.png 107x55)

### 1.6. **Newton Method**

1.6.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4bb14722-f7fd-4382-8c44-f81635b1bf89.png 36x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f8363985-1416-4dfe-94fb-5f5c66dc10ce.png 150x22)

1.6.2. The exact computational complexity of inverting an *n x n* matrix is not known; the best-known algorithms run in roughly *O(n^2.373)*. For high dimensional data sets, anything past linear time in the dimension is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
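A minimal two-dimensional sketch of one Newton step, assuming the standard update "new x = x - (Hf)^(-1) * gradient". The helper `newton_step_2d` and the quadratic example are assumptions made for illustration; the 2x2 inverse is written out by hand, which is exactly the part that becomes expensive in high dimensions.

```python
def newton_step_2d(grad, hess, x):
    """One Newton step in two dimensions: x - H^(-1) * gradient."""
    g1, g2 = grad(x)
    (a, b), (c, d) = hess(x)
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular, Newton step undefined")
    # H^(-1) * g using the closed-form inverse of a 2x2 matrix
    s1 = (d * g1 - b * g2) / det
    s2 = (-c * g1 + a * g2) / det
    return [x[0] - s1, x[1] - s2]

# Assumed example: f(x, y) = x^2 + x*y + 2*y^2 - 3*x
grad_f = lambda v: [2 * v[0] + v[1] - 3, v[0] + 4 * v[1]]
hess_f = lambda v: [[2, 1], [1, 4]]

x_new = newton_step_2d(grad_f, hess_f, [0.0, 0.0])
print(x_new)
```

For a quadratic function like this one, a single Newton step lands on the minimum (12/7, -3/7), since the quadratic model is exact.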

### 1.7. **Matrix Calculus** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-926be75a-5ea3-4ef0-b484-9b09bb39f651.png 89x81)

1.7.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3d516a44-b032-4d27-986f-cfdf59e1d5b2.png 69x46)

1.7.1.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3fc599e0-6754-4a3b-ba78-fd00ac10d498.png 56x30)

1.7.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d8a1ccab-ef13-4b9b-934d-d7edd83d73d9.png 78x46)

1.7.2.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-affc0ae4-4764-4014-a633-6febf7c00b79.png 70x22)

1.7.3.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-734b6525-4de8-4642-8803-5f415f296068.png 81x55)

1.7.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-69924399-1c5d-4217-a55f-041d183e7ebc.png 58x46)

1.7.4.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-734b6525-4de8-4642-8803-5f415f296068.png 81x55)

## 3. **Univariate Derivatives** :silhouette:

### 3.1. **Newton's method**

3.1.1. **Idea**

3.1.1.1. Minimizing f <---> f '(x)=0

3.1.1.1.1. Now look for an algorithm to find the zero of some function g(x)

3.1.1.1.2. Apply this algorithm to f '(x)

3.1.2. **Computing the Line**

3.1.2.1. The tangent line through (x0, g(x0)) with slope g '(x0) is y = g '(x0)(x - x0) + g(x0). Solve the equation y = 0 for x.

3.1.3. **Relationship to Gradient Descent** Newton's method can be viewed as gradient descent with a learning rate that adapts to f(x) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-8920a196-9865-4006-a3b1-245cb4afe83c.png 51x48)

3.1.4. **Update Step for Zero Finding**

3.1.4.1. we want to find where **g(x)=0** and we start with some initial guess x0 and then iterate

3.1.4.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4109463b-fb42-42ae-987f-36e84e35f884.png 130x50)
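The zero-finding iteration can be sketched as follows, assuming the standard Newton update "next x = x - g(x) / g '(x)", which is the zero of the tangent line. The function name `newton_root`, the tolerance and the example g(x) = x^2 - 2 are illustrative assumptions.

```python
def newton_root(g, dg, x0, tol=1e-10, max_steps=50):
    """Find x with g(x) = 0 by repeatedly jumping to the zero of the tangent line."""
    x = x0
    for _ in range(max_steps):
        slope = dg(x)
        if slope == 0:
            raise ZeroDivisionError("tangent line is horizontal, no zero to jump to")
        x_next = x - g(x) / slope
        if abs(x_next - x) < tol:   # the guess has stopped moving
            return x_next
        x = x_next
    return x

# Assumed example: the positive zero of g(x) = x^2 - 2 is sqrt(2)
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```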

3.1.5. **Pictorially** g(x) x such that g(x)=0 ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-98745222-28b6-4771-b050-d8b2eb1ceed6.png 150x113)

3.1.6. **Update Step for Minimization**

3.1.6.1. To minimize **f**, we want to find where **f '(x)=0**, so we start with some initial guess x0 and iterate Newton's Method on **f '** to get x_(n+1) = x_n - f '(x_n) / f ''(x_n)

3.2.1. As simplistic as gradient descent is, **almost all** machine learning you have heard of uses **some version** of it in the **learning process**
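Applying Newton's method to f ' gives a minimizer that uses both the first and second derivative. This is a sketch under that assumption; `newton_minimize` and the example f(x) = x^4 / 4 - x (with its minimum at x = 1) are chosen only for illustration.

```python
def newton_minimize(df, d2f, x0, tol=1e-10, max_steps=50):
    """Look for f'(x) = 0 by running Newton's method on the derivative df."""
    x = x0
    for _ in range(max_steps):
        curvature = d2f(x)
        if curvature == 0:
            raise ZeroDivisionError("second derivative is zero, step undefined")
        x_next = x - df(x) / curvature   # step size adapts to the local curvature
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Assumed example: f(x) = x^4 / 4 - x, so f'(x) = x^3 - 1 and f''(x) = 3 x^2
x_min = newton_minimize(lambda x: x ** 3 - 1, lambda x: 3 * x ** 2, x0=2.0)
print(x_min)
```

Note that the iteration only finds a point where f ' is zero; checking that f '' is positive there confirms it is a minimum.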

3.2.2. **Goal**: Minimize f(x)

3.2.3. **Issues**:

3.2.3.1. * how to pick eta

3.2.3.2. * recall that an **improperly chosen** learning rate will cause the entire optimization procedure to either **fail** or **operate too slowly** to be of practical use. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-94d81415-80c6-43b6-8252-99648704e018.png 102x105) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-086e7c75-3a97-47df-ae67-05ed586a797c.png 103x105)

3.2.3.3. * Sometimes we can **circumvent** this issue.

3.2.4. **ALGORITHM**

3.2.4.2. **2.** Iterate through ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bad60875-5b7e-4c05-bec0-67d23e0d26a6.png 127x29) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b326f4e0-4ce8-4eab-b582-a41fcdc3f086.png 21x30) - is learning rate

3.2.4.3. **3.** Stop after some condition is met:

* if the value of x doesn't change by more than 0.001
* a fixed number of steps
* fancier things TBD
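The algorithm and its stopping conditions can be sketched in a few lines. The names `gradient_descent_1d`, `tol` and `max_steps`, and the example f(x) = (x - 3)^2, are assumptions made for illustration.

```python
def gradient_descent_1d(df, x0, eta=0.1, tol=0.001, max_steps=1000):
    """Iterate x <- x - eta * f'(x) until x barely moves or the step budget runs out."""
    x = x0
    for step in range(1, max_steps + 1):
        x_next = x - eta * df(x)
        if abs(x_next - x) < tol:   # stopping rule: x changed by less than tol
            return x_next, step
        x = x_next
    return x, max_steps             # stopping rule: fixed number of steps

# Assumed example: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
df = lambda x: 2 * (x - 3)
x_good, steps_used = gradient_descent_1d(df, x0=0.0, eta=0.1)
x_bad, _ = gradient_descent_1d(df, x0=0.0, eta=1.1, max_steps=50)
print(x_good, steps_used, x_bad)
```

With `eta=0.1` the iterates settle near 3 well within the step budget; with `eta=1.1` each step overshoots by more than it corrects, so the iterates move away from the minimum, which is the failure mode of an improperly chosen learning rate.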

### 3.3. **Maximum Likelihood Estimation**

3.3.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-47756842-f681-48eb-bdff-099bf3c63a04.png 150x84)

3.3.2. find **p** such that **Pp(D)** is maximized
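As a hedged illustration of maximizing P_p(D), assume the data D is a sequence of independent coin flips with unknown heads probability p (this coin model, the counts and the grid search are assumptions, not taken from the notes). The log-likelihood is maximized at the observed fraction of heads.

```python
import math

def log_likelihood(p, heads, tails):
    """Log of P_p(D) for independent flips: p^heads * (1 - p)^tails."""
    return heads * math.log(p) + tails * math.log(1 - p)

heads, tails = 7, 3                            # assumed data D
grid = [i / 1000 for i in range(1, 1000)]      # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: log_likelihood(p, heads, tails))
print(p_hat, heads / (heads + tails))
```

Setting the derivative of the log-likelihood to zero gives the same answer in closed form, p = heads / (heads + tails).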

### 3.4. **Second Derivative**

3.4.1. **f''(x)** shows how the slope is changing

3.4.1.1. At a critical point (f '(x) = 0): **max** -> f '' < 0; **min** -> f '' > 0; **Can't tell** -> f '' = 0, proceed with higher derivatives

### 3.5. **Derivative**

3.5.1. Can be presented as: ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-cf493c11-d9bf-426c-bb3b-59e28698cde9.png 150x35)

3.5.2. **Interpretation**

3.5.2.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-bfee6959-4d29-43bf-b73e-fbe765544e22.png 102x95) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-65184e01-0e67-4ac1-aa91-bc83ff410305.png 101x92)

3.5.2.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4947e399-c32f-4ab6-a6e9-66a2222fa29a.png 113x98)

3.5.3. let's approximate ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1d28e2c6-194d-4292-b30c-1d90bbcc142e.png 117x46)

3.5.4. better approximation ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-81731e0d-f6ab-4e31-a1a0-c45cd4621361.png 150x37) **f(x + ε) ≈ f(x) + f'(x)ε** for small ε
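Both ideas, the derivative as a limit of difference quotients and the linear approximation built from it, can be tried numerically. The helper names, the step sizes and the example f(x) = x^2 are assumptions for illustration.

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) by the difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def linear_approx(f, x, eps):
    """Approximate f(x + eps) by f(x) + f'(x) * eps."""
    return f(x) + derivative(f, x) * eps

f = lambda x: x * x              # assumed example, f'(x) = 2x
slope = derivative(f, 3.0)       # close to 6
guess = linear_approx(f, 3.0, 0.1)
exact = f(3.1)
print(slope, guess, exact)
```

The approximation error shrinks like the square of `eps`, so halving `eps` roughly quarters the gap between `guess` and `exact`.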

3.5.5. **Rules**

3.5.5.1. **Chain Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-25259bc4-94a5-4864-89f1-b7b6f411004c.png 150x25)

3.5.5.1.1. **Alternative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-773992af-a800-4dfa-852c-b428b9644816.png 100x49)

3.5.5.2. **Product Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-51fefa56-ca87-4d1f-994b-1b4aa2148ab8.png 150x18)

3.5.5.3. **Sum Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a54bc0ae-c205-4ee8-b99b-789a5746be4e.png 150x24)

3.5.5.4. **Quotient Rule** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d603949b-8f55-47f7-b465-66fc8cbbd7d8.png 150x34)

3.5.6. **Most usable**

3.5.6.1. http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html

3.5.6.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a718a9e9-08b7-42f6-812b-79b2ddd525cb.png 120x25)

3.5.6.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e88e2678-788f-4769-ba6b-96c914cf7c31.png 92x30)

3.5.6.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f900859c-cf26-4935-9dee-f7f73f2a1ac4.png 100x45)

## 4. **Matrices** :bookmark_tabs:

### 4.4. **Matrix product properties**

4.4.1. **Matrix Products**

4.4.1.1. **Distributivity** A(B+C) = AB + AC

4.4.1.2. **Associativity** A(BC)=(AB)C

4.4.1.3. **Not commutative** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-7f637e31-5d6e-4b6c-90fd-ae711a15d003.png 150x53) AB != BA in general

4.4.2. **The Identity Matrix** IA=A ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-dd1a7320-99ad-4293-8c4f-07de15b92153.png 150x37) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-4e23b79f-3487-427c-9d6d-b2bd82a9ec78.png 130x73) All ones on the diagonal, zeros everywhere else

4.4.3. **Properties of the Hadamard Product**

4.4.3.1. **Distributivity** Ao(B+C) = AoB + AoC

4.4.3.2. **Associativity** Ao(BoC) = (AoB)oC

4.4.3.3. **Commutativity** AoB = BoA

4.5.1. An (often less useful) method of multiplying matrices is element-wise **AoB** (the Hadamard product)
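The difference between the two products is easy to see on small matrices. This is a plain-Python sketch; the helper names `matmul` and `hadamard` and the matrices `A` and `B` are assumptions chosen for illustration.

```python
def matmul(A, B):
    """Ordinary matrix product: rows of A combined with columns of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def hadamard(A, B):
    """Element-wise product of two matrices of the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(matmul(A, B))    # [[2, 1], [4, 3]]
print(matmul(B, A))    # [[3, 4], [1, 2]]
print(hadamard(A, B))  # [[0, 2], [3, 0]]
```

Swapping the factors changes the matrix product but leaves the element-wise product unchanged.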

### 4.6. **Determinant computation**

4.6.2. **Larger Matrices** Cofactor expansion reduces an m x m determinant to m determinants of (m-1)x(m-1) matrices. A computer does it more simply, in O(m^3) time, using matrix factorizations.

### 4.7. **Linear dependence**

4.7.1. det(A)=0 if and only if the columns of A are linearly dependent

4.7.2. **Definition** Vectors v1, ..., vk are linearly dependent (they lie in a lower dimensional space) if there are some a-s, not all zero, such that a1\*v1+a2\*v2+...+ak\*vk=0

4.7.3. **Example** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2da29b28-dd84-4811-b8bb-24070f141d85.png 150x51) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-42ad9dd1-5490-4ab2-9c09-4c689756e96e.png 150x50) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-50d691d8-d256-4fa7-a001-9b599f607875.png 132x62) a1=1, a2=-2, a3=-1
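A short sketch tying the determinant to linear dependence. It uses cofactor expansion along the first row, which is fine for tiny matrices but far slower than the factorization-based methods a computer would normally use. The function name `det` and both example matrices are assumptions, not the matrices pictured above.

```python
def det(M):
    """Determinant by cofactor expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for j in range(n):
        # minor: drop row 0 and column j
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

independent = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
dependent = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # column1 - 2*column2 + column3 = 0

print(det(independent))  # -3
print(det(dependent))    # 0
```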

### 4.8. **Geometry of matrix operations**

4.8.1. **Intuition from Two Dimensions**

4.8.1.1. Suppose A is 2x2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a **stretching**, followed by a **skewing**, followed by a **rotation**

4.8.1.2. Any vector can be written as a sum of scalar multiples of two specific vectors ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-aa5176fb-454d-42e0-b2fc-d9a206466fd1.png 150x25) A applied to any vector ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-cfc5fa20-a978-48b8-abfa-3931fcdb5491.png 150x22)

4.8.2. **The Determinant** det(A) is the factor by which area is multiplied. det(A) is negative if A flips the plane over.

### 4.9. **Matrix invertibility**

4.9.1. **When can you invert?** A can be inverted only when det(A) != 0

4.9.2. **How to Compute the Inverse** A^(-1)*A=I ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-1f5e6d91-4f62-4d80-9cac-09128507f6d8.png 150x42)
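For a 2x2 matrix the inverse has a simple closed form: swap the diagonal entries, negate the off-diagonal entries, and divide by the determinant. The sketch below assumes that standard formula; the function name and example matrix are illustrative, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

def inverse_2x2(M):
    """Inverse of [[a, b], [c, d]] as (1 / det) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible (det = 0)")
    det = Fraction(det)
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 1], [5, 3]]          # det = 1
A_inv = inverse_2x2(A)

# Check A^(-1) * A = I entry by entry
product = [[sum(A_inv[i][k] * A[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(A_inv, product)
```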

## 5. **Probability** :game_die:

### 5.1. **Axioms of probability**

5.1.1. 1. The fraction of the times an event occurs is between 0 and 1.

5.1.2. 2. Something always happens.

5.1.3. 3. If two events can't happen at the same time (disjoint events), then the fraction of the time that _at least one_ of them occurs is the sum of the fraction of the time either one occurs separately.

### 5.2. **Terminology**

5.2.1. **Outcome** A single possibility from the experiment

5.2.2. **Sample Space** The set of all possible outcomes _Capital Omega_

5.2.3. **Event** Something you can observe with a yes/no answer _Capital E_

5.2.4. **Probability** Fraction of repeated experiments in which the event occurs _P{E} ∈ [0,1]_

### 5.3. **Visualizing Probability** using Venn diagram

5.3.1. **Inclusion/Exclusion** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e837d3e0-5b34-4f36-a10c-754c0bb9146d.png 150x17)

5.3.1.1. **Intersection** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-9a5a096a-9513-45ae-853c-5bd3e93cdbc0.png 53x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-5cdb3c35-a60a-43fd-8f56-c8c781b04eb0.png 73x53)

5.3.1.2. **Union** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-2f080b42-1985-4232-9bd0-6c3ac92f5a5e.png 53x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-46a1d888-0c62-43a7-a545-06e3f4488c0e.png 73x53)

5.3.1.3. **Symmetric difference** of two sets ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-6275eec4-85dc-45d0-9883-cb7338ce1327.png 54x17) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-a249b0c4-3c11-4e72-8418-c4ced49ff388.png 72x52)

5.3.1.4. **Relative complement** of A (left) in B (right) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-05fd132f-e057-4f10-b458-61537fbe6ef9.png 136x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-d3287ec5-b393-439f-a004-2898cadcf490.png 78x57)

5.3.1.5. **Absolute complement** of A in U ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f225a65f-bff9-496e-933f-a54714c3127a.png 101x23) ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-e9d5313f-1035-4ef7-9588-2f8862e1e09e.png 91x66)

5.3.2. **General Picture**

* Sample Space <-> Region
* Outcomes <-> Points
* Events <-> Subregion
* Disjoint events <-> Disjoint subregions
* Probability <-> Area of subregion

### 5.4. **Conditional probability**

5.4.1. If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A

5.4.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-f46e76b7-ecd0-45dc-8791-5a58a1403848.png 150x102)
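The "fraction of the area of B occupied by A" picture corresponds to P{A|B} = P{A n B} / P{B}. A minimal sketch by counting outcomes, assuming two fair dice as the experiment (the events and helper names are illustrative choices):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # sample space: 36 equally likely rolls

def prob(event):
    """Fraction of outcomes for which the event (a yes/no function) holds."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(a, b):
    """P{A | B} = P{A n B} / P{B}."""
    return prob(lambda o: a(o) and b(o)) / prob(b)

sum_is_8 = lambda o: o[0] + o[1] == 8
first_is_even = lambda o: o[0] % 2 == 0

print(prob(sum_is_8))                        # 5/36
print(conditional(sum_is_8, first_is_even))  # 1/6
```

Knowing that the first die is even raises the probability of a total of 8 from 5/36 to 1/6.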

### 5.6. **Building machine learning models**

5.6.1. Maximum Likelihood Estimation

5.6.1.1. Given a probability model with some vector of parameters (Theta) and observed data **D**, the best fitting model is the one that maximizes the probability ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-affcf3cb-ad61-41bb-8c7c-8de459c5a1c4.png 82x46)

### 5.7. **Bayes’ rule**

5.7.1. can be leveraged to understand **competing hypotheses**

5.7.2. Odds are a ratio of two probabilities, e.g. 2/1

5.7.3. Posterior odds = likelihood ratio (the ratio of the probabilities of generating the data under each hypothesis) * prior odds

5.7.4. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-7d941247-87b2-4d0b-8e9d-63e821f25207.png 150x34)
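A small worked example of the odds form of Bayes' rule, under assumed numbers: two competing hypotheses about a coin (heads probability 3/4 versus 1/2), even prior odds, and 8 heads in 10 flips. None of these values come from the notes.

```python
from fractions import Fraction

def likelihood(p_heads, heads, tails):
    """Probability of one particular sequence with the given counts."""
    return p_heads ** heads * (1 - p_heads) ** tails

heads, tails = 8, 2                      # assumed observed data
prior_odds = Fraction(1, 1)              # biased : fair, assumed equally plausible

likelihood_ratio = (likelihood(Fraction(3, 4), heads, tails)
                    / likelihood(Fraction(1, 2), heads, tails))
posterior_odds = likelihood_ratio * prior_odds
print(posterior_odds, float(posterior_odds))
```

The data shift the odds from 1/1 to 6561/1024, roughly 6.4 to 1 in favour of the biased-coin hypothesis.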

### 5.8. **Independence**

5.8.1. Two events are **independent** if one event doesn't influence the other

5.8.2. A and B are independent if P{AnB}=P{A}*P{B}
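The product rule for independence can be checked by enumeration. This sketch assumes two fair dice; the three events are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(a, b):
    """A and B are independent exactly when P{A n B} = P{A} * P{B}."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

first_is_even = lambda o: o[0] % 2 == 0
second_is_even = lambda o: o[1] % 2 == 0
sum_is_8 = lambda o: o[0] + o[1] == 8

print(independent(first_is_even, second_is_even))  # True: the dice don't affect each other
print(independent(first_is_even, sum_is_8))        # False: the first die constrains the sum
```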

### 5.9. **Chebyshev’s inequality**

5.9.1. For _any_ random variable X with finite variance (no other assumptions), at least 99% of the time ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-25816efe-42a4-4a34-9186-97c53bcd4eb0.png 150x18)

5.9.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-545e7155-d01f-4514-9102-77aa5f4f7fcc.png 150x34)
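Chebyshev's inequality in its usual form says that at least 1 - 1/k^2 of the probability lies within k standard deviations of the mean. The sketch below checks that bound on a skewed sample; the exponential distribution, the seed and the sample size are arbitrary assumptions.

```python
import random
import statistics

def fraction_within(samples, k):
    """Fraction of samples within k standard deviations of their mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    inside = sum(1 for x in samples if abs(x - mu) <= k * sigma)
    return inside / len(samples)

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(10_000)]   # skewed, clearly non-Gaussian

for k in (2, 3, 10):
    bound = 1 - 1 / k ** 2
    print(k, fraction_within(samples, k), ">=", bound)
```

With k = 10 the bound is 99%, however the data are distributed. For most distributions the actual fraction is much higher than the bound.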

### 5.10. **The Gaussian curve**

5.10.1. **Key Properties**

5.10.1.1. Central limit theorem

5.10.1.1.1. states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately Gaussian, centered at the mean of the population.

5.10.1.2. Maximum entropy distribution

5.10.1.2.1. Amongst **all** continuous RVs with E[X]=0 and Var[X]=1, the entropy H(X) is maximized uniquely for X~N(0,1)

5.10.1.2.2. Gaussian is the **most** random RV with fixed mean and variance

5.10.2. **General Gaussian Density**

5.10.2.1. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-34a45efb-e2b1-4238-9ca6-667bb760a735.png 129x51)

5.10.2.2. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3538803d-482b-480c-91bf-e984001413d0.png 150x112)

5.10.3. **Standard Gaussian (Normal Distribution) Density**

5.10.3.2. E[X]=0 Var[X]=1

5.10.3.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-b31f0a36-a204-4b3e-a955-b122da6c73a9.png 111x68)
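The Gaussian density can be written directly from its standard formula, exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)). In this sketch the function name, the integration range and the step size are assumptions; the crude trapezoid sum is only there to confirm that the density integrates to about 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, n=1600):
    """Trapezoid rule on n equal slices of [lo, hi]."""
    h = (hi - lo) / n
    inner = sum(f(lo + i * h) for i in range(1, n))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))

peak = gaussian_pdf(0.0)                       # height of the standard Gaussian at 0
total = integrate(gaussian_pdf, -8.0, 8.0)     # area under the curve
one_sigma = integrate(gaussian_pdf, -1.0, 1.0) # probability of landing within one sigma
print(peak, total, one_sigma)
```

Probabilities for a continuous random variable come from areas under the density like these, not from the height of the curve at a point.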

### 5.11. **Random variables**

5.11.1. A random variable is a function X that takes in an outcome and gives a number back

5.11.2. A discrete X takes at most countably many values, usually only a finite set of values

5.11.3. **Expected Value** mean ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-249c95e1-0bec-4e54-8088-9834018fc6bf.png 120x128)

5.11.4. **Variance** how close to the mean are samples ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-075ad656-7dc7-4300-acf4-6b1ab8b9ba9e.png 89x86)

5.11.5. **Standard Deviation** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-3148c26a-bc95-4b46-846b-b525db5aff51.png 64x70)
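For a discrete random variable these three quantities are weighted sums. The sketch assumes the usual definitions E[X] = sum of x * P{X=x} and Var[X] = E[(X - E[X])^2], with the standard deviation as its square root; the fair die is an illustrative example.

```python
import math
from fractions import Fraction

def expected_value(dist):
    """E[X] for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def standard_deviation(dist):
    return math.sqrt(variance(dist))

die = {face: Fraction(1, 6) for face in range(1, 7)}   # assumed example: one fair die
print(expected_value(die))      # 7/2
print(variance(die))            # 35/12
print(standard_deviation(die))
```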

### 5.12. **Entropy**

5.12.2. **The Only Choice was the Units** First you need to choose the base for the logarithm, which only fixes the units. Base 2 gives bits; if the base is not 2, then the *Entropy* in bits should be divided by log2 of the base

5.12.3. Examples

5.12.3.1. **One coin** ***Entropy*** = one bit of randomness

5.12.3.1.1. H (1/2)

5.12.3.1.2. T (1/2)

5.12.3.2. **Two coins** ***Entropy*** = 2 bits of randomness

5.12.3.2.1. H

5.12.3.2.2. T

5.12.3.3. **A mixed case** ***Entropy*** = 1.5 bits of randomness = 1/2(1 bit) + 1/2(2 bits)

5.12.3.3.1. H (1/2)

5.12.3.3.2. T

5.12.4. **Examine the Trees** If we flip n coins, then each outcome has probability P=1/2^n, so \# coin flips = -log2(P)
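Averaging -log2(P) over the outcomes gives the usual entropy formula, H = -sum of P * log2(P). The sketch below assumes that formula and reads the mixed case as outcomes with probabilities 1/2, 1/4 and 1/4 (one flip, then a second flip only after tails); that reading is an assumption.

```python
import math

def entropy(probs, base=2):
    """Entropy of a discrete distribution; bits by default, other units via base."""
    bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return bits / math.log2(base)     # change of units: divide by log2 of the base

one_coin = entropy([0.5, 0.5])                  # 1 bit
two_coins = entropy([0.25, 0.25, 0.25, 0.25])   # 2 bits
mixed = entropy([0.5, 0.25, 0.25])              # 1.5 bits
print(one_coin, two_coins, mixed)
print(entropy([0.5, 0.5], base=math.e))         # same randomness measured in nats
```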

### 5.13. **Continuous random variables**

5.13.1. For many applications ML works with **continuous random variables** (measurement with real numbers).

5.13.2. **Probability density function** ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-8f52b038-736a-414e-9173-4166dff478f5.png 150x106)

5.13.3. ![image](https://coggle-images.s3.amazonaws.com/5c15061653ea3d1fb9351883-af6a3858-40a8-4abb-bca7-f52119596669.png 87x150)