Kamel's Notes

Code changes world!



Study Note: Linear Discriminant Analysis, ROC & AUC, Confusion Matrix

Posted on 2019-10-19 | Edited on 2020-07-17 | In Machine Learning
Symbols count in article: 13k | Reading time ≈ 12 mins.

LDA vs. Logistic Regression:

  1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  3. Linear discriminant analysis is popular when we have more than two response classes.
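A minimal sketch of the first point, assuming a one-dimensional, perfectly separated toy data set (the data, the helper names, and the learning rate are made up for illustration). Gradient ascent on the logistic log-likelihood keeps pushing the slope upward the longer it runs, while the LDA-style slope, built from class means and a pooled variance, is a fixed finite number.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_slope(xs, ys, steps, lr=0.5):
    """Gradient ascent on the log-likelihood of a no-intercept logistic model."""
    theta = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(theta * x)) * x for x, y in zip(xs, ys))
        theta += lr * grad
    return theta

def lda_slope(xs, ys):
    """Slope of the 1-D LDA discriminant: (mu1 - mu0) / pooled variance."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    mu0, mu1 = sum(x0) / len(x0), sum(x1) / len(x1)
    ss = sum((x - mu0) ** 2 for x in x0) + sum((x - mu1) ** 2 for x in x1)
    pooled_var = ss / (len(xs) - 2)  # n - K with K = 2 classes
    return (mu1 - mu0) / pooled_var

# Toy data: every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

theta_100 = fit_logistic_slope(xs, ys, steps=100)
theta_10000 = fit_logistic_slope(xs, ys, steps=10000)
lda = lda_slope(xs, ys)

print(theta_100, theta_10000)  # the logistic slope keeps growing with more steps
print(lda)                     # the LDA slope does not depend on an iteration count
```

The logistic slope never settles because, on separated data, the likelihood can always be increased by making the slope larger.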

Study Note: Logistic Regression

Posted on 2019-10-19 | Edited on 2020-09-13 | In Machine Learning
Symbols count in article: 9.2k | Reading time ≈ 8 mins.

Some notation: \[ \begin{align} \theta^Tx=\sum_{i=1}^n \theta_ix_i \tag{weighted sum} \\ \sigma(z)=\frac{1}{1+e^{-z}} \tag{sigmoid function} \end{align} \]
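The two definitions above can be sketched in plain Python. The function names and the example values of `theta` and `x` are assumptions chosen for illustration.

```python
import math

def weighted_sum(theta, x):
    """Compute theta^T x as the sum of theta_i * x_i."""
    return sum(t * xi for t, xi in zip(theta, x))

def sigmoid(z):
    """Compute 1 / (1 + e^{-z}), branching on the sign of z to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]

z = weighted_sum(theta, x)  # 0.5 - 2.0 + 1.0
prob = sigmoid(z)           # a value strictly between 0 and 1
print(z, prob)
```

The sigmoid maps any real-valued weighted sum to the interval (0, 1), which is why its output can be read as a probability.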


Study Note: Linear Regression Part II - Potential Problems

Posted on 2019-10-19 | Edited on 2020-07-19 | In Machine Learning
Symbols count in article: 10k | Reading time ≈ 9 mins.

Qualitative Predictors

Predictors with Only Two Levels

Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment. If a qualitative predictor (also known as a factor) only has two levels, or possible values, then incorporating it into a regression model is very simple. We simply create an indicator or dummy variable that takes on two possible numerical values.

and use this variable as a predictor in the regression equation. This results in the model
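A minimal sketch of this dummy-variable coding in Python. The 1 = female / 0 = male coding, the helper `fit_simple_ols`, and the balance figures are assumptions made up for illustration, not taken from the post. With a single 0/1 predictor, the least-squares intercept equals the mean of the group coded 0, and the slope equals the difference between the two group means.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y = beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data: six card holders and their balances.
gender = ["male", "female", "female", "male", "female", "male"]
balance = [500.0, 540.0, 520.0, 510.0, 530.0, 490.0]

# Dummy variable: 1 for female, 0 for male (an assumed coding).
is_female = [1 if g == "female" else 0 for g in gender]

beta0, beta1 = fit_simple_ols(is_female, balance)
print(beta0)  # mean balance of the group coded 0
print(beta1)  # difference between the two group means
```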


Study Note: Linear Regression Part I - Linear Regression Models

Posted on 2019-10-19 | Edited on 2020-07-19 | In Machine Learning
Symbols count in article: 36k | Reading time ≈ 33 mins.

Simple Linear Regression Models

Linear Regression Model

  • Form of the linear regression model: \(y=\beta_{0}+\beta_{1}X+\epsilon\).

  • Training data: (\(x_1\),\(y_1\)) ... (\(x_N\),\(y_N\)). Each \(x_{i} =(x_{i1},x_{i2},...,x_{ip})^{T}\) is a vector of feature measurements for the \(i\)-th case.

  • Goal: estimate the parameters \(β\)

  • Estimation method: least squares. We pick the coefficients \(β =(β_0,β_1,...,β_p)^{T}\) that minimize the residual sum of squares.
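For reference, the residual sum of squares in the notation above can be written as follows. This is the standard textbook form, not copied from the full post; the design matrix \(X\) (with a leading column of ones) and the full-column-rank condition are assumptions needed for the closed-form solution.

```latex
\[
\mathrm{RSS}(\beta)
  = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
  = (y - X\beta)^{T}(y - X\beta)
\]
% If X has full column rank, the minimizer is unique:
\[
\hat{\beta} = (X^{T}X)^{-1}X^{T}y
\]
```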

Assumptions:

  • Observations \(y_i\) are uncorrelated and have constant variance \(\sigma^2\).
  • \(x_i\) are fixed (non-random).
  • The regression function \(E(Y |X)\) is linear, or the linear model is a reasonable approximation.

August 2019

Posted on 2019-09-07 | In Journal
Symbols count in article: 678 | Reading time ≈ 1 mins.

 


July 2019

Posted on 2019-08-02 | In Journal
Symbols count in article: 2.2k | Reading time ≈ 2 mins.

 


CS229 Note: Probability Theory - Random Variables

Posted on 2019-07-24 | Edited on 2019-12-14 | In Machine Learning , CS229
Symbols count in article: 3k | Reading time ≈ 3 mins.

Random Variables

Discrete random variables:

  • \(X ∼ Bernoulli(p)\) (where 0 ≤ p ≤ 1): one if a coin with heads probability \(p\) comes up heads, zero otherwise

    • PMF: \(p(x)=\begin{cases}p & x=1\\1-p & x=0\end{cases}\)

    • Mean: \(p\)
    • Variance: \(p(1-p)\)

  • \(X ∼ Binomial(n, p)\) (where 0 ≤ p ≤ 1): the number of heads in \(n\) independent flips of a coin with heads probability \(p\).

    • PMF: \(p(x)=\left(\begin{array}{c}n\\ x\end{array}\right) p^x(1-p)^{n-x}\)

    • Mean: \(np\)
    • Variance: \(np(1-p)\)

  • \(X ∼ Geometric(p)\) (where 0 < p ≤ 1): the number of flips of a coin with heads probability \(p\) up to and including the first heads.

    • PMF: \(p(x)=p(1-p)^{x-1}\)

    • Mean: \(\frac{1}{p}\)
    • Variance: \(\frac{1-p}{p^2}\)

  • \(X ∼ Poisson(λ)\) (where λ > 0): a probability distribution over the nonnegative integers used for modeling the frequency of rare events.
    • PMF: \(p(x)=e^{-\lambda}\frac{\lambda^x}{x!}\)
    • Mean: \(\lambda\)
    • Variance: \(\lambda\)
    • Properties: a Poisson random variable with \(\lambda = np\) may be used to approximate a binomial random variable when the binomial parameter \(n\) is large and \(p\) is small.
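The Poisson approximation to the binomial can be checked numerically with the two PMFs listed above. This is a minimal sketch; the values n = 100 and p = 0.03 are assumptions chosen only to make n fairly large and p small.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 100, 0.03
lam = n * p  # match the Poisson mean to the binomial mean

# Largest pointwise gap between the two PMFs over the first few counts.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))
print("largest gap:", max_gap)
```

The gap shrinks further as n grows and p shrinks with the product np held fixed.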

Spark SQL & DataFrame, SparkETL

Posted on 2019-07-09 | In Big Data
Symbols count in article: 5.7k | Reading time ≈ 5 mins.

SQL and DataFrame

import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:

getOrCreate: gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder.

# Build the configuration once and pass that same object to the builder.
conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder.master("local").appName("wordcount").config(conf=conf).getOrCreate()

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.


Hadoop MapReduce: Tuning Distributed Storage Platform with File Types

Posted on 2019-07-09 | Edited on 2019-07-10 | In Big Data
Symbols count in article: 13k | Reading time ≈ 12 mins.

Data modeling and file formats

There is a mismatch between terms used to define business tasks and terms used to describe what HDFS is.

Data modeling and data management are concerned with these issues.


June 2019

Posted on 2019-07-09 | Edited on 2019-11-10 | In Journal
Symbols count in article: 2k | Reading time ≈ 2 mins.

 


Nancy Yan

© 2021 Kamel Chehboun | 518k