Study Note: Linear Discriminant Analysis, ROC & AUC, Confusion Matrix

Posted on 2019-10-19 | Edited on 2020-07-17 | In Machine Learning

Symbols count in article: 13k | Reading time ≈ 12 mins.

LDA vs. Logistic Regression:

When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes.
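
The instability on well-separated classes can be seen in a small experiment. The sketch below is only an illustration, not the method used in the post: it fits a one-parameter logistic model (slope only, no intercept) by plain gradient ascent on made-up, perfectly separated data. The function `fit_slope`, the learning rate, and the data are all assumptions chosen for the example. Because the likelihood keeps improving as the slope grows, the estimate never settles and depends on how long the optimizer runs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_slope(xs, ys, steps, lr=0.1):
    """Gradient ascent on the logistic log-likelihood for a single slope (no intercept)."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to theta.
        grad = sum((y - sigmoid(theta * x)) * x for x, y in zip(xs, ys))
        theta += lr * grad
    return theta


# Toy data (hypothetical): every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

theta_100 = fit_slope(xs, ys, steps=100)
theta_10000 = fit_slope(xs, ys, steps=10000)

# The estimate keeps growing with more iterations instead of converging.
print(theta_100, theta_10000)
```

On data that is not perfectly separated, the same loop would approach a finite maximizer, which is why the problem is specific to well-separated classes.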

Study Note: Logistic Regression

Posted on 2019-10-19 | Edited on 2020-09-13 | In Machine Learning

Symbols count in article: 9.2k | Reading time ≈ 8 mins.

Some notation: \[ \begin{align} \theta^Tx=\sum_{i=1}^n \theta_ix_i \tag{weighted sum} \\ \sigma(z)=\frac{1}{1+e^{-z}} \tag{sigmoid function} \end{align} \]
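
As a minimal sketch of this notation, the two functions below compute the weighted sum $\theta^Tx$ and the sigmoid $\sigma(z)$ with the standard library only. The names `weighted_sum` and `sigmoid` and the numeric values of `theta` and `x` are assumptions made for illustration. The sigmoid is written in a branch form that avoids overflow for large negative inputs.

```python
import math


def weighted_sum(theta, x):
    """Compute theta^T x = sum_i theta_i * x_i for two equal-length sequences."""
    return sum(t * xi for t, xi in zip(theta, x))


def sigmoid(z):
    """Compute 1 / (1 + exp(-z)) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) is in (0, 1)
    return e / (1.0 + e)


# Hypothetical parameter and feature vectors.
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]

z = weighted_sum(theta, x)  # 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5
p = sigmoid(z)              # a value strictly between 0 and 1
print(z, p)
```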

Study Note: Linear Regression Part II - Potential Problems

Posted on 2019-10-19 | Edited on 2020-07-19 | In Machine Learning

Symbols count in article: 10k | Reading time ≈ 9 mins.

Qualitative Predictors

Predictors with Only Two Levels

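Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment. If a qualitative predictor (also known as a factor) has only two levels, or possible values, then incorporating it into a regression model is very simple. We create an indicator or dummy variable that takes on two possible numerical values (for example, $x_i=1$ if the $i$-th person is female and $x_i=0$ if the $i$-th person is male) and use this variable as a predictor in the regression equation. This results in the model $y_i=\beta_0+\beta_1x_i+\epsilon_i$, so that under this example coding $\beta_0$ is the average balance among males and $\beta_0+\beta_1$ is the average balance among females.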
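
A short sketch of how a two-level dummy variable behaves in a least squares fit follows. Everything specific here is an assumption for illustration: the coding (1 for female, 0 for male), the helper `simple_least_squares`, and the balance values, which are made up and not taken from any dataset. With a single 0/1 predictor, the fitted intercept equals the mean response of the group coded 0 and the fitted slope equals the difference between the two group means.

```python
def simple_least_squares(xs, ys):
    """Closed-form least squares fit of y = beta0 + beta1 * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Made-up example data (hypothetical balances).
genders = ["male", "female", "female", "male", "female"]
balances = [500.0, 530.0, 510.0, 520.0, 550.0]

# Dummy coding (an assumption): 1 if female, 0 if male.
x = [1.0 if g == "female" else 0.0 for g in genders]

beta0, beta1 = simple_least_squares(x, balances)

# Group means, for comparison with the fitted coefficients.
male_mean = sum(b for g, b in zip(genders, balances) if g == "male") / genders.count("male")
female_mean = sum(b for g, b in zip(genders, balances) if g == "female") / genders.count("female")

print(beta0, male_mean)                # intercept matches the male mean
print(beta1, female_mean - male_mean)  # slope matches the difference in means
```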

Study Note: Linear Regression Part I - Linear Regression Models

Posted on 2019-10-19 | Edited on 2020-07-19 | In Machine Learning

Symbols count in article: 36k | Reading time ≈ 33 mins.

Simple Linear Regression Models

Linear Regression Model

Form of the linear regression model: $y=\beta_{0}+\beta_{1}X+\epsilon$.
Training data: $(x_1,y_1), \ldots, (x_N,y_N)$. Each $x_{i} =(x_{i1},x_{i2},...,x_{ip})^{T}$ is a vector of feature measurements for the $i$-th case.
Goal: estimate the parameters $\beta$.
Estimation method: least squares. We pick the coefficients $\beta =(\beta_0,\beta_1,...,\beta_p)^{T}$ to minimize the residual sum of squares.
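
Written out, the least squares criterion can be sketched as follows, assuming the general $p$-feature form of the model (with $p=1$ it reduces to the single-predictor model above). Here $\mathbf{X}$ denotes the $N \times (p+1)$ matrix whose rows are the $x_i^T$ with a leading 1 for the intercept, and $\mathbf{y}$ the vector of responses; this matrix notation is introduced for the example, and the closed form assumes $\mathbf{X}^T\mathbf{X}$ is invertible.

\[
\begin{align}
RSS(\beta)=\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 \tag{residual sum of squares} \\
\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{least squares estimate}
\end{align}
\]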

Assumptions:

Observations $y_i$ are uncorrelated and have constant variance $\sigma^2$;
$x_i$ are fixed (non-random);
The regression function $E(Y|X)$ is linear, or the linear model is a reasonable approximation.

August 2019

Posted on 2019-09-07 | In Journal

Symbols count in article: 678 | Reading time ≈ 1 mins.

July 2019

Posted on 2019-08-02 | In Journal

Symbols count in article: 2.2k | Reading time ≈ 2 mins.

CS229 Note: Probability Theory - Random Variables

Posted on 2019-07-24 | Edited on 2019-12-14 | In Machine Learning , CS229

Symbols count in article: 3k | Reading time ≈ 3 mins.

Random Variables

Discrete random variables:

$X ∼ Bernoulli(p)$ (where 0 ≤ p ≤ 1): one if a coin with heads probability $p$ comes up heads, zero otherwise
- PMF: $p(x)=\begin{cases}p & x=1\\1-p & x=0\end{cases}$
- Mean: $p$
- Variance: $p(1-p)$
$X ∼ Binomial(n, p)$ (where 0 ≤ p ≤ 1): the number of heads in $n$ independent flips of a coin with heads probability $p$.
- PMF: $p(x)=\left(\begin{array}{c}n\\ x\end{array}\right) p^x(1-p)^{n-x}$
- Mean: $np$
- Variance: $np(1-p)$
$X ∼ Geometric(p)$ (where 0 < p ≤ 1): the number of flips of a coin with heads probability $p$ until the first heads, counting the flip that lands heads.
- PMF: $p(x)=p(1-p)^{x-1}$
- Mean: $\frac{1}{p}$
- Variance: $\frac{1-p}{p^2}$
$X ∼ Poisson(λ)$ (where λ > 0): a probability distribution over the nonnegative integers used for modeling the frequency of rare events.
- PMF: $p(x)=e^{-\lambda}\frac{\lambda^x}{x!}$
- Mean: $\lambda$
- Variance: $\lambda$
- Properties: A Poisson random variable may be used to approximate a binomial random variable when the binomial parameter $n$ is large and $p$ is small, taking $\lambda = np$.
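
The formulas above can be checked numerically. The sketch below, using only the standard library, evaluates the binomial and Poisson PMFs, recovers the binomial mean and variance by direct summation, and measures how close the Poisson PMF with $\lambda = np$ is to the binomial PMF. The helper names and the values $n = 500$, $p = 0.006$ are assumptions picked for illustration.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution by summing over its support."""
    mean = sum(x * pmf(x) for x in support)
    var = sum((x - mean) ** 2 * pmf(x) for x in support)
    return mean, var


# Hypothetical parameters: n large, p small.
n, p = 500, 0.006
lam = n * p

binom_mean, binom_var = mean_and_variance(lambda x: binomial_pmf(x, n, p), range(n + 1))
print(binom_mean, n * p)           # should agree with the formula np
print(binom_var, n * p * (1 - p))  # should agree with the formula np(1-p)

# Largest pointwise gap between the two PMFs over x = 0..20.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21))
print(max_gap)
```

Shrinking $p$ while holding $np$ fixed should make the gap smaller, which is the sense in which the approximation improves.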

Spark SQL & DataFrame, SparkETL

Posted on 2019-07-09 | In Big Data

Symbols count in article: 5.7k | Reading time ≈ 5 mins.

SQL and DataFrame

import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:

getOrCreate: Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder

# Pass the configured SparkConf to the builder so the profiling option takes effect
conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder.master("local").appName("wordcount").config(conf=conf).getOrCreate()

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.

Hadoop MapReduce: Tuning Distributed Storage Platform with File Types

Posted on 2019-07-09 | Edited on 2019-07-10 | In Big Data

Symbols count in article: 13k | Reading time ≈ 12 mins.

Data modeling and file formats

There is a mismatch between terms used to define business tasks and terms used to describe what HDFS is.

Data modeling and data management are concerned with these issues.

June 2019

Posted on 2019-07-09 | Edited on 2019-11-10 | In Journal

Symbols count in article: 2k | Reading time ≈ 2 mins.