where $\mathrm{sign}(r) = +1$ if $r \geq 0$, and $\mathrm{sign}(r) = -1$ otherwise. Both logistic regression and the perceptron can be generalized to the multi-category case. The bias term $b$ can be absorbed into the weight parameters $\theta$ if we fix $h_{i1} = 1$. Let $f(X) = h(X)^\top \theta$. $f(X)$ captures the relationship between $X$ and $Y$. Because $h(X)$ is non-linear, $f(X)$ is also non-linear. We say the model is in the linear form because it is linear in $\theta$, i.e., $f(X)$ is a linear combination of the features in $h(X)$. The following are the choices of $h()$ in various discriminative models.

Kernel machine [12]: $h_i = h(X_i)$ is implicit, and the dimension of $h_i$ can potentially be infinite. The implementation of this method is based on the kernel trick $\langle h(X), h(X') \rangle = K(X, X')$, where $K$ is a kernel that is used explicitly by the classifier, such as the support vector machine [12]. $f(X) = h(X)^\top \theta$ belongs to the reproducing kernel Hilbert space, in which the norm of $f$ can be defined as the Euclidean norm of $\theta$, and this norm is used to regularize the model. A Bayesian treatment leads to the Gaussian process, where $\theta$ is assumed to follow $\mathrm{N}(0, \sigma^2 I_d)$, with $I_d$ the identity matrix of dimension $d$. Then $f(X)$ is a Gaussian process with $\mathrm{Cov}(f(X), f(X')) = \sigma^2 K(X, X')$.

Boosting machine [22]: For $h_i = (h_{ik}, k = 1, ..., d)^\top$, each $h_{ik} \in \{+1, -1\}$ is a weak classifier or a binary feature extracted from $X$, and $f(X) = h(X)^\top \theta$ is a committee of weak classifiers.

CART [6]: In the classification and regression trees, there are $d$ rectangular regions $\{R_k, k = 1, ..., d\}$ resulting from recursive binary partitioning of the space of $X$, and each $h_{ik} = 1(X_i \in R_k)$ is the binary indicator such that $h_{ik} = 1$ if $X_i \in R_k$ and $h_{ik} = 0$ otherwise. $f(X) = h(X)^\top \theta$ is a piecewise constant function.

MARS [23]: In the multivariate adaptive regression splines, the components of $h(X)$ are hinge functions such as $\max(0, x_j - t)$ (where $x_j$ is the $j$-th component of $X$, $j = 1, ..., p$, and $t$ is a threshold) and their products. MARS can be considered a continuous version of CART.

Encoder and decoder: In the diagram in (2.1), the transformation $X_i \rightarrow h_i$ is called an encoder, and the transformation $h_i \rightarrow Y_i$ is called a decoder. In the non-hierarchical model, the encoder is designed, and only the decoder is learned (see the code sketch below).

The outcome $Y_i$ can also be continuous or a high-dimensional vector, in which case the learning becomes a regression problem. Both classification and regression are supervised learning problems because for each input $X_i$, an output $Y_i$ is provided as supervision. Reinforcement learning is similar to supervised learning except that the guidance is in the form of a reward function.

2.2. Descriptive models. This subsection describes the linear form of the descriptive models and the maximum likelihood learning algorithm. The descriptive models [113] can be learned in the unsupervised setting, where the $Y_i$ are not observed, as illustrated by the table below:
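As a concrete illustration of the linear form $f(X) = h(X)^\top \theta$ with a designed encoder and a learned decoder, the following is a minimal sketch in Python/NumPy. It is not taken from the original text: the MARS-style hinge thresholds, the synthetic data, and the learning rate are hypothetical choices made only for illustration.

```python
# Minimal sketch: f(X) = h(X)^T theta with a hand-designed encoder h
# (MARS-style hinge features) and a decoder theta learned by logistic
# regression.  All specific choices (thresholds, data, learning rate)
# are hypothetical and only for illustration.
import numpy as np

rng = np.random.default_rng(0)

def h(X, thresholds=(-1.0, 0.0, 1.0)):
    """Designed encoder: a constant feature plus hinge features max(0, x_j - t)."""
    feats = [np.ones(len(X))]               # h_{i1} = 1 absorbs the bias b
    for j in range(X.shape[1]):
        for t in thresholds:
            feats.append(np.maximum(0.0, X[:, j] - t))
    return np.column_stack(feats)

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

# Synthetic data: Y depends non-linearly on X, while f(X) = h(X)^T theta
# remains linear in theta.
X = rng.normal(size=(500, 2))
Y = (np.abs(X[:, 0]) + 0.5 * X[:, 1] > 1.0).astype(float)

H = h(X)                                    # encoder is fixed (designed)
theta = np.zeros(H.shape[1])                # decoder is learned

# Gradient ascent on the logistic log-likelihood.
lr = 0.1
for _ in range(2000):
    p = sigmoid(H @ theta)
    theta += lr * H.T @ (Y - p) / len(Y)

pred = (sigmoid(H @ theta) >= 0.5).astype(float)
print("training accuracy:", (pred == Y).mean())
```

Replacing the hinge features with CART-style indicator features $1(X \in R_k)$, or with weak classifiers as in boosting, would change only the designed encoder; the learned decoder $\theta$ and the fitting procedure stay the same.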