When picking a baseline for CTR prediction, there is no shortage of candidates: Gradient Boosting, Neural Networks, Logistic Regression. Among them, LR is still the one most often selected. There are reasons for that.
This post revisits those reasons from first principles: what LR is, and why it is so often chosen for this role.
Three Properties
The reasons LR has long served as the baseline in CTR prediction come down to three properties.
Lightweight. The model is a single dot product. Training and inference both scale linearly with the number of features.
Interpretable. Every coefficient directly indicates “how much this feature contributes to the outcome.”
Probability output. It outputs values between 0 and 1. In ads, you multiply those directly against a bid.
The rest of the post explains why these three are “structurally” inherent to LR.
From Linear to Sigmoid
The most direct way to understand Logistic Regression is to start from linear regression.
Linear regression outputs a weighted sum of the inputs.
$$ z = w \cdot x + b $$

The problem is that $z$ ranges over all real numbers. To produce a probability like CTR, the output must lie between 0 and 1. Linear regression does not guarantee that.
The sigmoid function solves this.
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Sigmoid smoothly compresses the entire real line into $(0, 1)$. No matter how large the input, it approaches 1; no matter how small, it approaches 0. Pass the output of linear regression through sigmoid, and you get a probability.
This simple composition is all there is to Logistic Regression: a linear model with a probability layer on top.
One thing worth noting. The probability output is nonlinear, but the decision boundary, the surface that separates the two sides at probability 0.5, remains linear. The hyperplane $w \cdot x + b = 0$ is itself the boundary. LR is “a linear classifier with probabilities bolted on.”
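The whole model fits in a few lines. A minimal sketch in Python, with made-up weights for illustration (nothing here is trained; the numbers only show the mechanics):

```python
import math

def sigmoid(z: float) -> float:
    # Compress any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(w: list[float], b: float, x: list[float]) -> float:
    # Linear score, then sigmoid: that is all of Logistic Regression.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative weights (made-up, not a trained model).
w = [0.8, -1.2]
b = -0.5
print(predict_ctr(w, b, [1.0, 0.0]))  # z = 0.3  -> probability above 0.5
print(predict_ctr(w, b, [0.0, 1.0]))  # z = -1.7 -> probability below 0.5
```

The decision boundary is visible in the code: whichever side of $z = 0$ the dot product lands on decides whether the output is above or below 0.5.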
Log-Loss
Once the model structure is fixed, training becomes “finding good $w$ and $b$.” We need a criterion for “good.”
Linear regression uses MSE. LR does not. Why?
LR’s output is a probability. There is a more suitable choice of loss for probabilistic models: log-loss (a.k.a. cross-entropy).
$$ L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] $$

When the label is 1, the loss shrinks as $\log \hat{y}$ grows; when the label is 0, the loss shrinks as $\log(1 - \hat{y})$ grows. The closer the predicted probability gets to the truth, the closer the loss gets to zero.
Log-loss, as a function of $w$ and $b$, is convex for LR. No local minima: gradient descent converges to the global optimum regardless of initialization. This property is the mathematical reason LR trains quickly and reliably on large-scale data.
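The formula translates directly to code. A small sketch, with the standard epsilon clamp to keep $\log(0)$ out of the computation:

```python
import math

def log_loss(y_true: list[int], y_pred: list[float], eps: float = 1e-15) -> float:
    # Average negative log-likelihood over the dataset.
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident and correct predictions: loss near zero.
print(log_loss([1, 0], [0.99, 0.01]))
# Confident but wrong: the loss blows up.
print(log_loss([1, 0], [0.01, 0.99]))
```

Note the asymmetry this creates: being confidently wrong is punished far more heavily than being uncertain, which is exactly the behavior you want from a probability estimator.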
The Structural Roots of the Three Faces
The three characteristics from the overview (lightweight, interpretable, probability output) all follow from the structure above.
Lightweight
A trained LR model is ultimately a weight vector $w$ and a bias $b$. Inference is one dot product and one sigmoid. Whether you have a million features or ten million, the computation scales linearly with the feature count. Compared to the many multiplications and nonlinearities in tree ensembles or neural networks, LR requires far less computation.
Interpretable
A coefficient $w_i$ means “when feature $i$ increases by one unit, the log-odds shift by $w_i$.” The sign indicates direction; the magnitude indicates influence. When you want to know “which feature contributes positively to clicks” in the ad domain, LR answers with a single table of coefficients. This satisfies the accountability requirements on the operations side.
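In practice, that "single table of coefficients" is just the weight vector, printed. A sketch with hypothetical feature names and coefficient values (all made up for illustration), converting each log-odds shift into an odds ratio via $e^{w_i}$:

```python
import math

# Hypothetical trained coefficients; names and values are made up.
coefficients = {
    "ad_category=games": 0.9,
    "device=desktop": -0.4,
    "hour_of_day=22": 0.2,
}

# Each coefficient is a log-odds shift; exp(w) turns it into an odds ratio.
for feature, w in sorted(coefficients.items(), key=lambda kv: -kv[1]):
    print(f"{feature:>22}  log-odds {w:+.2f}  odds x{math.exp(w):.2f}")
```

A coefficient of 0.9 means the feature multiplies the click odds by roughly 2.5; a negative coefficient divides them. That is the whole explanation an operations review needs.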
Probability Output
Many classifiers output only a ranking score. LR outputs a calibrated probability. Ad expected-value math requires multiplying that number directly: predicted CTR × bid = expected revenue. A score that is not a probability cannot be used directly in the bidding formula.
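The bidding arithmetic is one multiplication, but it only works because the left factor is a probability. A toy example (all numbers made up):

```python
# Expected value per impression: a calibrated probability multiplies
# directly against money. A mere ranking score could not do this.
predicted_ctr = 0.012   # calibrated probability from LR
bid_per_click = 0.50    # advertiser's bid in dollars (made-up number)

expected_revenue = predicted_ctr * bid_per_click
print(expected_revenue)  # 0.006 dollars per impression
```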
CTR Prediction
CTR prediction as a problem has three characteristics.
Sparse. Most features are one-hot-encoded categoricals. Out of millions of dimensions, only a handful are 1; the rest are 0.
High-dimensional. The cross-product of ad, user, and context features spans millions to hundreds of millions of dimensions.
Large-scale. Training data accumulates in large daily volumes.
LR aligns with all three. The dot product of a sparse vector only needs to touch the non-zero entries, so the computation scales with the actual count of populated features, not the raw dimensionality. Training is easy to distribute via the SGD family. Inference fits inside the tight latency budget of real-time bidding.
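The sparse dot product is worth seeing concretely. A sketch where the "million-dimensional" one-hot input is just a list of active indices, and the weight vector is a dict of non-zero entries (indices and weights made up for illustration):

```python
import math

def sparse_predict(w: dict[int, float], b: float, active: list[int]) -> float:
    # One-hot features: every active entry is 1, so the dot product is
    # just a sum over active indices. Cost scales with len(active),
    # not with the nominal dimensionality of the feature space.
    z = b + sum(w.get(i, 0.0) for i in active)
    return 1.0 / (1.0 + math.exp(-z))

# A nominally million-dimensional model, stored sparsely (made-up weights).
w = {12: 0.7, 40_551: -0.3, 987_654: 1.1}
b = -2.0

# Only 2 of ~1M dimensions are touched for this impression.
print(sparse_predict(w, b, [12, 987_654]))
```

This is why inference fits inside a real-time bidding latency budget: the work per request is proportional to the handful of features that are actually populated.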
When bringing up a CTR model for the first time, these characteristics become decisive. You need to establish a baseline quickly, covering the training pipeline, serving, and monitoring, and validate the entire lifecycle first. A more complex model delays that validation itself.
Limits and What Comes Next
Having seen why LR serves as the baseline, we should also see why it is eventually replaced.
The biggest limit is the absence of nonlinear interactions. Products of features, conditional effects, complex combinations. LR cannot discover those on its own. A human has to define them in advance through feature engineering. As feature combinations grow, the engineering cost increases and operations become constrained by feature-design reviews.
So when do you move on? When the growth in data and operational demands reaches a point that feature engineering can no longer absorb. Gradient Boosting Decision Trees learn interactions on their own. Neural networks go further, converting high-cardinality categoricals into continuous vectors through embeddings. Both directions address exactly LR’s limits.
That said, LR remains a reasonable starting point. Without a baseline, if you start with a complex model, you cannot distinguish the model’s contribution from the pipeline’s. The numbers LR provides become the reference line for every comparison that follows.
Closing
The old model was chosen for a reason.
Those reasons are in the structure. The composition of a linear model and sigmoid, the convexity of log-loss, the efficiency in sparse, high-dimensional spaces. Together, these three keep LR as the baseline for CTR prediction.
Even when the time comes to move to the next model, the numbers LR provided remain as the baseline.