Alan Barzilay
Impractical most of the time.
Wastes too much time in low-performance regions.
Same issues as Grid Search, but a step in the right direction.
Tip: set multivariate=True in the TPESampler
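A minimal sketch of that tip, assuming Optuna's TPESampler (by default TPE models each hyperparameter independently; multivariate=True lets it model their interactions):

import optuna

def objective(trial):
    x = trial.suggest_float("x", -5, 5)
    y = trial.suggest_float("y", -5, 5)
    return (x - 2) ** 2 + (x - y) ** 2  # x and y interact

# multivariate=True makes TPE fit a joint density over (x, y)
study = optuna.create_study(sampler=optuna.samplers.TPESampler(multivariate=True))
study.optimize(objective, n_trials=100)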
Random Exploration
Partitioning the Search Space and Parzen Estimation
\(l(x)\) is estimated from the top \(\gamma\)% best-performing hyperparameter combinations; \(g(x)\) is estimated from the rest.
\(l(x)\) is "good" and \(g(x)\) is "bad".
Determining the Next “Best” Hyperparameters to Test
Test the maximum of \(l(x)/g(x)\), rinse and repeat.
The Parzen window estimator, also known as the kernel density estimator (KDE), is a statistical model for density estimation. In the TPE algorithm, Parzen window estimators are used to estimate the densities of good hyperparameters \(l(x)\) and bad hyperparameters \(g(x)\).
Intuitively speaking, the Parzen window estimator (right figure) is a smoothed version of the relative-frequency histogram (left figure).
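To make \(l(x)\) and \(g(x)\) concrete, here is a toy 1-D sketch of the idea using scipy's gaussian_kde as the Parzen window estimator (an illustration, not Optuna's actual implementation):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend these are 100 completed trials: hyperparameter value -> observed loss.
xs = rng.uniform(-5, 5, size=100)
losses = (xs - 2.0) ** 2 + rng.normal(scale=0.5, size=100)

# Split at the gamma quantile: the best gamma% feed l(x), the rest feed g(x).
gamma = 0.25
cut = np.quantile(losses, gamma)
l_kde = gaussian_kde(xs[losses <= cut])  # density of "good" configurations
g_kde = gaussian_kde(xs[losses > cut])   # density of "bad" configurations

# The candidate with the highest l(x)/g(x) ratio is tested next.
candidates = rng.uniform(-5, 5, size=1000)
best = candidates[np.argmax(l_kde(candidates) / g_kde(candidates))]
print(f"next point to evaluate: {best:.3f}")  # should land near x = 2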
Q: What if it gets stuck in a local minimum?
A: Restart it
You could also warm-start it, use margins, use an adaptive learning rate, diagonalize the covariance matrix, or use any of its many other variations (most of which are supported natively by Optuna).
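For illustration, here is how those variations map onto Optuna's CmaEsSampler options (parameter names as in recent Optuna releases; some combinations are mutually exclusive, so each is shown on its own):

import optuna

restarting = optuna.samplers.CmaEsSampler(restart_strategy="ipop", inc_popsize=2)
diagonal = optuna.samplers.CmaEsSampler(use_separable_cma=True)  # diagonal covariance matrix
discrete = optuna.samplers.CmaEsSampler(with_margin=True)        # margins for int/discrete params
adaptive = optuna.samplers.CmaEsSampler(lr_adapt=True)           # adaptive learning rate

study = optuna.create_study(sampler=restarting)
study.optimize(lambda t: (t.suggest_float("x", -5, 5) - 2) ** 2, n_trials=50)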
Not as flexible or fast as TPE (e.g. there is no support for categorical variables or constraints).
This is a big one; there are entire courses on the subject.
Let's focus on how it contrasts with other samplers instead of dissecting the algorithm's minute inner workings.
The biggest difference between GP and other samplers lies in how it looks for new points to test.
With TPE and CMA-ES, we take a number of random trials as a warm start and, based on them, try new points we believe will perform well.
GP, in contrast, utilizes an acquisition function to balance the trade-off between exploration (testing new regions of the search space) and exploitation (sampling new points from a known good region).
Every time GP tests a new point, it picks the set of hyperparameters that maximizes "information" about the search space and shrinks regions of uncertainty, so that it can exploit good regions with high confidence.
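A classic acquisition function is Expected Improvement; here is a sketch of the textbook formula for minimization (not necessarily the exact acquisition function Optuna's GPSampler uses):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    # mu, sigma: GP posterior mean and standard deviation at candidate points.
    # EI is large where the model predicts improvement over best_so_far
    # (exploitation) or where sigma is large (exploration).
    improvement = best_so_far - mu - xi
    z = improvement / np.maximum(sigma, 1e-12)
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)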
However, this process becomes extremely computationally expensive as the number of trials increases, being \(O(n^3)\), where \(n\) is the number of completed trials.
That's why a common strategy is to use it only for the initial exploration of the search space, where it is still computationally viable, and then switch to another sampler.
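A hedged sketch of that two-phase strategy in Optuna (GPSampler needs a recent Optuna and PyTorch; study.sampler can simply be reassigned between optimize calls):

import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study(sampler=optuna.samplers.GPSampler(seed=0))
study.optimize(objective, n_trials=50)   # GP phase: sample-efficient while still cheap

study.sampler = optuna.samplers.TPESampler(seed=0)
study.optimize(objective, n_trials=200)  # TPE phase: reuses the GP trials as history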
There is a lot of good material online (and a lot of imprecise material too), so I won't add further to the noise. For the interested reader, here are some curated and rigorous references on the subject.
For those looking to fine-tune their usage of GP, the key "levers" are kernels (or covariance functions), acquisition functions, and priors.
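To show the "kernel" lever outside of Optuna, a scikit-learn sketch (class and parameter names are sklearn's, chosen for illustration; Optuna's GPSampler makes these choices for you):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Matern(nu=2.5) is a common default for hyperparameter surfaces;
# WhiteKernel adds a noise term so the GP does not interpolate noise exactly.
kernel = Matern(nu=2.5, length_scale=1.0) + WhiteKernel(noise_level=1e-5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)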
Steps:
\(P_t\) generates offspring \(Q_t\); together they form \(R_{t+1} = P_t \cup Q_t\), from which \(P_{t+1}\) will be selected
Q: How to ensure "genetic" diversity?
A: Crowding Distance
Q: How do we select points from the last layer?
A: Crowding Distance
Q: How do we break a rank tie in the tournament selection?
A: Crowding Distance (sketched below)
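Since crowding distance is the answer to everything here, a minimal sketch of how it is computed (boundary points get infinity, so the extremes of each front always survive):

import numpy as np

def crowding_distance(front):
    # front: (n_points, n_objectives) array of objective values for ONE front.
    n, m = front.shape
    distance = np.zeros(n)
    for j in range(m):
        order = np.argsort(front[:, j])
        distance[order[0]] = distance[order[-1]] = np.inf  # boundary points
        span = front[order[-1], j] - front[order[0], j]
        if span == 0:
            continue
        # Interior points: normalized gap between their two neighbors.
        distance[order[1:-1]] += (front[order[2:], j] - front[order[:-2], j]) / span
    return distance

front = np.array([[0.0, 1.0], [0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
print(crowding_distance(front))  # sparse points score higher and win ties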
With more than 3 objectives we start to struggle with:
With NSGA II in particular:
The proportion of non-dominated solutions in a randomly chosen set of objective vectors grows exponentially with the number of objectives.
The non-dominated solutions occupy most of the population slots.
Splits the objective space with vectors (usually uniformly distributed) that define reference directions \(W\).
The split is made in a manner that balances how many population members lie near each reference direction.
\(S\): "Previous" Pareto fronts
\(F_L\): Frontier to be split
\(\rho_i\): number of population members associated with reference direction \(W_i\)
\(W_i\): Reference Direction \(i\)
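A toy sketch of the association step these symbols describe: each member of \(S \cup F_L\) is attached to its nearest reference direction by perpendicular distance, and \(\rho_i\) counts the members per direction (assuming objectives are already normalized):

import numpy as np

def associate(points, W):
    # Perpendicular distance from each point to the line through each W_i.
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    proj = points @ W_unit.T                      # scalar projections, (n, k)
    foot = proj[:, :, None] * W_unit[None, :, :]  # projection points, (n, k, m)
    dist = np.linalg.norm(points[:, None, :] - foot, axis=2)
    return dist.argmin(axis=1)                    # nearest direction per point

rng = np.random.default_rng(0)
points = rng.random((6, 3))  # toy normalized objective vectors
W = np.eye(3)                # toy reference directions (Das-Dennis in practice)
rho = np.bincount(associate(points, W), minlength=len(W))
print(rho)                   # rho_i for each reference direction W_i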
Q: What if <insert niche use case or clever heuristic>?
A: NSGA III is a variation of NSGA II, and both have a ton of variations, which have variations of their own. I suggest taking a look at the existing literature, but be advised that you will probably hit quickly diminishing returns. Maybe try another family of algorithms?
Q: Could you further elaborate on reference directions?
A: I could, but this is outside of the scope of this presentation. The fact that I find the topic extremely boring is totally unrelated and bears no weight in this decision.
Q: What about NSGA I?
A: 2 is bigger than 1, so obviously NSGA I sucks and doesn't matter. Stop asking stupid questions.
Q: What about those reference directions?
A: Here are some cool pictures and quotes, plz leave me alone, I'm begging you
"emphasizes population members that are non-dominated, yet close to a set of supplied reference points." LISPECTOR, Clarice
"(...) we use Das and Dennis's [48] systematic approach that places points on a normalized hyper-plane – a (M − 1)-dimensional unit simplex – which is equally inclined to all objective axes."
DENNIS & DAS... i guess?
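For the morbidly curious, a small sketch of that systematic approach (a stars-and-bars enumeration of evenly spaced points on the unit simplex):

from itertools import combinations

def das_dennis(n_partitions, n_objectives):
    # Every direction is a vector of multiples of 1/n_partitions summing to 1.
    directions = []
    n_slots = n_partitions + n_objectives - 1
    for dividers in combinations(range(n_slots), n_objectives - 1):
        prev, point = -1, []
        for d in list(dividers) + [n_slots]:
            point.append((d - prev - 1) / n_partitions)
            prev = d
        directions.append(point)
    return directions

# 3 objectives, 4 partitions -> C(6, 2) = 15 evenly spaced directions
print(das_dennis(4, 3))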
Riesz s-Energy (an alternative way to generate well-spread reference directions)
It chooses the "ideal" sampler for your conditions and swaps between them in a smart and efficient manner. However, as of this writing, it isn't battle-tested.
# Simplified sketch of AutoSampler's decision logic. The choice is re-evaluated
# on every trial while the budget lasts, so the sampler can change as trials
# accumulate.
def choose_sampler(multi_objective, conditional_params, few_trials_completed,
                   n_objectives, categorical, has_constraints):
    if multi_objective:
        if conditional_params:
            return "TPE"
        if few_trials_completed:  # few trials: fewer than 250 completed
            return "TPE" if n_objectives >= 4 or categorical else "GP"
        return "NSGA-II" if n_objectives < 4 else "NSGA-III"
    if categorical or conditional_params:
        return "TPE"
    if few_trials_completed:
        return "GP"
    if not has_constraints:
        return "CMA-ES"
    return "TPE"
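If you want to try it anyway, AutoSampler is distributed through OptunaHub; a hedged usage sketch (module path as documented at the time of writing):

import optuna
import optunahub

# pip install optunahub, plus the underlying samplers' deps (e.g. cmaes, torch)
module = optunahub.load_module("samplers/auto_sampler")
study = optuna.create_study(sampler=module.AutoSampler())
study.optimize(lambda t: (t.suggest_float("x", -5, 5) - 2) ** 2, n_trials=100)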