In the notation of this page, the entropy H(P) is best thought of as:
"The mean number of bits to encode a member of P, assuming an optimal code."
And the KL divergence KL(P,Q) is probably best thought of as:
"The mean number of WASTED bits if you encode members of P assuming that they had come from Q."
I, and clearly many other people, have run into what one can only call “KL variance”, but it doesn't seem to have an established name.
https://mathoverflow.net/questions/210469/kullback-leibler-v...
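Presumably something like the variance of the log-likelihood ratio whose mean is the KL divergence -- that is my reading of the linked question, so take the exact form as an assumption:
\[\operatorname{Var}_{X\sim P}\!\left[\log_2\frac{p(X)}{q(X)}\right]=\mathbb{E}_{X\sim P}\!\left[\left(\log_2\frac{p(X)}{q(X)}\right)^{2}\right]-D_{KL}(P,Q)^{2}\]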
After the phrase "Manipulating the logarithms, we can also get ...", the formula is incorrect, since the p_j have disappeared:
\[D_{KL}(P,Q)=-\sum_{j=1}^{n}\log_2 \frac{q_j}{p_j}=\sum_{j=1}^{n}\log_2 \frac{p_j}{q_j}\]
It should read:
\[D_{KL}(P,Q)=-\sum_{j=1}^{n} p_j \log_2 \frac{q_j}{p_j}=\sum_{j=1}^{n} p_j \log_2 \frac{p_j}{q_j}\]
The post is just basic definitions and simple examples for cross entropy and KL divergence.
There is a section at the end about the relation between cross entropy and maximum likelihood estimation that is not so easy to follow, but it implies that the limit of an estimator applied to a sample from a distribution is the KL divergence as the sample length tends to infinity.
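A quick way to check that claim numerically (a sketch with toy distributions of my own, not the post's derivation): the mean negative log-likelihood of samples drawn from P, scored under a model Q, converges to the cross entropy H(P, Q) = H(P) + KL(P, Q), so minimizing it over Q amounts to minimizing the KL divergence.

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.25, 0.125, 0.125])   # data-generating distribution P
    q = np.array([0.4, 0.3, 0.2, 0.1])        # candidate model Q

    n = 1_000_000
    sample = rng.choice(len(p), size=n, p=p)
    mean_nll = -np.mean(np.log2(q[sample]))    # empirical mean of -log2 q(x_i)

    cross_entropy = -np.sum(p * np.log2(q))    # H(P, Q)
    kl = np.sum(p * np.log2(p / q))            # KL(P, Q) = H(P, Q) - H(P)
    print(mean_nll, cross_entropy, kl)         # mean_nll approaches cross_entropy as n grows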
I have often wondered about an alternative but related metric called "organization".
Entropy, in some sense, would seem to measure "complexity", but I think it is more accurately described as "surprise".
It's useful but limited: for example, you can measure the "entropy" present in a string -- of keystrokes, or text -- and estimate how likely it is to be "coherent" or "intelligent", but this is fuzzy; too much entropy and you are at "randomness", too little and you are at "banality". It seems like a more precise (but still 0..1 bounded) metric should be possible for measuring "order" or "organization". Entropy fails at this: zero entropy does not equal "total order", just "total boringness" (heh :))
I considered something related to some archetypal canonical compression scheme (like LZ), but didn't flesh it out. Considering again now, what about the "self similarity" of the dictionary, combined with the diversity of the dictionary?
It's more of a "two-axis" metric but surely we can find a way to corral it into 0..1.
Very self-similar, and rather diverse? Highly organized.
Low self-similarity, and highly diverse? High entropy / highly disorganized.
Low self-similarity, and low diversity? Low entropy / high banality. I.e., simplicity heh :)
High self-similarity, low diversity - organized, but "less organized" than something with more diversity.
I don't think this is quite there yet, but there's intuitive sync with this.
Any takers???? :)
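Not a taker exactly, but here's a rough sketch of how the idea could be prototyped. Every choice below is a placeholder of my own: an LZ78-style parse for the dictionary, difflib's ratio as "self-similarity", dictionary entries per input symbol as "diversity", and a plain product to corral the two axes into 0..1. Whether it actually sorts examples into the four quadrants above would need experimenting.

    import difflib
    from itertools import combinations

    def lz78_dictionary(s):
        """Phrases produced by a plain LZ78-style parse of s."""
        phrases, current = {}, ""
        for ch in s:
            if current + ch in phrases:
                current += ch
            else:
                phrases[current + ch] = True
                current = ""
        return list(phrases)

    def self_similarity(entries):
        """Mean pairwise similarity between dictionary entries (0..1)."""
        pairs = list(combinations(entries, 2))
        if not pairs:
            return 0.0
        return sum(difflib.SequenceMatcher(None, a, b).ratio()
                   for a, b in pairs) / len(pairs)

    def diversity(entries, n):
        """Dictionary entries per input symbol: near 1 for noise-like input."""
        return len(entries) / n if n else 0.0

    def organization(s):
        """Toy 0..1 score: self-similarity of the dictionary times its diversity."""
        entries = lz78_dictionary(s)
        return self_similarity(entries) * diversity(entries, len(s))

    for example in ["aaaaaaaaaaaaaaaaaaaa",   # single symbol ("banality")
                    "abcabdabcabeabcabdab",   # repetitive structure
                    "qzj3k9x!f2mvb7w0plty"]:  # noise-like
        print(repr(example), round(organization(example), 3))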
You might be interested in the complexity metric from shape dynamics involving the ratio of intra-cluster to inter-cluster distances [0], the magnitude metric for maximal diversity [1], and haecceity [2] as repurposed for the meaning of entropy as specificity.
[0] https://arxiv.org/pdf/2201.07979#subsection.5.1.2
[1] https://case.edu/artsci/math/mwmeckes/publications.html
[2] 'thisness' - in the sense that the entropy quantifies the bits needed to pick out one named option from among many options
Worth clarifying that you are talking about information content, not entropy. A single text file or PNG has information content; the distribution over all possible text files or all possible PNGs has entropy.
I'm not an expert, but let me brainstorm a bit here. Something closer to the specific correlation might be what you want? In vague terms it would measure how much the various bytes of a file are related to each other, by considering how much more likely they are to be given values taken together vs considering each byte individually.
But once again, having extremely high specific correlation might indicate a trivial low-complexity example? I'd have to play around with it some more to get a good intuition for how it behaves.
Edit: It seems like this might also be very sensitive to parametrization. The specific correlation in terms of byte values would not be much more useful than the information content, because the marginal distributions aren't very interesting (e.g. over all files, how likely is byte #57 to be 0xf3 or whatever). It would be a better measure with more meaningful variables, or even something like a Markov chain where you consider many short substrings.
Anyway, specific correlation (like information) measures a specific file. The expected specific correlation over all possible files gives the total correlation, which (like entropy) measures the whole distribution. Total correlation is also the KL divergence between the joint distribution and the product of the marginal distributions! Total correlation is also the same thing as mutual information, just generalized to more than two variables. And specific correlation is the generalization of pointwise mutual information.
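A small numeric sketch of that relationship, using an arbitrary joint distribution of my own over three binary variables (nothing from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    joint = rng.random((2, 2, 2))
    joint /= joint.sum()                      # arbitrary joint distribution p(x, y, z)

    px = joint.sum(axis=(1, 2))
    py = joint.sum(axis=(0, 2))
    pz = joint.sum(axis=(0, 1))
    independent = px[:, None, None] * py[None, :, None] * pz[None, None, :]

    # Specific correlation of one particular outcome (here x=0, y=1, z=1), in bits:
    specific = np.log2(joint[0, 1, 1] / independent[0, 1, 1])

    # Total correlation = expected specific correlation
    #                   = KL(joint, product of the marginals):
    total = np.sum(joint * np.log2(joint / independent))

    print(specific, total)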
We know entropy != complexity (1). There's still no satisfactory answer for what complexity is, and the problem is only getting worse, I think, as more people decide to use the term however they wish. I'm partial to Kolmogorov (2).
Your train of thought re. compressibility vs. organization is very much in line with Kolmogorov's.
1. https://www.taylorfrancis.com/books/mono/10.1201/97804295028...
2. https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
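Since Kolmogorov complexity itself is uncomputable, a common practical stand-in is the length of the output of an off-the-shelf compressor; a rough sketch (the example inputs are mine):

    import os
    import zlib

    def compressed_size(data: bytes) -> int:
        """Length of the zlib-compressed data: a crude upper bound on K(data)."""
        return len(zlib.compress(data, level=9))

    for data in [b"a" * 1000,           # maximally boring: compresses to almost nothing
                 b"abcabdabcabe" * 83,  # structured repetition: still compresses well
                 os.urandom(1000)]:     # random bytes: barely compresses at all
        print(len(data), compressed_size(data))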
I'm not sure you can calculate the "organization" of a sequence in a completely generic, universally applicable sense. It seems you would already have to know something about how a sequence might be organized -- the constraints for a specific example or organization scheme. Digging through hypothetical organization that includes self-reference would require an additional layer of understanding. In other words, wouldn't a sufficiently complex sequence be indistinguishable from random noise, so that the analysis would have to happen outside the expression system for at least some examples? This seems very related to the halting problem and to the existence of proofs impossible to express in the base axiomatic system. It's been a long time since I studied finite state automata, so this is just a vague intuition from old memories.
Edit: this is not to say that there couldn't be a practical measure of organization for specific examples/types/subsets of sequences. The halting problem doesn't prevent us from analyzing program "correctness" in specific formal systems. It just prevents us from having a universal ability to fully determine the answer for a generic example.
Cross entropy from first principles: https://youtu.be/KHVR587oW8I
In the first definition of D(P,Q), the author dropped a p_j within the sum.
Now do the evidence lower bound. That one's a pain to explain.
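For the record, the standard one-line decomposition (a generic sketch, not something from this post): for a latent-variable model p(x, z) and any distribution q(z),
\[\log p(x)=\mathbb{E}_{q(z)}\!\left[\log\frac{p(x,z)}{q(z)}\right]+D_{KL}\!\left(q(z),\,p(z\mid x)\right)\ \ge\ \mathbb{E}_{q(z)}\!\left[\log\frac{p(x,z)}{q(z)}\right]\]
The expectation term is the evidence lower bound, so maximizing it over q is the same as minimizing the KL divergence from q(z) to the true posterior p(z | x).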