Talk:Fisher information metric


Redirect

I suspect the appropriate fate of this article is to redirect it to Fisher information, incorporate everything here into that article, and greatly expand it so it's no longer a stub, putting in the usual topics on Fisher information, which at this point are missing. Michael Hardy 01:14, 7 Apr 2004 (UTC)

  • Let's emphasize that one more time, eh? "STUB" You are yet only a student of the obvious. The *normal* way to do this is to add a msg:stub header at the top of the page.
  • This is a metric, in the differential geometry sense of the word. Read the books cited for more information. Kevin Baas 16:13, 7 Apr 2004 (UTC)

Question

Anyone know what a "statistical manifold" is? Or a "statistical differential manifold"? Melcombe (talk) 14:15, 18 September 2009 (UTC)[reply]

Statistical manifold = smooth parametric model = a finite-dimensional family of probability distributions that can be viewed as a smooth manifold = it can be smoothly parametrized by finite-dimensional θ, at least locally. stpasha » 17:40, 9 February 2010 (UTC)[reply]
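A concrete illustration (my example, not drawn from the article): the Gaussian family

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad \theta = (\mu, \sigma) \in \mathbb{R} \times (0, \infty),$$

is a two-dimensional statistical manifold: each point θ of the upper half-plane is a probability distribution, and the dependence on θ is smooth.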
And the article doesn't say anything like that, at present. Melcombe (talk) 10:54, 10 February 2010 (UTC)[reply]

Another Question

Why doesn't this article mention the Cramér–Rao bound? It seems to this reader that it belongs in the first paragraph. It is clearly the most important application, and the link will lead readers to a more accessible treatment of the concept in a simpler situation. — Preceding unsigned comment added by Robroot (talk | contribs) 23:20, 1 March 2014 (UTC)[reply]

Robroot 23:22, 1 March 2014 (UTC)[reply]
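For reference, the bound in question, in its standard scalar form (stated here for context, not quoted from the article): any unbiased estimator $\hat\theta$ of θ satisfies

$$\operatorname{Var}(\hat\theta) \ge \frac{1}{\mathcal{I}(\theta)}, \qquad \mathcal{I}(\theta) = \operatorname{E}\!\left[\left(\partial_\theta \log p(X;\theta)\right)^2\right],$$

which is what makes the Fisher information, and hence the metric built from it, statistically meaningful.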

Dubious

The article doesn't define a metric. For reference, a metric is a function $d : M \times M \to \mathbb{R}_+$, where M is the smooth manifold in question. The quantity g described in the article fails in many respects:

  • g takes values in the space of k×k matrices (where k is the dimension of the “parameter vector”), not in the space $\mathbb{R}_+$
  • g is a function of a single argument, instead of being a function of two arguments. So it's not even possible to check whether it defines a symmetric function, or whether it is equal to zero only when the distributions are equal
  • Using the parameter θ is not adequate when talking about smooth manifolds. By definition of a manifold, it can be parametrized only locally; in general no single θ covers the entire manifold.

... stpasha » talk » 16:59, 21 September 2009 (UTC)[reply]

I see that Information geometry#Fisher information metric as a Riemannian metric duplicates what is here. Melcombe (talk) 09:38, 22 September 2009 (UTC)[reply]
I tried to delete this because it is NOT dubious (and how could it be, it's a definition). Someone put it back up, not sure why. This is not a matter of opinion. The person who wrote the information below is not familiar with a Riemannian metric or information geometry. 17:31, 22 December 2009 Hotnewrelease (talk | contribs)
Standard Wikipedia conventions are not to delete stuff from talk/discussion pages but to retain it for other people's edification, and "for the record". Change the article so that it is correct, but leave the discussion intact and explain on the discussion page why confusion has arisen or what is wrong. Also, add at the end of threads. Melcombe (talk) 17:59, 22 December 2009 (UTC)[reply]

For the record, there are MANY different notions of metric.

It doesn't define a metric in the sense you are mentioning, but for a general semi-Riemannian manifold it does define one; see Riemannian metric. These are very well studied because of their relevance to general relativity.

g doesn't actually take a single argument; it is a matrix, and the indices represent the components of a matrix (tensor). The importance is this: we can construct an infinitesimal distance between two points that is locally valid, and the way this distance varies across the space defines the intrinsic curvature of the space. The scalar product defined by this metric is invariant with respect to reparametrizations, so no matter how we relabel our coordinate space, we still have a valid description of the distance between two points.
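In symbols, what the comment above describes is the standard line element (my rendering of it): the matrix $g_{jk}(\theta)$ defines an infinitesimal squared distance

$$ds^2 = \sum_{j,k} g_{jk}(\theta)\, d\theta^j\, d\theta^k,$$

and this scalar is invariant under a smooth reparametrization $\theta \mapsto \theta'$, because the Jacobian factors acquired by the $d\theta$'s cancel against those acquired by $g_{jk}$.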

No one ever said the parameters had to be globally valid. —Preceding unsigned comment added by 128.135.24.135 (talk) 04:01, 10 December 2009 (UTC)[reply]

I corrected the wikilink in the above. (I see it was put correctly in the article.) Unfortunately, that redirects to something that is generally incomprehensible. But then that matches the present version of this article, which totally fails to explain what it is about, or why it might be important. Melcombe (talk) 10:22, 10 December 2009 (UTC)[reply]

I retract my previous comment; it was erroneous.  … stpasha »  17:40, 9 February 2010 (UTC)[reply]

Clarification

In this edit, User:Melcombe asked for clarification.

I'll clarify here, rather than mess up the article: yes, the expressions are equivalent; plug one into the other and turn the crank. If you get stuck, recall that

$$\int_X p(x; \theta)\, dx = 1$$

and that the second derivative of 1 is zero (and that the derivatives commute with the integral). linas (talk) 02:29, 12 July 2012 (UTC)[reply]
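For readers turning the crank, here is the standard computation being sketched (my reconstruction, not a quotation of the comment above): writing $\ell = \log p(x;\theta)$ and using $\partial_j \partial_k \ell = \frac{\partial_j \partial_k p}{p} - \frac{\partial_j p}{p}\frac{\partial_k p}{p}$,

$$-\operatorname{E}[\partial_j \partial_k \ell] = -\int_X \partial_j \partial_k p\, dx + \int_X \frac{\partial_j p\, \partial_k p}{p}\, dx = -\partial_j \partial_k \int_X p\, dx + \operatorname{E}[\partial_j \ell\, \partial_k \ell] = \operatorname{E}[\partial_j \ell\, \partial_k \ell],$$

since $\int_X p\, dx = 1$ has vanishing derivatives. This is the equivalence of the two standard forms of $g_{jk}(\theta)$.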

Definition is mathematically incoherent

The section "Formal definition" is not a formal mathematical definition at all, it is incoherent. Here is the section in question:

Let X be an orientable manifold, and let $\mu$ be a measure on X. Equivalently, let $(\Omega, \mathcal{F}, P)$ be a probability space on $\Omega = X$, with sigma algebra $\mathcal{F}$ and probability $P$.
The statistical manifold S(X) of X is defined as the space of all measures on X (with the sigma-algebra held fixed). Note that this space is infinite-dimensional, and is commonly taken to be a Fréchet space. The points of S(X) are measures.
Pick a point $\mu \in S(X)$ and consider the tangent space $T_\mu S$. The Fisher information metric is then an inner product on the tangent space. With some abuse of notation, one may write this as
$$g(\sigma_1, \sigma_2) = \int_X \frac{d\sigma_1}{d\mu} \frac{d\sigma_2}{d\mu}\, d\mu.$$
Here, $\sigma_1$ and $\sigma_2$ are vectors in the tangent space; that is, $\sigma_1, \sigma_2 \in T_\mu S$. The abuse of notation is to write the tangent vectors as if they are derivatives, and to insert the extraneous d in writing the integral: the integration is meant to be carried out using the measure $\mu$ over the whole space X.

The tangent space to a space of measures S at a measure μ is the linear space S itself. So in the displayed formula, each tangent vector σi is a measure. Then dσi/dμ must be the (absolutely continuous part of the) Radon-Nikodym derivative of σi with respect to μ. This is not an abuse of notation; it is standard notation and has a completely rigorous definition. There are not too many d's, that is fine, no need to apologize. And the expressions σi and dσi/dμ are not both tangent vectors. The first is a tangent vector, and the second is a linear image of it that is used for defining the inner product.

That is where the problems begin. In the formula, Wikipedia is attempting to multiply two such functions and integrate them against the underlying measure. There are two problems with this:

(1) The functions may not lie in L2, so that the definition makes no sense. (The integral cannot be carried out.)

(2) If σi is singular with respect to μ, then the given expression (the absolutely continuous part of the Radon–Nikodym derivative) is too small to capture the size of the vector σi relative to μ. In some cases (for example when $\sigma_1 = \sigma_2$), it could be useful to conventionally take the integral to be infinity, and work on interpreting that. In other cases, it's like the objection in (1) but worse.
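Concrete instances of both failures, for illustration (the examples are mine, not the IP editor's): for (1), take X = (0,1) with μ the Lebesgue measure and $\sigma_1 = \sigma_2$ the finite measure with density $f(x) = x^{-3/4}$; then $d\sigma_i/d\mu = f$ is μ-integrable, but $\int_X f^2\, d\mu = \int_0^1 x^{-3/2}\, dx = \infty$, so the inner product diverges. For (2), take $\sigma_1 = \sigma_2 = \delta_{1/2}$, a Dirac point mass: it is singular with respect to μ, its absolutely continuous part is 0, and the formula assigns squared length 0 to a manifestly nonzero tangent vector.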

So I would not characterize this as a "formal" definition, but rather as a blackboard proposal of a definition that is interesting, but doesn't work yet.

There is a whole literature on putting both distance metrics and Riemannian metrics on spaces of measures and on Banach manifolds. The problems are also well known. For L^2 type Riemannian metrics, similar to what is being attempted here, there are some papers of Marsden and other nonlinear functional analysts starting in the 1970s. And so forth. 178.38.76.15 (talk) 09:52, 22 November 2014 (UTC)[reply]

P.S. As an example of how to do this kind of thing correctly, see Kullback–Leibler_divergence#Definition, where all the caveats are correctly explained. 178.38.76.15 (talk) 10:08, 22 November 2014 (UTC)[reply]
P.P.S. The substantive problems go away if we restrict the scope of the definition to finite probability spaces, though some problems of notation and expression remain. 178.38.157.66 (talk) 15:42, 23 November 2014 (UTC)[reply]
The category of probability spaces is precisely those spaces that have the Radon–Nikodym property; the Hilbert spaces are square-integrable, and so there is no question that the integral is convergent. I believe that the confusion about the derivative can be ascribed to the fact that it is an abuse of notation: you can get away with thinking of these as being actual "derivatives" for a little while, but then run into trouble. Instead, you have to pick a weak topology, and perform integration on that. When you do so, the use of the "d" symbol becomes nonsensical; it's not terribly well defined. That's why it's an abuse of notation. The L² property that you are looking for is possessed by the Hilbert spaces, and the topology that you must endow these spaces with is the weak topology; the condition that something be square-integrable is the same thing as saying that a Cauchy sequence converges to a finite value in the weak topology. 67.198.37.16 (talk) 18:23, 30 September 2016 (UTC)[reply]

"Change in entropy" section is dubious[edit]

From the article:

The path parameter here is time t; this action can be understood to give the change in entropy of a system as it is moved from time a to time b. Specifically, one has
$$\Delta S = (b-a)\,A$$
as the change in entropy.

This seems unlikely. The change in entropy should be a signed quantity depending only on the endpoints, whereas the action is a positive quantity that depends on the path. Could it be the total variation of entropy along the path rather than the change of entropy? Could an inequality be true instead? 178.38.157.66 (talk) 13:58, 23 November 2014 (UTC)[reply]
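To make the OP's point concrete (standard Riemannian geometry, my addition to the thread): by the Cauchy–Schwarz inequality, the action dominates the squared path length,

$$A = \frac{1}{2}\int_a^b g_{jk}\,\dot\theta^j \dot\theta^k\, dt \;\ge\; \frac{L^2}{2(b-a)}, \qquad L = \int_a^b \sqrt{g_{jk}\,\dot\theta^j \dot\theta^k}\; dt,$$

with equality only for constant-speed geodesics; so A is strictly positive for any nonconstant path and depends on the path, not only on its endpoints.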

Huh? If at time "a" I have rum in a bottle, and coke in a bottle, there are clearly two different endpoints for time "b": (1) the rum and coke are still in the bottles, and (2) the rum and coke are mixed in a glass. These two cases have very different entropies, and the final entropy is very heavily path-dependent on how you got to each respective endpoint. 67.198.37.16 (talk) 16:25, 30 September 2016 (UTC)[reply]
I agree with OP that this statement really seems absurd. The second law of thermodynamics states the existence of entropy as a *state function*, that is, a function on the space of probabilities. Therefore the change in entropy depends only on the endpoints: $\Delta S = S(b) - S(a)$, independently of the path.
A quantity that is neither a state function nor the derivative of a state function is the heat flux $\delta Q$, which contributes to the created entropy. It is thus meaningful to minimize the creation of entropy of the universe when moving a system interacting with a thermostat from a state a to a state b: the state of the thermostat (i.e. its internal energy) is not specified (even though its temperature is fixed), therefore the total entropy can reach different values at b depending on the heat transfers that took place.
Anyway, the quantity displayed in the article is undefined. The section is either misleading, meaningless or false...
As to the rum and coke: there is a final state b in the bottles, another state c in the glass, and inside my liver still another state d, therefore it is no surprise that they have different entropies. What matters is that the rum+coke entropy in state d does not depend on the time it took to drink the glass nor on the way I chose to mix the drink.
Olivier Peltre (talk) 15:36, 19 November 2017 (UTC)[reply]

"As Euclidean metric" section dubious[edit]

The differential geometry in this section is incoherent.

What is the manifold here? The sphere or the ambient Euclidean space? If it is the sphere, it is not flat (as the article claims) but round, the y^i are not coordinates on it, and we don't have the tangent vectors that are claimed. If it is flat Euclidean space, why is the sphere mentioned?

It can all be straightened out, but the answer is surprising. The current article is trying, but failing, to express the following theorem, which appears in the Gromov paper cited:

Theorem. The Fisher information metric on the n-simplex $\Delta := \{(x_0, \dots, x_n) \mid x_i \ge 0,\ x_0 + \cdots + x_n = 1\}$ of probability measures on n+1 elements is isometric to the standard round metric on the positive octant of the n-sphere, up to a factor of 4.

But the theorem needs to be stated correctly. The current article says:

That is, the Fisher information metric on a statistical manifold is simply (four times) the flat Euclidean metric, after appropriate changes of variable.

This is essentially the opposite of what's true. The closest correct statement would be:

The Fisher information metric on the statistical manifold of all probability measures on a discrete space is simply (four times) the round metric on an octant of a sphere, after appropriate changes of variable.
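The underlying change of variables is short enough to record here (a standard computation, my sketch rather than anything quoted from the article or from Gromov): set $y_i = \sqrt{x_i}$, so that $\sum_i y_i^2 = 1$ and the points with $y_i \ge 0$ form the positive octant of the n-sphere. Then $dx_i = 2 y_i\, dy_i$, and the Fisher metric on the simplex becomes

$$ds^2 = \sum_{i=0}^n \frac{(dx_i)^2}{x_i} = \sum_{i=0}^n \frac{4 y_i^2\, (dy_i)^2}{y_i^2} = 4 \sum_{i=0}^n (dy_i)^2,$$

that is, four times the ambient Euclidean metric restricted to the sphere, which is exactly the round metric on the octant.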

The derivation must be totally rewritten.

The Gromov paper is probably brilliant, but it is mostly unreadable. Like Joyce, or Linear A, but with the decoding lying in the future -- it will take 100 years to figure it out. For most of its claims, the Gromov paper is not an adequate source for an encyclopedia. It is essentially a scientific program that has not been carried out! Even for the current easy theorem, it is not great; it disposes of it in about two lines, plus some playful coyness. No wonder the previous Wikipedia editor was misled.

By the way, a robot flagged this edit as "potentially unconstructive".

178.38.157.66 (talk) 15:25, 23 November 2014 (UTC)[reply]

There are seven other references given, besides Gromov, and I am pretty certain that all of them give variant derivations of the same idea: there's nothing strange going on here. First, coordinates: the y's are perfectly valid coordinates, you just have to subject them to the appropriate constraints, so that they live on the surface of the sphere. Lagrange multipliers are a standard way of doing this. I took a look at that article, and lo and behold, it has a section, Lagrange multipliers#Example 3: Entropy, that explicitly deals with the use of Lagrange multipliers for working with information entropy. To work with a sphere, just take the squares. Example 1 in that article almost does that: it special-cases to the constraint being a circle.
I see no problem describing tangent vectors on a sphere. I suspect that you find the notation confusing: the use of partial derivatives to stand for basis vectors of a tangent space is standard notation in books on Riemannian geometry, but is disconcerting if you haven't seen it before. Anyway, as you point out, it's an isometry between a simplex in ambient Euclidean space and a quadrant of a sphere (also in ambient Euclidean space) mapped to that simplex. There are two distinct manifolds at work here, not one. 67.198.37.16 (talk) 16:38, 30 September 2016 (UTC)[reply]
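For anyone who wants to check the isometry numerically rather than on paper, here is a small sanity check in Python (entirely my own sketch; the parametrization of the 2-simplex and all function names are mine, not from the cited references):

    import numpy as np

    # Check that the Fisher metric on the probability 2-simplex equals
    # 4x the Euclidean metric pulled back through y_i = sqrt(x_i),
    # i.e. 4x the round metric on the positive octant of the sphere.

    # Parametrize the interior of the 2-simplex by theta = (x_0, x_1),
    # with x_2 = 1 - x_0 - x_1.
    def x_of_theta(theta):
        x0, x1 = theta
        return np.array([x0, x1, 1.0 - x0 - x1])

    def fisher_metric(theta):
        # g_jk = sum_i (dx_i/dtheta_j)(dx_i/dtheta_k) / x_i  for the
        # categorical distribution p(i) = x_i.
        x = x_of_theta(theta)
        J = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [-1.0, -1.0]])  # Jacobian dx_i/dtheta_j
        return J.T @ np.diag(1.0 / x) @ J

    def pullback_round_metric(theta, eps=1e-6):
        # Differentiate y(theta) = sqrt(x(theta)) numerically and form
        # 4 * Jy^T Jy, the pullback of 4x the ambient Euclidean metric.
        def y(t):
            return np.sqrt(x_of_theta(t))
        Jy = np.column_stack([(y(theta + eps * e) - y(theta - eps * e)) / (2 * eps)
                              for e in np.eye(2)])
        return 4.0 * Jy.T @ Jy

    theta = np.array([0.2, 0.5])
    print(fisher_metric(theta))
    print(pullback_round_metric(theta))  # agrees up to finite-difference error

The two printed matrices agree to within numerical differentiation error, consistent with the factor-of-4 isometry discussed above.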

The article is largely incoherent

There are dozens of places in this article, in most of the sections, where the mathematics is buggy or even incoherent.

For example:

  • Definition: The "equivalent formulation" misses the mark. It's the wrong equivalence for the derivation given.
  • Relation to the Kullback–Leibler divergence: Not explicit enough. Does it only work at extreme points?
  • Change in free entropy: Criticized as incoherent on the talk page.
  • Relation to the Jensen–Shannon divergence: The notation is not understandable.
  • As Euclidean metric: Criticized as incoherent on the talk page.
  • As Fubini–Study metric: Fails to specify variables. A lot of incoherence.
  • Continuously-valued probabilities: More incoherence. Objects are not defined correctly.

The article needs a serious warning: "Needs attention from an expert." I don't know how to add it.

2A02:1210:2642:4A00:9010:30F5:E080:70AE (talk) 18:40, 7 February 2024 (UTC)[reply]