I don't think you should expect any short, snappy answers because I think this is a very deep question. Here is a guess at a conceptual explanation, which I can't quite flesh out.
Our starting point is something called the principle of maximum entropy, which says that whenever you're trying to assign a probability distribution to some events, you should choose the maximum-entropy distribution consistent with your knowledge. For example, if you don't know anything and there are $n$ possible events, then the maximum entropy distribution is the uniform one, where each event occurs with probability $\frac{1}{n}$. There are lots more examples in this expository paper by Keith Conrad.
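As a quick sanity check on that first example (a sketch of my own, assuming NumPy; the helper `entropy` is just the usual Shannon entropy), you can draw lots of random distributions on $n$ events and see that none of them beats the uniform one:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

n = 5
uniform = np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
# 10,000 random probability distributions on n events (Dirichlet draws).
random_dists = rng.dirichlet(np.ones(n), size=10_000)

print("uniform:", entropy(uniform))                           # log 5 ≈ 1.609
print("best random:", max(entropy(p) for p in random_dists))  # strictly smaller
```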
Now take a bunch of independent identically distributed random variables $X_i$ with mean $\mu$ and variance $\sigma^2$. You know exactly what the mean of $\frac{X_1 + ... + X_n}{n}$ is; it's $\mu$ by linearity of expectation. Variance is also additive, at least over independent variables (this is a probabilistic form of the Pythagorean theorem), hence
$$\text{Var}(X_1 + ... + X_n) = \text{Var}(X_1) + ... + \text{Var}(X_n) = n \sigma^2$$
but since variance scales quadratically ($\text{Var}(cX) = c^2 \, \text{Var}(X)$), the variance of $\frac{X_1 + ... + X_n}{n}$ is actually $\frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}$; in other words, it goes to zero! This is a simple way to convince yourself of the (weak) law of large numbers.
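Here is a small simulation of that $\frac{\sigma^2}{n}$ decay (again my own sketch, assuming NumPy; the choice of exponential variables, which have $\mu = \sigma^2 = 1$, is just for concreteness):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                      # exponential(1) has mean 1 and variance 1

for n in [10, 100, 1000]:
    # 10,000 independent sample means, each averaging n iid exponential(1) variables
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(n, means.var(), sigma2 / n)   # empirical variance ≈ sigma^2 / n
```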
So we can convince ourselves that (under the assumptions of finite mean and variance) the average of a bunch of iid random variables tends to its mean. If we want to study how it tends to its mean, we need to instead consider $\frac{(X_1 - \mu) + ... + (X_n - \mu)}{\sqrt{n}}$, which has mean $0$ and variance $\sigma^2$.
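Explicitly, using linearity of expectation and the same additivity of variance as above,
$$\mathbb{E}\left[ \frac{(X_1 - \mu) + ... + (X_n - \mu)}{\sqrt{n}} \right] = 0, \qquad \text{Var}\left( \frac{(X_1 - \mu) + ... + (X_n - \mu)}{\sqrt{n}} \right) = \frac{n \sigma^2}{n} = \sigma^2,$$
so dividing by $\sqrt{n}$ is exactly the rescaling that keeps the variance from either blowing up or vanishing.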
Suppose we suspected, for one reason or another, that this tended to some fixed limiting distribution depending on $\sigma^2$ alone. We might be led to this conclusion by seeing this behavior for several particular distributions, for example. Given that, it follows that we don't know anything about this limiting distribution except its mean and variance. So we should choose the distribution of maximum entropy with the given mean and variance. And this is precisely the corresponding normal distribution! Intuitively, each iid random variable is like a particle moving randomly, and adding up the contributions of all of the random particles adds "heat," or "entropy," to your system. (I think this is why the normal distribution shows up in the description of the heat kernel, but don't quote me on this.) In information-theoretic terms, the more iid random variables you sum, the less information you have about the result.
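To see the "summing adds entropy" picture numerically (once more just a sketch of mine, not part of the argument; it assumes NumPy and a recent SciPy, whose `scipy.stats.differential_entropy` estimates entropy from samples), compare the entropy of the rescaled sum to the Gaussian maximum $\frac{1}{2}\log(2\pi e \sigma^2)$:

```python
import numpy as np
from scipy.stats import differential_entropy

rng = np.random.default_rng(2)
mu, sigma2 = 1.0, 1.0             # exponential(1): mean 1, variance 1

# The normal distribution N(0, sigma^2) has the largest differential entropy
# among all distributions with variance sigma^2: (1/2) log(2 pi e sigma^2).
gaussian_bound = 0.5 * np.log(2 * np.pi * np.e * sigma2)

for n in [1, 2, 5, 20, 100]:
    x = rng.exponential(scale=1.0, size=(50_000, n))
    z = (x - mu).sum(axis=1) / np.sqrt(n)       # mean 0, variance sigma^2
    print(n, round(differential_entropy(z), 3), round(gaussian_bound, 3))
# The estimated entropy of z climbs toward the Gaussian bound as n grows.
```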