I have statistics for $100$ questions, each of which can be answered either "yes" or "no":
1) $63.3 - 36.7$ ($63.3\%$ of respondents answered "yes" and $36.7\%$ answered "no")
2) $30.1 - 69.9$
...
100) $88.0 - 12.0$

Those $100$ answer distributions are our sample.

Then I ask the same $100$ questions to a $101^{\text{st}}$ respondent and get a new set of answers:
1) yes
2) yes
...
100) no

What I need to do is somehow calculate a "correlation" value between this respondent and the overall sample. Any method will be OK; I just need to get some number, so there may be several "right" answers.

Thanks in advance.

  • 0
    I think what you are asking for is the [sample correlation coefficient](http://en.wikipedia.org/wiki/Correlation_and_dependence). 2010-08-16
  • 0
    This is not, directly, the correlation coefficient, but you can write down some relevant measures using correlation coefficients. Asking on http://stats.stackexchange.com will get answers from statisticians. 2010-08-17
  • 0
    Thanks, I'll continue my little research. Never thought it would require such an immersion in statistics :-) 2010-08-17

1 Answer

[In short, either of two quantities will meet your need: $d$ (easy to calculate), or a monotonic function of $d$, namely $P(D\lt d)$ (needs bootstrapping to calculate). Both share the property that the smaller they are, the greater the chance that the new point comes from the surveyed population. Both are described below. For a detailed discussion, read on.]

The problem can essentially be reformulated as a statistical hypothesis testing problem, and we need to test if the new observation comes from the surveyed population.

I will outline the construction of one such test. For this I will assume the questions are independent.

$H_0$: The new observation comes from the population. Assuming independence, the population can be characterized by a vector of probabilities $(p_1,p_2,\ldots,p_n)$ for the $n$ questions. These $p_i$ are your observed proportions of "yes" answers (for example, $p_1 = 0.633$).

$H_1$: The new observation does not come from the population.

Suppose the new observation is $B=(B_1,...,B_n)$, where each $B_i$ is 1 if "yes" and 0 if "no" was entered.

A reasonable test statistic would be $D=\sum_{i=1}^{n} (B_i-p_i)^2$, which measures the total squared deviation from the expected answers. Its observed value for the new respondent is denoted $d$.
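As a rough illustration of computing the observed value $d$, here is a minimal Python sketch. The function name `squared_deviation` is made up for this example, and the data are only the three rows shown in the question (questions 1, 2 and 100); the elided rows are not filled in.

```python
def squared_deviation(answers, p_yes):
    """Return d = sum_i (B_i - p_i)^2 for one respondent.

    answers: 1 for "yes", 0 for "no" (the B_i)
    p_yes:   population proportion of "yes" per question (the p_i)
    """
    if len(answers) != len(p_yes):
        raise ValueError("answers and p_yes must have the same length")
    return sum((b - p) ** 2 for b, p in zip(answers, p_yes))


# Only the three rows visible in the question: 63.3%, 30.1% and 88.0% "yes",
# answered "yes", "yes", "no" by the new respondent.
p_yes = [0.633, 0.301, 0.880]
answers = [1, 1, 0]
d = squared_deviation(answers, p_yes)
print(round(d, 5))  # 0.367^2 + 0.699^2 + 0.880^2 = 1.39769
```

On its own, $d$ is only meaningful relative to the values other respondents would produce, which is why the distribution of $D$ matters.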

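You can then judge whether to label the new respondent as coming from your population by evaluating $P(D\lt d)$, the probability under $H_0$ of a total deviation smaller than the observed value $d$. A small value means the respondent is at least as close to the population's answers as most of its members would be, so there is no evidence against $H_0$. A large value means the deviation is unusually large: for example, if $P(D\lt d) > 95\%$, you can reject $H_0$ at the 5% significance level.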

Computing the exact distribution of $D$ under $H_0$ is challenging. To estimate $P(D\lt d)$ you can use bootstrapping (more precisely, Monte Carlo simulation from the $p_i$): simulate observations under $H_0$ many times, say a million, and check in what proportion of them the simulated $D$ is less than the observed $d$.
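The simulation could be sketched as follows, using only the Python standard library. The function name `estimate_p_less`, the five-question `p_yes` vector, and the default of 10,000 simulations (rather than a million, to keep pure-Python runtime short) are all assumptions made for illustration.

```python
import random


def estimate_p_less(answers, p_yes, n_sims=10_000, seed=0):
    """Estimate P(D < d) under H0 by simulating independent respondents."""
    rng = random.Random(seed)
    d_obs = sum((b - p) ** 2 for b, p in zip(answers, p_yes))
    below = 0
    for _ in range(n_sims):
        # One synthetic respondent: question i is "yes" with probability p_i.
        d_sim = sum(((1 if rng.random() < p else 0) - p) ** 2 for p in p_yes)
        if d_sim < d_obs:
            below += 1
    return below / n_sims


# Hypothetical five-question survey (made-up proportions of "yes").
p_yes = [0.633, 0.301, 0.880, 0.750, 0.200]
majority = [1, 0, 1, 1, 0]    # agrees with the majority on every question
contrarian = [0, 1, 0, 0, 1]  # disagrees with the majority on every question

p_majority = estimate_p_less(majority, p_yes)
p_contrarian = estimate_p_less(contrarian, p_yes)
print(p_majority, p_contrarian)
```

The majority respondent attains the smallest possible $d$, so no simulated value can fall strictly below it and the estimate is 0. The contrarian attains the largest possible $d$, so the estimate is close to 1.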

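If you need a "correlation"-like statistic taking values between 0 and 1, you can look at $1-P(D\lt d)$, which is near 1 when the respondent's answers are close to the population's and near 0 when they deviate unusually far.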
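Because the distribution of $D$ under $H_0$ depends only on the $p_i$ and not on the respondent, one option is to simulate it once, sort it, and then score any number of respondents against it. The sketch below takes that approach; the names `null_distribution` and `similarity` and the five-question data are hypothetical.

```python
import bisect
import random


def null_distribution(p_yes, n_sims=10_000, seed=0):
    """Simulate D under H0 n_sims times and return the values sorted."""
    rng = random.Random(seed)
    sims = [
        sum(((1 if rng.random() < p else 0) - p) ** 2 for p in p_yes)
        for _ in range(n_sims)
    ]
    sims.sort()
    return sims


def similarity(answers, p_yes, sorted_null):
    """Return the estimate of 1 - P(D < d) for one respondent."""
    d_obs = sum((b - p) ** 2 for b, p in zip(answers, p_yes))
    # bisect_left gives the number of simulated values strictly below d_obs.
    n_below = bisect.bisect_left(sorted_null, d_obs)
    return 1.0 - n_below / len(sorted_null)


# Hypothetical five-question survey (made-up proportions of "yes").
p_yes = [0.633, 0.301, 0.880, 0.750, 0.200]
null = null_distribution(p_yes)

score_majority = similarity([1, 0, 1, 1, 0], p_yes, null)
score_contrarian = similarity([0, 1, 0, 0, 1], p_yes, null)
print(score_majority, score_contrarian)
```

A respondent who sides with the majority on every question gets a score of 1, while one who always sides with the minority gets a score near 0. Intermediate respondents fall in between, ordered by their value of $d$.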