Comments


CompletelyPresent t1_jeduv5n wrote

I always heard 1500 participants are required to make it valid.

But look how easily even that many can be skewed. What if a survey reports that 100% of people think God is real, but the 1,500 respondents were all from rural Texas? The result would be heavily biased.

Source: Took 4 statistics classes in a row while getting my MBA.

3

ItsACaragor t1_jedvhjm wrote

30 is not representative at all.

Generally the minimum is around 1,000 people, chosen at random, to be even somewhat representative.

1

Lordaxxington t1_jedvj61 wrote

There are 7.8 billion people in the world. Obviously, many surveys are only intended for a certain audience, and even in very general surveys there is a problem of getting that many people to take part. But the more people you ask, the more likely you are to get a general result that really reflects the average or majority answer.

Say you are friends with a lot of model train enthusiasts. You ask the first 30 people you know for their main hobby, and the answers will appear to be overwhelmingly model trains. Expand that to the first 300 people you know, and some more common hobbies will show up in the results, but model trains might still stick out as quite a popular answer. Expand that to 3000 people, and you'd start to see that model trains are actually a pretty niche hobby.
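To illustrate the second half of this (larger random samples homing in on the true popularity of a niche hobby), here's a small simulation; the population size and the 2% figure are invented purely for the example:

```python
import random

random.seed(42)

# Hypothetical population of 100,000 people; model trains is
# genuinely niche, named as the main hobby by only 2% of them.
# (These numbers are made up for illustration.)
population = ["model trains"] * 2_000 + ["other"] * 98_000

for n in (30, 300, 3_000):
    sample = random.sample(population, n)
    share = sample.count("model trains") / n
    print(f"asked {n:5d} people at random: model trains share = {share:.3f}")
```

The small samples bounce around, while the 3,000-person sample lands close to the true 2%. The comment's friend-circle scenario is worse than this, because the first 30 people you know aren't drawn at random at all.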

There can still be biased or unusual results in a survey with more people, because you have to consider how the participants were found and how diverse they are in other categories (for example, a poll on Reddit with thousands of answers will still only reflect the opinions of people who spend time on Reddit). And a smaller survey could actually have very representative answers, but this is hard to know.

There's a lot to consider in statistics, but sample size is generally an important factor because a larger sample gives you a wider, more representative distribution of answers.

5

ForgetTheWords t1_jedzfh1 wrote

Generally speaking, the more observations you make (e.g. survey responses), the easier it is to detect an effect. Probably what you heard is that, for the kind of effect sizes one usually sees in whatever context was being discussed, it takes ~30 responses to be reasonably sure (probably a 5% or less chance of being wrong) that the difference observed is caused by a true difference in the population and not mere chance (i.e. you just happened to get a sample where your hypothesis was true, even though it isn't true for the population).

The classic example is pulling coloured balls from a bag. How many balls do you have to pull to get a good idea of what percentage of the balls in the bag are what colour? It depends, of course, on how many balls there are and how the colours are distributed. You have to at least estimate those numbers before you decide what kind of test to do. If there are only ten balls, you could probably just do a census - i.e. look at every ball. If there are 500k balls, you'll only be able to observe a sample. But how big a sample do you need? If you expect the distribution to be ~evenly divided between two colours, you may be able to get away with only 30. If, however, you expect ~25 colours, or that some colours will show up only ~1% of the time, say, you'll need a lot more observations before you can be reasonably confident your sample resembles the population (every ball in the bag).
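A quick sketch of the two bag scenarios above (the bag sizes and colour mix are the illustrative numbers from the comment, and the 0.99^30 calculation treats the draws as effectively independent, which is fine for a bag this large):

```python
import random

random.seed(1)

# Bag of 500,000 balls split evenly between two colours.
bag_even = ["red"] * 250_000 + ["blue"] * 250_000
# Bag where one colour makes up only ~1% of the balls.
bag_rare = ["rare"] * 5_000 + ["common"] * 495_000

draws_even = random.sample(bag_even, 30)
draws_rare = random.sample(bag_rare, 30)
print("even bag, 30 draws, red count: ", draws_even.count("red"))
print("rare bag, 30 draws, rare count:", draws_rare.count("rare"))

# With a 1% colour, 30 draws will often miss it completely:
print(f"chance of zero rare balls in 30 draws: {0.99 ** 30:.2f}")
```

With an even split, 30 draws usually lands somewhere near 15 reds, so the sample gives a usable estimate. With a 1% colour there's roughly a 74% chance of not seeing it even once, which is why rare categories demand far more observations.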

Bear in mind that most statistical tests assume the sample was drawn randomly. In practice, it is very hard if not impossible to randomly sample humans for a survey. So you generally will want to get more responses to make your statistical tests more powerful (more likely to distinguish a true effect) while keeping your significance level (likelihood that the effect observed is only by chance) reasonably low.

If you could get a truly random sample, you'd need fewer observations to have a good chance that your sample is representative. If it's only mostly random, there's a higher chance that any effect you observe is because of a bias in the sampling. Thus, you will probably want to be more strict in declaring that an observed effect is genuinely present in the population.

But by choosing to reject more findings that could have happened by chance, you make it harder to accept findings that are because of a genuine effect in the population. A real but small effect in the population is not easily distinguishable from a small effect in the sample caused by nonrandom sampling.

4

defalt86 t1_jedznpz wrote

There is no set minimum for a valid sample. What matters is the confidence interval: roughly, how much of the sample needs to align before you can call the result significant.

Imagine you only asked 2 people if America is fascist, and they both said yes. Does this mean everyone thinks America is fascist, or is it just random? You have no confidence.

If you ask 30 people, "is America fascist," and 28 say yes, you can be 95% confident America is fascist.

The larger your sample, the smaller % you need to gain confidence. If you asked 1000 people, and 700 say yes, you can still be pretty confident that America is, in fact, fascist.
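As a sketch of the arithmetic behind those two cases, here's the standard normal-approximation (Wald) interval for a sample proportion; note that it's only a rough guide at small n or extreme proportions, where a Wilson interval would behave better:

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Approximate 95% (z = 1.96) Wald interval for a sample proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return max(p - z * se, 0.0), min(p + z * se, 1.0)

print(proportion_ci(28, 30))     # 28 of 30 say yes
print(proportion_ci(700, 1000))  # 700 of 1000 say yes
```

In both cases the entire interval sits above 50%, which is the sense in which the larger sample tolerates a much smaller majority: 70% of 1,000 is as convincing as 93% of 30.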

4

MidnightAdventurer t1_jee0yg1 wrote

Technically, you can say "Americans think America is fascist", not that it actually is. This is another common error people make with data: you can only draw conclusions about what you actually measured, and for surveys in particular, the way you ask the questions can have a big impact on the results.

Asking people what they think only measures what they think, not how things stand against an objective standard. To answer whether America is actually fascist, you'd need to define measurable parameters for what that means and then collect data on those parameters.

2

Plain_Bread t1_jeemogj wrote

Any sample that is truly taken at random is representative. The question is how narrow of a confidence interval you're looking for. At a sample size of 500, your 95% confidence interval for the proportion of people who answer "yes" to a question can span up to ~10%. If that's good enough for you then there's nothing wrong with that sample size.
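The ~10% figure can be checked directly; the interval is widest when the true proportion is 50%, where the standard error peaks:

```python
import math

z = 1.96  # multiplier for a 95% confidence interval
n = 500
# The standard error of a sample proportion is sqrt(p(1-p)/n),
# which is largest at p = 0.5.
half_width = z * math.sqrt(0.5 * 0.5 / n)
print(f"widest 95% CI at n={n} spans about {2 * half_width:.1%}")
```

That comes out to just under 9 percentage points end to end, consistent with the "up to ~10%" figure above.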

2

redditisadamndrug t1_jef43n1 wrote

/u/shashwathj, everyone here is answering the wrong thing.

30 respondents is a rule of thumb for when you can use the normal distribution as an approximation for the binomial distribution.

If you have a survey asking a yes or no question, there are two options, so the relevant distribution is the binomial distribution. The binomial distribution is a bit of a pain to work with, but fortunately it starts to look like the normal distribution as the number of respondents grows. To give non-statisticians a simple threshold, we say 30 respondents.

The normal distribution has a simple (by mathematician standards) equation for confidence intervals and so you can quantify the uncertainty in your survey.

There are other methods for confidence intervals with smaller sample sizes but we can't teach everyone everything.
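A rough check of how close the approximation already is at exactly n = 30, using p = 0.5 (where it behaves best) and a continuity correction:

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact binomial CDF: P(X <= k)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

n, p = 30, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))  # mean and sd of the binomial
for k in (10, 15, 20):
    exact = binom_cdf(k, n, p)
    approx = normal_cdf(k + 0.5, mu, sigma)  # continuity correction
    print(f"P(X <= {k}): exact {exact:.4f}, normal approx {approx:.4f}")
```

At n = 30 with p near 0.5 the two agree to a couple of decimal places; the approximation degrades for proportions far from 0.5, which is one reason the 30 figure is only a rule of thumb.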

1

MidnightAdventurer t1_jeg32az wrote

Sure, you obviously have a political point to make with your example...

My point was that in a conversation about statistical method, it's important to be careful that the statistics you collect and examine actually support the conclusions you draw, as this is a very common mistake.

1