Submitted by fromnighttilldawn t3_10yfp35 in MachineLearning

I was just looking at some papers published by statisticians, and I couldn't help but notice that the flavor of their research is vastly different. For example, one researcher wrote about a dozen papers on LASSO alone over the span of a decade, whereas LASSO gets maybe a single PowerPoint slide's worth of attention in ML. Why is there such a disparity and divergence in the aims of these disciplines?

Are there any good critiques of these research fields from each other's perspective (not just on the technical aspects)? Perhaps by someone who works in both?

41

Comments

currentscurrents t1_j7xv6j3 wrote

Stats is tremendously useful, especially when your dataset is small by ML standards. Basically every scientific paper relies on statistics to tell you whether or not its result is meaningful.

ML is great when you have millions of data points, but when you only have a hundred it's not going to help you.
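
A quick sketch of what that looks like in practice (the numbers here are synthetic, purely for illustration): with on the order of a hundred data points, a paper would typically report something like a t-test rather than fit a big model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two small groups, e.g. treatment vs. control (synthetic data)
treatment = rng.normal(loc=1.2, scale=1.0, size=50)
control = rng.normal(loc=1.0, scale=1.0, size=50)

# Welch's t-test: is the observed difference meaningful,
# or could it plausibly be noise?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```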

6

trutheality t1_j7xvn75 wrote

Actually the opposite. Stats is how you design studies, which is what governments, the economy, pharma, the medical field, and most sciences run on.

ML is just used for predictive modeling in low-stakes situations and fun tech demos.

3

currentscurrents t1_j7y4073 wrote

>Right now basically all progress is with large models,

You mean all progress... in machine learning. A lot of scientific fields necessarily have to make do with far fewer data points.

You can't test a new drug on a million people, especially in early-phase trials. Even outside of medicine, you may have very few samples if you're studying a rare phenomenon.

Statistics gives you tools to make limited conclusions from small samples, and also measure how meaningful those conclusions actually are.
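
To illustrate the "limited conclusions" part (numbers synthetic, purely illustrative): a t-based confidence interval widens as the sample shrinks, which is exactly the "how meaningful is this" measurement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in (10, 30, 100):
    sample = rng.normal(loc=5.0, scale=2.0, size=n)
    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    # 95% confidence interval from the t-distribution (n - 1 dof)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
    print(f"n={n:4d}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The smaller n is, the wider the interval: the conclusion is still available, just appropriately weaker.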

6

Jemimas_witness t1_j7y68en wrote

This is only correct for certain problems; like everything, it has its best use cases. When all you have is a hammer, everything looks like a nail.

In medicine, the clinical trial results that change the field often rest on 2000-3000 patients (data points), and groundbreaking advances in medical practice are often made with simple statistics and simple methods. Go to the New England Journal of Medicine and pick any trial: the weight of its conclusions rests on survival functions, hazard ratios, and chi-squared statistics. Then go look at the funding section; these projects are funded with millions. The only disciplines in medicine with ML-scale datasets are epidemiology and claims-level data, which strays well into econometrics.
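
To make the "simple statistics" point concrete, here is the kind of chi-squared test such a trial might report; the counts below are invented for illustration, not from any real trial.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 trial outcome table (counts are made up):
# rows: treatment / control arms, columns: event / no event
table = np.array([[120, 1380],   # treatment arm, 1500 patients
                  [170, 1330]])  # control arm,   1500 patients

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")

# A crude risk ratio between arms, for flavor (not a hazard ratio,
# which would need time-to-event data and e.g. a Cox model)
print(f"risk ratio = {(120 / 1500) / (170 / 1500):.2f}")
```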

I myself study rare diseases as well as AI/ML applications in medicine, and for some projects I'd be stoked to get 80 patients, because there simply aren't that many around.

2

sunbunnyprime t1_j7y86w7 wrote

Good question.

An ML Researcher is typically trying to find models that are more powerful in terms of output behavior, whether that be predictive power, generative ability, etc.

A Statistical Researcher is typically trying to understand the dataset, the underlying generative distribution, and really dig into what the model’s innards are saying about the data and what you can conclude from it. They’re more likely to want to extract insight about the data itself.

Statisticians tend to be more rigorous about data and better grounded, in my experience, while ML Scientists tend to want to push boundaries and be the person who's read the latest ML journal piece.

There’s so much you can say and know about something as simple as linear regression. There’s really a lot of fascinating math in there that goes so much deeper than you might expect.

If you're interested in just using models to predict, there's not that much of interest in a linear model. If you really want to know what meaning you can extract from what's going on inside - exactly why it learns the coefficients it does, what the learning dynamics are, what the results mean, etc. - then you might end up writing 10 papers on Lasso.
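
As one concrete example of the depth hiding in there, the entire Lasso coefficient path as a function of the penalty is itself an object of study: which coefficients enter, in what order, and how they shrink. A minimal sketch with scikit-learn on synthetic data (all numbers invented):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)

# Synthetic regression: 5 informative coefficients out of 20
n, p = 200, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# Coefficients along a decreasing grid of penalties
alphas, coefs, _ = lasso_path(X, y)
for i in (0, len(alphas) // 2, len(alphas) - 1):
    nonzero = int(np.sum(np.abs(coefs[:, i]) > 1e-8))
    print(f"alpha={alphas[i]:.3f}: {nonzero} nonzero coefficients")
```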

Both sides are valid. Most ML scientists suck at their jobs, though, I must say.

31

Ulfgardleo t1_j7y8hdg wrote

The difference between stats and ML is as large as the difference between math and applied math. They aim to answer vastly different questions. In ML you don't care about identifiability, because you don't care whether there is one gene among 2 million that causes a specific type of cancer; that is not what ML is about. In ML you also very rarely care about tail risk (you should) and almost never about calibration (you really should). And identifiability goes out the window as soon as you use neural networks, which prevents you from interpreting your models.
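
For anyone unfamiliar with calibration, checking it is cheap, which makes the neglect more striking. A sketch using scikit-learn's reliability-curve helper (synthetic problem, illustrative only): among samples predicted with probability ~p, did the event actually occur ~p of the time?

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, purely illustrative
X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Reliability curve: predicted probability vs. observed frequency
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```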

22

WikiSummarizerBot t1_j7y9nn5 wrote

All models are wrong

>All models are wrong is a common aphorism in statistics; it is often expanded as "All models are wrong, but some are useful". The aphorism acknowledges that statistical models always fall short of the complexities of reality but can still be useful nonetheless. The aphorism originally referred just to statistical models, but it is now sometimes used for scientific models in general. The aphorism is generally attributed to the statistician George Box.


1

AdFew4357 t1_j7yafw0 wrote

Statisticians care about inference. ML scientists care about the model specifically.
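
A rough sketch of that split in code (toy data, and the framing is mine): the statistician reaches for the inferential summary, while the ML workflow mostly consumes the predictions.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Inference view: coefficient estimates with standard errors,
# t-statistics, p-values, and confidence intervals
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# Prediction view: fit, predict, move on
model = LinearRegression().fit(X, y)
print(model.predict(X[:3]))
```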

−9

I-am_Sleepy t1_j7ybb41 wrote

I don't think ML researchers don't care about model calibration or tail risks; it just often doesn't come up in experimental settings.

It also depends on the objective. If your goal is regression or classification, then tail risk and model calibration might be necessary as supporting metrics
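
As one example of a tail-risk supporting metric, here's a sketch of expected shortfall (CVaR): the mean loss over the worst cases rather than the overall average. The loss distribution below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-sample losses with a heavy right tail
losses = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def expected_shortfall(losses, alpha=0.95):
    """Mean loss over the worst (1 - alpha) fraction of samples."""
    var = np.quantile(losses, alpha)  # value-at-risk cutoff
    return losses[losses >= var].mean()

print(f"mean loss: {losses.mean():.2f}")
print(f"95% expected shortfall: {expected_shortfall(losses):.2f}")
```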

But for more abstract use cases such as generative modeling, it is debatable whether tail risk and model calibration actually matter. For example, GAN models can experience mode collapse, such that the generated data isn't as diverse as the original data distribution. But that doesn't mean the model is total garbage either.

Also, I don't think statistics and ML are totally different, because most statistical fundamentals are also ML fundamentals. As such, many ML metrics derive directly from fundamental statistics and/or related fields.

13

Ulfgardleo t1_j7yd02x wrote

You are right, but the point I was making is that in ML, in general, those are not of high importance, and this already holds for rather basic questions like:

"For your chosen learning algorithm, under which conditions holds that: in expectation over all training datasets of size n, the Bayes risk is not monotonously increasing with n"

One would think that this question is of rather central importance. Yet no one cares, and answering it is non-trivial even for linear classification. Stats cares a lot about this question. While the math behind both fields is the same (all applied math is a subset of math, except if you ask people who identify as one of the two), the communities have different goals.
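
One way to make that question precise (the notation is mine, not the commenter's, reading the risk in question as the expected risk of the learned hypothesis): with a learner A trained on an i.i.d. sample S_n ~ D^n,

```latex
% Risk of a hypothesis h under data distribution D and loss \ell:
R(h) = \mathbb{E}_{(x,y) \sim D}\,\ell(h(x), y)

% The question: for which learners A and distributions D does
\mathbb{E}_{S_{n+1} \sim D^{n+1}}\bigl[ R(A(S_{n+1})) \bigr]
  \;\le\;
\mathbb{E}_{S_n \sim D^n}\bigl[ R(A(S_n)) \bigr]
  \quad \text{hold for all } n\,?
% That is: when is more training data guaranteed not to hurt in expectation?
```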

6

Any_Geologist9302 t1_j7yg7wn wrote

That’s kind of an odd question because many statisticians are actively doing research in ML.

3

jimmymvp t1_j7yubak wrote

A pretty famous stats professor once told me that he should've switched to ML a long time ago. Now he does ML research, obviously very rigorous. He said that stats is making up questions that are to a large extent not practically useful.

8

canbooo t1_j7z0lku wrote

I agree with the size of the difference, yet disagree with the examples, since there is ML research that considers all three (causal ML, conformal prediction/forecasting, AI safety, reliability, etc.). I think the difference is more like deduction versus induction, in the sense that the processes for finding the answers differ. Since I'm finishing pooping on corporate time, I will keep this short:

ML: Data -> Method -> Hypothesis -> Answers

Statistics: Hypothesis -> Method -> Data -> Answers

This may be too simplistic, and please propose a better distinction, but do not postulate that ML does not care about the things statistics does.

0

Appropriate-Code-940 t1_j7z700v wrote

A very simple idea, maybe not correct: ML is more data-driven, statistics is more hypothesis-driven. Like two different streams, they join into the same river and cannot be separated again.

1

ml-anon t1_j7z97fq wrote

You will find the same thing in ML too, and at some point folks might find it quaint that people spent their whole careers dicking about with convnets, once those are reduced to a historical footnote by whatever comes after Transformers.

1

jimmymvp t1_j806dx2 wrote

Just communicating what I've heard. Nevertheless, I think the whole interpretable ML community (at the very least) would disagree with you on this one :). Reducing ML to "plug and chug"... well, that speaks for itself :D

3

AdFew4357 t1_j806plm wrote

The whole landscape of ML research is a hunt to chase SOTA by tweaking an architecture here or using a different optimizer there, then squeezing out 0.2% more accuracy on some well-known imaging dataset in an attempt to churn out papers. That's not science, if you ask me.

−1

slashdave t1_j80hs32 wrote

Different goals and different tools

1

OkCandle6431 t1_j811ol6 wrote

Where I'm at, 'statistics' is what my co-workers and I call what we do, and 'machine learning' is what goes in the grant application. I'm sure this differs across regions/faculties/industries/whatever.

1

jimmymvp t1_j83v503 wrote

I'm not sure you have a good overview of ML research if this is your claim. It sounds like you've read too many blog posts on Transformers. I'd suggest going through some conference proceedings to get a proper overview; there's some pretty rigorous (not just stats) stuff out there. I agree, though, that a substantial subset of ML research works on tweaking and pushing the boundaries of what existing methods can achieve, which for me personally is exciting to see! A lot of cool stuff came out of scaling up and tweaking architectures.

2

sunbunnyprime t1_j8bpqov wrote

Most ML scientists aren't actually fluent in the application of the algorithms they use. They have a superficial understanding; they're slow and buggy programmers who write slow code; they spend months working on models that should take a few days to put together; they over-index on hyperparameter selection and tuning and on playing with new algorithms; and they don't know how to validate their models, so they end up deploying garbage that is often literally no better than a coin flip. But they're great at convincing people that they're right on the cusp of solving a really big problem and adding a ton of value, which buys them enough time to fart around for a few years, get another job with a 30% raise, and do it all over again.
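
For what it's worth, the "no better than a coin flip" failure is cheap to catch: always score against a trivial baseline on held-out data. A minimal sketch with scikit-learn (toy data standing in for the real problem):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for whatever the real problem is
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The "coin flip" baseline every deployed model should beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

print(f"baseline accuracy: {baseline.score(X_te, y_te):.3f}")
print(f"model accuracy:    {model.score(X_te, y_te):.3f}")
```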

−2