Submitted by 4bedoe t3_yas9k0 in MachineLearning
Real_Revenue_4741 t1_itd229a wrote
Reply to comment by hellrail in [D] What things did you learn in ML theory that are, in practice, different? by 4bedoe
- This is merely a matter of semantics. When a theory doesn't extrapolate nicely in certain scenarios, you can state that the theory is incorrect. However, another way to view it is that there are still certain considerations that the theory is missing. It is often difficult to know exactly when a theory can and cannot be applied. Since you come from a physics background, a good example comes from the discrete symmetries behind CPT: up until 1957, physicists believed parity (P) symmetry applied to all physical laws, and until 1964 that CP symmetry could not be violated either. Both beliefs were later disproven. You can say that those symmetry principles were not applied correctly in those cases because they do not lie within the constraints we recognize today, but that is retroactively changing the story.
- Empirical risk minimization, VC dimension, and the classic bias-variance tradeoff are taught in undergrad machine learning classes and were considered well-established theory for a while. It goes without saying that there is no way to tell in advance whether a scientific theory is truly infallible or will be refuted in the future.
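To make the classic tradeoff concrete, here is a minimal sketch (the data, noise level, and degrees are invented for illustration): fitting polynomials of increasing degree to a small noisy sample drives training error down while test error eventually rises.

```python
# Minimal sketch of the classic bias-variance tradeoff (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                       # assumed "true" function for the toy data
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 200)
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)             # fit polynomial of given degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Classically: training error keeps shrinking with model capacity,
# while test error eventually grows again (overfitting).
```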
hellrail t1_itd4jqh wrote
- It is not a matter of semantics. Your interpretation includes all deviations that are caused by "wrong usage", for which I gave an extreme example. My interpretation excludes these cases. Distinguishing between wrong and correct usage can be clearly pinned to the assumptions of the experiment. To pick up my earlier example: a guy trying to predict fluid dynamics with Maxwell's equations starts from the assumption that Maxwell's equations are capable of predicting fluid dynamics. That is objectively wrong. In particular, it CANNOT be interpreted as "his assumption is correct, Maxwell's equations just lack certain aspects necessary to describe fluid dynamics". No.
The wrong usage of a theory can be pinned down to the assumptions, and these are not a matter of semantics.
- Not knowing the class you took, I can only comment on this if you link me that lecture, so that I can see what was really taught there as established theory and what was not.
In the classes I have taken or seen on the internet, I have never seen somebody state, as a global rule, that larger models, without exception, increase the danger of overfitting or similar. Such topics were discussed at most in the context of "intuition", i.e. the teacher just shared his own experience. And still, that is often true.
But I am open to seeing an example lecture that explicitly teaches this as a general rule, such that it has since been falsified.
[deleted] t1_itd7uvv wrote
[deleted]
Real_Revenue_4741 t1_itd9ziu wrote
- Even the assumptions themselves can be in contention. The point of the CPT example was to show that the assumptions that theories make often need to be revisited. Therefore, a deviation between theory and practice can, and often will, take the form of a change of assumptions about when the theory can be applied.
- http://www.ai.mit.edu/courses/6.867-f04/lectures/lecture-12-ho.pdf On the model selection slide, we can clearly see that it says to choose the model with the lowest upper bound on the expected error.
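To make that rule concrete, here is a minimal sketch of selection by a VC-style upper bound (training error plus a complexity penalty). The sample size, training errors, and VC dimensions below are invented for illustration and are not taken from the slides.

```python
# Sketch of "choose the model with the lowest upper bound on the expected error":
# bound = training error + VC complexity penalty (classic VC generalization bound).
import math

def vc_bound(train_err, vc_dim, n, delta=0.05):
    penalty = math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / delta)) / n)
    return train_err + penalty

n = 1000                               # number of training samples (assumed)
candidates = {                         # model -> (training error, VC dimension), both hypothetical
    "linear":    (0.20, 10),
    "small MLP": (0.10, 100),
    "large MLP": (0.05, 2000),
}

bounds = {name: vc_bound(err, d, n) for name, (err, d) in candidates.items()}
for name, b in bounds.items():
    print(f"{name:10s}: bound on expected error ~ {b:.3f}")
print("selected:", min(bounds, key=bounds.get))   # lowest bound, not lowest training error
```

Note that the rule can prefer a model with higher training error whenever its complexity penalty is small enough.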
hellrail t1_itdde95 wrote
- I disagree. Give me an example where the assumption is just a matter of semantics.
I state that every correct (and by that I mean scientific) formulation of assumptions can be abstracted and formalized, and even incorporated into an automated algorithm that answers whether the assumption holds w.r.t. the theory's assumptions.
Proof: take an arbitrary formulation of an assumption and convert it into a mathematical formulation. Then use Gödel numbering to formalize it.
If you now say that the conversion to a mathematical formulation can be ambiguous, I would ask you to clearly state the assumptions in a language that is suited to a scientific discussion.
- On the model selection slide, I see it is just stated that model/hyperparameter optimization aims at selecting optimal parameters. That is of course trivially true.
If you are talking about the subsequent slides, I see they introduce one idea for getting some guidance in finding the optimal settings, called Bayesian Occam's razor. Occam's razor is a HEURISTIC. That is, so to speak, the opposite of a rule/theory.
A property of a heuristic is explicitly that it does not guarantee a true or optimal solution. A heuristic can by definition not be wrong or correct. It is a heuristic: a strategy that has worked for many people in the past and might fail in many cases. A heuristic does not claim to provide a discovered rule or similar.
Now, on the last slide they even address the drawbacks of this heuristic. What more do you expect?
As I expected, this is not an example of a theory stating something that deviates from reality. It is just a HEURISTIC strategy they give you when you want to start hyperparameter search but have no clue how. That is when you fall back on heuristics (please look up heuristics on Wikipedia), and I bet this proposed heuristic is not the worst you can do even today, now that more knowledge has been acquired.
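For anyone unfamiliar with the heuristic being discussed, here is a minimal toy sketch of Bayesian Occam's razor via marginal likelihoods; the coin-flip setup is invented for illustration and is not taken from the lecture.

```python
# Toy Bayesian Occam's razor: compare the evidence (marginal likelihood) of a
# simple model (fair coin, no free parameter) against a flexible one (unknown
# bias with a uniform prior). The flexible model "pays" for its extra freedom.
from math import comb, lgamma, log

def log_evidence_fair(heads, tails):
    n = heads + tails
    return log(comb(n, heads)) + n * log(0.5)

def log_evidence_biased(heads, tails):
    # Integrate the bias out: p(D) = C(n, h) * B(h + 1, t + 1), B = Beta function.
    n = heads + tails
    log_beta = lgamma(heads + 1) + lgamma(tails + 1) - lgamma(n + 2)
    return log(comb(n, heads)) + log_beta

for heads, tails in [(52, 48), (80, 20)]:
    la, lb = log_evidence_fair(heads, tails), log_evidence_biased(heads, tails)
    winner = "fair coin" if la > lb else "biased coin"
    print(f"{heads}H/{tails}T: fair {la:.2f}, biased {lb:.2f} -> prefers {winner}")
# Roughly balanced data favours the simpler model; clearly skewed data favours the flexible one.
```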
Real_Revenue_4741 t1_itddwyd wrote
I believe you are looking at the wrong slides. Reddit did something weird with the hyperlink
hellrail t1_itde28m wrote
Then please point me to the right slide by giving the slide number.
Real_Revenue_4741 t1_itdehtz wrote
It should be from MIT (try copying/pasting the address linked above)
hellrail t1_itewyaw wrote
One thing I must add regarding the topic of presentation as "established knowledge".
The lecture you quoted is lecture number 12. It is embedded in a course; there are of course lectures 11, 10, 9, etc. If you check these, which are also accessible by slightly modifying the given link, you see the context of this lecture. Specifically, a number of classifiers are explicitly introduced, and the VC-dimension theory in lecture 12 is still valid for those. The course does not address deep networks yet.
So it is a bit unfair to say this lecture teaches you a theory that deviates from practice. It does not deviate for the classifiers introduced there.
hellrail t1_itdlgvb wrote
Ok found the right one.
Well, generally I must say: good example. I accept it at least as a very interesting example to talk about, worth mentioning in this context.
Nevertheless, it is still valid for all non-CNN, non-ResNet, non-transformer models.
Taking into account that it is based on an old theory (pre-1990), when these deep networks did not exist yet, one might take its limitations into account (it does not try to model effects taking place during the training of such complex deep models, which was not a topic back then).
So if I were really mean, I would say you can't expect a theory to make predictions about entities (in this case modern deep networks) that had not been invented yet. One could say that the VC-dimension theory's assumptions include the assumption of a "perfect" learning procedure (and therefore exclude any dynamic effects of the learning procedure), which still holds for decision trees, random forests, SVMs, etc., which have their relevance for many problems.
But since I'm not that mean, I admit that these observations on modern networks do undermine the practicality of the VC-dimension view for deep networks of the mentioned types, and that must have been a moderate surprise before anyone tried out whether VC dimensions work for CNNs/ResNets/transformers. Therefore, good example.
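To put a rough number on that limitation, here is a back-of-the-envelope sketch using the same classical bound as above; the dataset size and capacities are order-of-magnitude guesses, not measurements.

```python
# Once the (assumed) VC dimension approaches the dataset size, the classical bound
# exceeds 1 and becomes vacuous, even though such networks reach low test error in practice.
import math

def vc_bound(train_err, vc_dim, n, delta=0.05):
    return train_err + math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / delta)) / n)

n = 1_200_000                                 # roughly ImageNet-scale training set (assumed)
for vc_dim in [10_000, 100_000, 1_000_000]:   # hypothetical effective capacities
    print(f"VC dim {vc_dim:>9,}: bound ~ {vc_bound(0.0, vc_dim, n):.2f}")
# A ResNet/transformer-scale parameter count is larger still, so the bound says
# essentially nothing about the generalization we actually observe.
```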