I’m currently starting to pick up ML, coming from a quant-focused social science background. I am wondering what I am and am not allowed to do in EDA (on the whole data set) to avoid "data leakage", i.e. information gain that might eventually ruin my predictive model.
Specifically, I am wondering about running linear regressions in the data inspection phase (this is what I often did in my previous work, which was more about hypothesis testing than prediction).
From what I have read and understood, one shouldn’t really do that, because too much information might be obtained, which could lead me to change my model in a way that ruins its predictive power? However, in the course I am doing (Jose Portilla's DS Masterclass) they regularly look at the correlations before separating train/test samples. But linear regressions are essentially also just (multiple/corrected) correlations, so I am a bit confused about where to draw the line in EDA. Thanks!
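To make the question concrete, here is a minimal sketch of the "split first, then explore" workflow, using only the Python standard library. Everything in it is an assumption for illustration: the data are synthetic, the 80/20 split is arbitrary, and the helpers `pearson` and `ols_slope` are hypothetical names, not taken from the course. It also shows why a one-predictor regression and a correlation carry much the same information: the slope equals the correlation rescaled by the two standard deviations.

```python
import random
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(x, y):
    """Slope of a one-predictor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Synthetic data, purely illustrative (not from any real data set).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 * a + rng.gauss(0, 1) for a in x]

# Split first: shuffle the row indices and hold out 20% as the test set.
idx = list(range(len(x)))
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]
x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# EDA on the training rows only; the held-out rows are never inspected.
r_train = pearson(x_train, y_train)
slope_train = ols_slope(x_train, y_train)

# With one predictor: slope = r * sd(y) / sd(x).
rescaled_r = r_train * pstdev(y_train) / pstdev(x_train)

print(f"train correlation: {r_train:.3f}")
print(f"train OLS slope:   {slope_train:.3f}")
print(f"r * sd_y / sd_x:   {rescaled_r:.3f}")
```

The last two printed values agree up to floating-point error, which is the sense in which a simple regression is "just" a correlation. The sketch does not settle how much looking at the full data set actually hurts in practice; it only shows that the same inspection can be done on the training portion alone.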
eltorrido23 t1_j6c4bwq wrote
Reply to [D] Simple Questions Thread by AutoModerator