trnka t1_j6ce4td wrote on January 29, 2023 at 9:26 AM

Reply to comment by eltorrido23 in [D] Simple Questions Thread by AutoModerator

I try not to think of it as right versus wrong, but in terms of risk. If you have a big data set, do EDA over the full thing before splitting off the test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.

If you have a small data set and do EDA over the full thing, there's more risk that your findings are influenced by the rows that will later become your test data.
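If you want to remove that risk entirely, one option is to hold out the test rows first and only explore the rest. Here's a minimal sketch of that idea using just the standard library. The `split_holdout` helper, the 20% test fraction, and the synthetic `values` are all made up for illustration, not something from the original question:

```python
import random
import statistics


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split off a held-out test set.

    Hypothetical helper for illustration; returns (train, test).
    """
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic stand-in for a small data set (assumption: 50 numeric values).
data_rng = random.Random(42)
values = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

# Split first, so the test rows are never looked at during EDA.
train, test = split_holdout(values, test_fraction=0.2, seed=0)

# EDA on the training portion only.
train_summary = {
    "n": len(train),
    "mean": statistics.mean(train),
    "stdev": statistics.stdev(train),
}
print(train_summary)
```

The tradeoff is that with a small data set you're exploring even fewer rows, so the summary statistics themselves get noisier.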

In real-world problems, though, ideally you're getting more data over time, so your test data will change and looking at the current data early won't be as risky.