Submitted by AutoModerator t3_10cn8pw in MachineLearning
trnka t1_j6ce4td wrote
Reply to comment by eltorrido23 in [D] Simple Questions Thread by AutoModerator
I try not to think of it as right versus wrong, but in terms of risk. If you have a big data set, do EDA over the full thing before splitting off the test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.
If you have a small data set and do EDA over the full thing, there's more risk that your findings are influenced by the data you haven't held out yet.
In real-world problems, though, ideally you're getting more data over time, so your test data will change and it won't be as risky.
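If you want to avoid the small-data risk entirely, the lower-risk order is to hold out the test rows first and do EDA only on what's left. Here's a minimal sketch of that using only the standard library. The helper names (`split_holdout`, `summarize`), the 20% test fraction, and the toy data are all made up for illustration, not anything from the original question.

```python
import random
import statistics


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split off a held-out test set.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # First n_test shuffled rows become the test set, the rest is train.
    return shuffled[n_test:], shuffled[:n_test]


def summarize(values):
    """A stand-in for EDA: basic summary statistics of one numeric column."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
    }


# Toy data (assumption): a single numeric feature with 50 rows.
rows = [float(x) for x in range(50)]

# Split first, then explore only the training portion.
train, test = split_holdout(rows)
train_summary = summarize(train)
print(train_summary)
```

The test rows never go through `summarize`, so nothing you learn during EDA can come from them. With a big data set you could reasonably skip this and accept the small risk.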