trnka t1_j6ce4td wrote

I try not to think of it as right versus wrong, but in terms of risk. If you have a big data set, do EDA over the whole thing before splitting off the test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.

If you have a small data set and do EDA over the whole thing, there's more risk that your findings and modeling choices get shaped by the data you'll later hold out.
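If you want to take the lower-risk route, a rough sketch is to split first and only look at the training part during EDA. This is just an illustration with made-up toy data; `split_holdout`, the 20% test fraction, and the seeds are my own assumptions, not anything standard.

```python
import random
import statistics


def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split off a held-out test set.

    Hypothetical helper for illustration; returns (train, test).
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up toy data: (feature, label) pairs.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.randint(0, 1)) for _ in range(100)]

# Split BEFORE looking at anything.
train, test = split_holdout(data, test_fraction=0.2, seed=0)

# EDA only touches the training rows; the test rows stay unseen.
train_features = [x for x, _ in train]
train_mean = statistics.mean(train_features)
train_stdev = statistics.stdev(train_features)

print(f"train rows: {len(train)}, test rows: {len(test)}")
print(f"train feature mean: {train_mean:.3f}, stdev: {train_stdev:.3f}")
```

With a big data set the summary stats on the training part will look almost the same as on the full data, which is why the risk is low there. With a small one they can differ noticeably, and that's where splitting first matters more.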

In real-world problems, though, ideally you're getting more data over time, so your test data will change and it won't be as risky.
