trnka t1_j6583q3 wrote on January 27, 2023 at 8:15 PM

Reply to comment by InsidiousApe in [D] Simple Questions Thread by AutoModerator

If you're ingesting from an API, typically the limiting factor is the number of API calls or network round trips. So if there's a "search" API or anything similar that returns paginated data, that'll speed it up a LOT, because each call brings back many records at once.
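To make the round-trip point concrete, here's a rough sketch of walking a paginated endpoint. Everything in it is made up for illustration: `fetch_page`, `fake_search`, and the in-memory `FAKE_DB` stand in for whatever the real API offers, and real APIs often use cursors or tokens rather than page numbers.

```python
from typing import Callable, Iterator


def fetch_all(fetch_page: Callable[[int, int], list], page_size: int = 100) -> Iterator[dict]:
    """Yield every record by walking pages until a short (or empty) page comes back."""
    page = 0
    while True:
        records = fetch_page(page, page_size)
        yield from records
        if len(records) < page_size:
            break  # last page reached
        page += 1


# Hypothetical in-memory "API" so the sketch runs offline.
FAKE_DB = [{"id": i} for i in range(250)]
call_count = 0


def fake_search(page: int, page_size: int) -> list:
    """Pretend network call: returns one page of records and counts the round trip."""
    global call_count
    call_count += 1
    start = page * page_size
    return FAKE_DB[start:start + page_size]


games = list(fetch_all(fake_search, page_size=100))
print(len(games), "records in", call_count, "calls")  # 250 records in 3 calls
```

Fetching the same 250 records one at a time would take 250 round trips instead of 3, which is where most of the time goes.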

If you need to traverse the API to crawl data, that'll slow it down a lot. Like say if there's a "game" endpoint, a "player" endpoint, a "map" endpoint, etc.

If you're working with image data, fetching the images is usually a separate step that can be slow.

After that, if you can fit it in RAM you're good. If you can fit it on one disk, there are decent libraries with each ML framework to efficiently load from disk in batches, and you can probably optimize the disk loading too.

----

What you're describing is usually called exploratory data analysis, but the right approach depends on the general direction you want to go in. If you're trying to identify people with thyroid cancer earlier, for example, you might want to compare the data of recently diagnosed people to similar people who have been tested and found not to have thyroid cancer. Personally, in that situation I like to just train a logistic regression model to predict the diagnosis from various patient properties, then check if it's predictive on a held-out data sample. If it is, I'll look at the coefficients of the features to understand what's going on, then work to improve the features.
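A minimal sketch of that workflow, assuming scikit-learn and NumPy are available. The data is synthetic and the feature names (`age`, `tsh_level`, `nodule_size_mm`) are invented purely for illustration, not real clinical predictors. I standardize the features here so the coefficients are roughly comparable to each other; that's my own choice, not a requirement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: every name and number below is made up.
rng = np.random.default_rng(0)
n = 2000
feature_names = ["age", "tsh_level", "nodule_size_mm"]
X = rng.normal(size=(n, 3))
# Hypothetical ground truth: only the 2nd and 3rd features affect the label.
logits = 1.5 * X[:, 1] + 1.0 * X[:, 2] - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Hold out a quarter of the data to check whether the model is predictive.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the held-out split.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

auc = roc_auc_score(y_test, model.predict_proba(scaler.transform(X_test))[:, 1])
print(f"held-out AUC: {auc:.3f}")

# Coefficients sorted by magnitude, largest first.
coefs = dict(zip(feature_names, model.coef_[0]))
for name, coef in sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>15}: {coef:+.3f}")
```

An AUC well above 0.5 on the held-out split suggests the features carry signal, and the sign and size of each coefficient hint at which ones are doing the work.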

Another simple thing you can do, if the data is small enough and tabular rather than text/image/video/audio, is to load it up in Pandas, run .corr, and check the correlations with the column you care about (has_thyroid_cancer).
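Something like the following, as a sketch on made-up data. The columns other than has_thyroid_cancer are hypothetical, and the relationship between them is invented so there is something to find.

```python
import numpy as np
import pandas as pd

# Synthetic table: column names and values are made up for illustration.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(50, 12, n),
    "tsh_level": rng.normal(2.0, 1.0, n),
})
# Invented relationship: the label depends on tsh_level plus noise, not on age.
df["has_thyroid_cancer"] = (df["tsh_level"] + rng.normal(0, 1.0, n) > 3.0).astype(int)

# Pearson correlation of every column with the target, strongest first.
correlations = (
    df.corr()["has_thyroid_cancer"]
    .drop("has_thyroid_cancer")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Keep in mind .corr only picks up linear relationships by default, so a low value doesn't rule out a more complicated dependence.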

Hope this helps! Happy to follow up too.

InsidiousApe t1_j658w0e wrote on January 27, 2023 at 8:20 PM

This was exactly the kind of answer I was hoping for - a great place to start more research. Thanks!