robbsc

robbsc t1_iziz6ie wrote

I don't have the time to figure it out, but I'm pretty sure you can do it through some combination of permutations and reshapes. Play around with an NxN numpy array (e.g. np.arange(8**2).reshape(8,8)) and perform various transposes and reshapes and see what comes out. You might have to add and remove an axis at some point too.
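As a rough sketch of the kind of experiment meant here: the original question isn't shown, so the target layout below (cutting the 8x8 array into 2x2 tiles) is purely an assumed example, and the names `blocks`, `restored` and `with_axis` are made up for illustration.

```python
import numpy as np

a = np.arange(8 ** 2).reshape(8, 8)

# Split each axis into (tile index, offset within tile), then move the two
# tile indices to the front: blocks[i, j] is the 2x2 tile at tile-row i,
# tile-column j, i.e. a[2*i:2*i+2, 2*j:2*j+2].
blocks = a.reshape(4, 2, 4, 2).transpose(0, 2, 1, 3)

# Undoing the permutation and the reshape gives back the original array.
restored = blocks.transpose(0, 2, 1, 3).reshape(8, 8)

# Adding and removing a length-1 axis, in case an intermediate step needs it.
with_axis = a[:, np.newaxis, :]      # shape (8, 1, 8)
without_axis = with_axis.squeeze(1)  # back to shape (8, 8)
```

Printing `blocks[0, 0]` or `blocks[1, 2]` after each change is a quick way to see what a given transpose actually did.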

1

robbsc t1_iycws54 wrote

For tensorflow, you have to learn to use tensorflow datasets. https://www.tensorflow.org/datasets

You could also save your dataset as an HDF5 file using h5py, then use tensorflow_io's IODataset.from_hdf5() to load your data. https://www.tensorflow.org/io

HDF5 is the "traditional" (for lack of a better word) way of loading numpy data that is too big to fit in memory. The downside is that it is slow at random indexing, so people don't use it as much anymore for training networks.
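A minimal sketch of the h5py side of this, assuming h5py is installed. The file name `train.h5`, the dataset name `features`, the chunk size and the helper `iter_batches` are all hypothetical, and the array here is tiny just so the example runs. It reads contiguous slices rather than random indices, since that is the access pattern HDF5 handles well.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "train.h5")

# Write the array to disk once. Chunking by rows keeps contiguous reads cheap.
data = np.arange(1000 * 4, dtype=np.float32).reshape(1000, 4)
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=data, chunks=(100, 4))


def iter_batches(path, name, batch_size):
    """Yield consecutive row slices; only one batch is in memory at a time."""
    with h5py.File(path, "r") as f:
        dset = f[name]
        for start in range(0, dset.shape[0], batch_size):
            # Slicing an h5py dataset reads just these rows from disk.
            yield dset[start:start + batch_size]


batches = list(iter_batches(path, "features", 256))
```

In real use you would loop over `iter_batches(...)` directly instead of collecting the batches into a list.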

Pytorch datasets are a little easier in my opinion.

8

robbsc t1_it7zqsh wrote

One of the main reasons to use a ROC curve is for imbalanced (usually binary) datasets. A more intuitive way to look at FPR is FP/N, the fraction of actual negatives that get flagged as positive. The curve tells you the fraction of false positives you are going to pass through for any given TPR (recall, sensitivity). If the FPR you care about is tiny, you can focus on the left side of the curve and ignore the right side.

It's also useful to sample the ROC curve at recalls you care about, e.g., how many false positives am I passing through for a TPR of 95%?
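Sampling the curve at a chosen recall can be sketched in plain numpy like this. The helpers `roc_points` and `fpr_at_tpr` are made-up names, the labels and scores are toy values, and the code assumes all scores are distinct (tied scores would need to be grouped into a single threshold).

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) after admitting examples one by one, highest score first."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    labels = labels[order]
    tp = np.cumsum(labels)    # true positives passed so far
    fp = np.cumsum(~labels)   # false positives passed so far
    tpr = tp / labels.sum()       # TP / P
    fpr = fp / (~labels).sum()    # FP / N
    return fpr, tpr


def fpr_at_tpr(labels, scores, target_tpr):
    """FPR at the first operating point whose TPR reaches target_tpr."""
    fpr, tpr = roc_points(labels, scores)
    idx = np.searchsorted(tpr, target_tpr, side="left")
    return fpr[idx]


# Toy data: 4 positives and 4 negatives.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
fpr_75 = fpr_at_tpr(labels, scores, 0.75)
```

On real data you would call something like `fpr_at_tpr(labels, scores, 0.95)` and multiply by the number of negatives to get a raw false positive count.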

Lastly, in my experience, a higher AUC correlates highly with an improved model, because most of the right side of the curve doesn't tend to change much and sits close to 1 in situations where you're just trying to improve the left side of the curve. If it doesn't, then you probably just need to change the number of thresholds you're sampling when computing AUC.
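To illustrate the threshold point, here is a sketch that assumes AUC is approximated on an evenly spaced grid of score thresholds in [0, 1] (roughly the style of approximation behind a `num_thresholds` setting such as the one on `tf.keras.metrics.AUC`). The function `auc_on_grid` and the toy scores are hypothetical. When the scores all bunch up in a narrow range, a coarse grid cannot separate them and the estimate is badly off.

```python
import numpy as np


def auc_on_grid(labels, scores, num_thresholds):
    """Trapezoidal ROC AUC using num_thresholds evenly spaced thresholds in [0, 1]."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    # pred[k, i] is True when example i is called positive at threshold k.
    pred = scores[None, :] >= thresholds[:, None]
    tpr = (pred & labels).sum(axis=1) / labels.sum()
    fpr = (pred & ~labels).sum(axis=1) / (~labels).sum()
    # Thresholds increase, so FPR decreases; reverse to integrate left to right.
    fpr, tpr = fpr[::-1], tpr[::-1]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Perfectly separable toy scores, all squeezed below 0.1.
grid_labels = [0, 0, 0, 0, 1, 1, 1, 1]
grid_scores = [0.01, 0.02, 0.03, 0.04, 0.06, 0.07, 0.08, 0.09]

coarse = auc_on_grid(grid_labels, grid_scores, 2)    # only thresholds 0 and 1
fine = auc_on_grid(grid_labels, grid_scores, 201)    # grid step of 0.005
```

Here `coarse` comes out at chance level while `fine` recovers the perfect separation, even though the model scores are identical in both calls.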

Whether to use ROC or precision-recall depends more on the type of problem you're working on. Obviously precision-recall is better for information retrieval, because you care about what fraction of the information retrieved at a given threshold is useful. ROC is better if you care highly about the raw number of false positives you're letting through.

3