Viewing a single comment thread. View all comments

vannak139 t1_j7w5otz wrote

So, the simple strategy here, which kind of ignores your variable length objects, is to simply classify CNN receptive fields directly, and then Max Pool the multiple classification frames.

So, lets say that your sequence is 1024. You build a CNN that has a receptive field of 32, and a stride of 16. This network applied to the sequence will offer something like 63 "frames". Typically, the CNN would expand this network representation up with a large number of channels, take the GlobalMaxPooling to merge these frame's information, and then classify the sample.

Instead, you should classify the frames directly, meaning your output looks like 63 separate sigmoid classifications associated with regions of the signal. Then, you simply take the maximum of each classification likelihood, and use this for your image-level classification.

After training, you can remove the GlobalMaxPooling layer, and look at the segment classifications directly.

1