Submitted by muunbo t3_y1pui4 in deeplearning
I'm an embedded SW dev who once helped a company optimize their data pipeline so they could do computer vision on an edge device (an Nvidia Jetson, in case you were curious).
I'm wondering, is this a common issue for companies? I've heard that ML inference is increasingly moving to edge devices instead of running in the cloud. How do companies deal with having to optimize everything to run on a low-power, low-RAM device instead of the usual power-hungry desktops or cloud services?
konze t1_irzfxlz wrote
I'm part of a group working on exactly this. Currently, it's quite a mess because each HW vendor provides its own tooling for deploying on their devices, which leads to a lot of problems (e.g., missing support for certain layers). One of the most promising tools for edge deployment is TVM, combined with Neural Architecture Search (NAS), where the network is tailored to a specific use case and the available resources.