mikljohansson
mikljohansson t1_j87y870 wrote
Trying to teach my daughter and her cousins a bit about programming and machine learning. We're building a simple robot with an object detection model and Scratch block programming, so they can get it to chase after the objects it recognises. It works fine, but the kids seem to enjoy driving the robot around by remote control and looking through its camera more than programming it 😅 There's an image in the repo readme.
mikljohansson t1_j7p0o1o wrote
Reply to comment by ramv0001 in [Discussion] Best practices for taking deep learning models to bare metal MCUs by ramv0001
Nope, haven't used any emulators for this project. The ESP32 hardware I've been using is so cheap and convenient to use that there's been no need
mikljohansson t1_j7ok819 wrote
What kind of MCU are you targeting? It depends a lot on the capabilities of the MCU: how fast it is, how much memory it has, whether it has a dedicated NPU/TPU, vector instructions, ...
mikljohansson t1_j7ojjjm wrote
I have been building a PyTorch > ONNX > TFLite > TFMicro toolchain for a project to get a vision model running on an ESP32-CAM with PlatformIO and the Arduino framework. Perhaps it could be of use as a reference:
https://github.com/mikljohansson/mbot-vision
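As a rough illustration of the first step of that toolchain, here's a minimal sketch of exporting a small PyTorch model to ONNX. The model definition, input size, opset and file names are placeholders, not the actual mbot-vision code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the actual detection model
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
)
model.eval()

# Dummy input matching the camera resolution you intend to run at
dummy = torch.zeros(1, 3, 96, 96)

# Export to ONNX; opset and tensor names are assumptions, adjust to your setup
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=13,
    input_names=["image"],
    output_names=["detection"],
)
```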
Some caveats to consider when embarking on this kind of project:

- PyTorch/ONNX uses a channels-first memory format, while TensorFlow is channels-last. Converting the model with onnx-tf inserts lots of Transpose ops into the graph, which decreased performance (by about 3x for my model) and increased memory usage. I'm using the onnx2tf module instead, which also converts operators to channels-last.
- You may want to fully quantize the model to int8, since fp16/fp32 is really slow on smaller MCUs, especially those lacking FPUs and vector instructions. Also watch out for Quantize/Dequantize ops in the converted graph: they mean some op didn't support quantization and had to be wrapped and executed (slowly) in fp16/fp32 (there's a minimal quantization sketch after this list).
- There may be a lot of performance to gain from hardware-optimized kernels, but it depends on the MCU and on which operators your model uses. For example, on ESP32 there's ESP-NN, which roughly doubled inference speed for my project:
https://github.com/espressif/esp-nn https://github.com/espressif/tflite-micro-esp-examples
And for really tiny MCUs there's TinyMaix, which could perhaps be useful; it doesn't support many operators, but it did work in my testing for simple networks:
https://github.com/sipeed/TinyMaix
- Figuring out memory needs and performance is a bit trickier. I've simply been using the torchinfo module, plus the graph output and statistics that onnx2tf displays, to see roughly how many muls the model uses and its approximate parameter and tensor memory usage (there's a torchinfo sketch after this list). Then I've run an improvement cycle: "train" the model for one step, deploy it to the hardware to measure the FPS, and adjust the hyperparameters and model architecture until the FPS is acceptable. Then train it fully to see if that model configuration can do the job, and iterate...
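To illustrate the quantization caveat above, here's a minimal sketch of full-integer (int8) conversion with the TFLite converter, run on the SavedModel that onnx2tf produces. The paths, input size and representative dataset are assumptions, not the project's actual code:

```python
import numpy as np
import tensorflow as tf

# Assumes the SavedModel was produced by onnx2tf (e.g. onnx2tf -i model.onnx -o model_saved_model)
converter = tf.lite.TFLiteConverter.from_saved_model("model_saved_model")

def representative_dataset():
    # Ideally: real camera frames, preprocessed the same way as during training
    # (note the NHWC/channels-last shape after onnx2tf conversion)
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 ops so unsupported ops fail loudly instead of silently
# falling back to float (which shows up as Quantize/Dequantize pairs)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```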
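And a minimal sketch of the kind of torchinfo summary mentioned in the last point, for estimating parameter counts and mult-adds before deploying. The model and input size are placeholders:

```python
import torch.nn as nn
from torchinfo import summary

# Placeholder model; substitute your own detection network
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=1),
)

# Prints per-layer output shapes, parameter counts and estimated mult-adds,
# which gives a rough feel for whether the model can fit on the MCU
summary(model, input_size=(1, 3, 96, 96))
```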
mikljohansson t1_jckedf9 wrote
Reply to [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Very interesting work! I've been following this project for a while now
Can I ask a few questions?
What's the difference between RWKV-LM and ChatRWKV? E.g. is ChatRWKV mainly RWKV-LM but streamlined for inference and ease of use, or are there more differences?
Are you planning to fine-tune on the Stanford Alpaca dataset (as was recently done for LLaMA and GPT-J to create instruct versions of them), or a similar GPT-generated instruction dataset? I'd love to see an instruct-tuned version of RWKV-LM 14B with an 8k+ context length!