Submitted by seattleite849 t3_10ryu6b in MachineLearning
Hi friends! I ran into this problem enough times at my last few jobs that I built a tool to solve it. I spent many hours building Docker containers for my Python functions, since many data science modules require compiling C libraries (which significantly speed up compute-intensive routines such as numerical math). Then I had to deploy those containers to AWS Lambda, or to Fargate if the processes needed more CPU or memory or ran longer than 15 minutes, and wire the functions together with queues, databases, and blob storage. All of that made iterating on the actual code, which usually wasn't even that complex, slow.
I made cakework https://github.com/usecakework/cakework, a platform that lets you spin up your Python functions as serverless, production-scale backends with a single command. Using the client SDK, you submit requests, check status, and get results. You can also specify the amount of CPU (up to 16 cores) and memory (up to 128GB) per request, which is helpful when data size and complexity vary from request to request.
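To give a feel for that workflow, here's a rough sketch of submitting a request through a client SDK like this. The names (`Client`, `run`, `get_status`, `get_result`) and the per-request compute parameters are illustrative assumptions, not necessarily the real cakework API — see the repo and docs above for the actual interface:

```python
# Hypothetical sketch only: class/method names are assumptions, not the actual cakework SDK.
import time

from my_serverless_sdk import Client  # placeholder import, not a real package name

client = Client(project="file-processor")

# Submit a request, asking for extra CPU/memory for this particular input.
request_id = client.run(
    task="process_daily_file",
    params={"bucket": "raw-data", "key": "2023-02-01/events.csv"},
    cpu=4,          # cores, per request
    memory=8192,    # MB, per request
)

# Poll for completion, then fetch the result.
while client.get_status(request_id) == "IN_PROGRESS":
    time.sleep(5)

result = client.get_result(request_id)
```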
A common pattern I built cakework for is file processing for ML (a rough sketch follows the list):
- ingest data from some source daily, or in response to an external event (data written to blob storage)
- run my function (often using pandas/numpy/scipy)
- write results to storage, update database
- track failures and re-run/fix
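As a concrete sketch of that pattern, the worker function itself might look something like the following, using boto3 and pandas. The bucket layout, column names, and `process_daily_file` name are made-up assumptions, not taken from the cakework examples:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def process_daily_file(bucket: str, key: str) -> str:
    """Ingest one file from blob storage, transform it, and write the result back."""
    # 1. Ingest: pull the raw file from S3 (or any blob store).
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # 2. Run the function: the compute-heavy pandas/numpy/scipy work lives here.
    daily = df.groupby("customer_id")["amount"].sum().reset_index()

    # 3. Write results back to storage; a real pipeline would also update a
    #    database row so failures can be tracked and re-run.
    out_key = key.replace("raw/", "processed/")
    s3.put_object(Bucket=bucket, Key=out_key, Body=daily.to_csv(index=False).encode())
    return out_key
```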
It's open source <3. Here are some fun examples to get you started: https://docs.cakework.com/examples
Would love to hear your thoughts!
BasilLimade t1_j6ykjaz wrote
I'm looking at making a Docker image to host on AWS ECR, to contain some Python code and dependencies (over 250MB of dependencies, so I can't just zip up my modules as a Lambda "layer"). How does this compare to making my own Docker Lambda image?
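For reference, the container-image route described here usually boils down to a Dockerfile along these lines; the base image tag, file names, and handler name are assumptions for illustration:

```dockerfile
# Minimal sketch of a Lambda container image (image size limit is 10GB,
# so >250MB of dependencies is fine here).
FROM public.ecr.aws/lambda/python:3.11

# Install the heavy dependencies into the Lambda task root.
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt

# Copy the function code and point Lambda at the handler (module.function).
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]
```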