Comments
That_Violinist_18 t1_j88ilse wrote
I keep hearing this argument, but I also keep hearing that models are hitting 60%+ of peak throughput for GPUs when optimizations like FlashAttention and other things are considered.
So how much room is there for alternative architectures when the current hardware only leaves at most 40% of its peak performance on the table?
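As a rough check on that arithmetic, here is a minimal sketch of the headroom implied by a given utilization. It assumes the hardware's peak throughput stays fixed, which is the premise of the question; the function name is made up for illustration.

```python
def max_speedup(utilization: float) -> float:
    """Upper bound on speedup if the same hardware went from `utilization` to 100% of peak."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


# At 60% of peak, closing the remaining gap buys at most about 1.67x.
print(f"{max_speedup(0.60):.2f}x")  # 1.67x
```

This only bounds gains on the same chip. It says nothing about designs that change what the peak itself is.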
currentscurrents t1_j8agutn wrote
GPU manufacturers are aware of the memory bandwidth limitation, so they don't put in more tensor cores than they would be able to feed with the available memory bandwidth.
Notice that the A100 actually has fewer tensor cores than the V100. The tensor cores got faster, but they're still memory bottlenecked, so there's no advantage to having more of them.
That_Violinist_18 t1_j8ed3j9 wrote
So should we expect much higher peak throughput numbers from more specialized hardware?
I have yet to hear of any startups in the ML hardware space advertising this.
currentscurrents t1_j8em94v wrote
Samsung's working on in-memory processing. This is still digital logic and Von Neumann, but by putting a bunch of tiny processors inside the memory chip, each has its own memory bus that it can access in parallel.
Most research on non-Von-Neumann architectures is focused on SNNs. Both startups and big tech are working on analog SNN chips. So far these are proof of concept; they work and achieve extremely low power usage, but they're not at a big enough scale to compete with GPUs.
erf_x t1_j87yrgi wrote
Cerebras does this
norcalnatv OP t1_j84wfs7 wrote
"Our model is built from the ground up on a per-inference basis, but it lines up with Sam Altman’s tweet and an interview he did recently. We assume that OpenAI used a GPT-3 dense model architecture with a size of 175 billion parameters, hidden dimension of 16k, sequence length of 4k, average tokens per response of 2k, 15 responses per user, 13 million daily active users, FLOPS utilization rates 2x higher than FasterTransformer at <2000ms latency, int8 quantization, 50% hardware utilization rates due to purely idle time, and $1 cost per GPU hour. Please challenge our assumptions"
LetterRip t1_j85b07d wrote
Why not int4? Why not pruning? Why not various model compression tricks? int4 halves latency. At minimum they would do mixed int4/int8.
https://arxiv.org/abs/2206.01861
Why not distillation?
https://transformer.huggingface.co/model/distil-gpt2
NVIDIA, using FasterTransformer and Triton Inference Server, has a 32x speedup over baseline GPT-J.
I think their assumptions are at least an order of magnitude pessimistic.
As someone else notes, the vast majority of queries can be cached. Also there would likely be a Mixture of experts. No need for the heavy duty model when a trivial model can answer the question.
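A minimal sketch of why int4 could roughly halve latency, assuming decoding is memory-bandwidth-bound so that every weight is read once per generated token. The bandwidth figure is a hypothetical placeholder and not a measured number for any particular system.

```python
def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the weights at a given precision."""
    return params * bits_per_weight / 8


def min_seconds_per_token(params: float, bits_per_weight: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    return weight_bytes(params, bits_per_weight) / bandwidth_bytes_per_s


PARAMS = 175e9       # model size from the assumptions quoted in this thread
BANDWIDTH = 1.6e13   # ASSUMPTION: hypothetical aggregate memory bandwidth, bytes/s

int8 = min_seconds_per_token(PARAMS, 8, BANDWIDTH)
int4 = min_seconds_per_token(PARAMS, 4, BANDWIDTH)
print(f"int8 weights: {weight_bytes(PARAMS, 8) / 1e9:.1f} GB")  # 175.0 GB
print(f"int4 weights: {weight_bytes(PARAMS, 4) / 1e9:.1f} GB")  # 87.5 GB
print(f"int8 / int4 latency bound: {int8 / int4:.1f}x")         # 2.0x
```

The 2x ratio holds for any bandwidth value, since it cancels out. It would shrink if the workload were compute-bound instead.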
norcalnatv OP t1_j84wt52 wrote
If the ChatGPT model were ham-fisted into Google’s existing search businesses, the impact would be devastating. There would be a $36 Billion reduction in operating income. This is $36 Billion of LLM inference costs.
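For anyone who wants to challenge the numbers, here is a rough sketch of how a per-inference cost estimate can be assembled from the ChatGPT assumptions quoted earlier in the thread. It does not reproduce the article's model or the $36 Billion search figure. The per-GPU peak, the achieved model FLOPS utilization, and the "2 FLOPs per parameter per token" rule are assumptions added here for illustration.

```python
def daily_inference_cost(
    params: float = 175e9,                 # from the quoted assumptions
    tokens_per_response: int = 2_000,      # from the quoted assumptions
    responses_per_user: int = 15,          # from the quoted assumptions
    daily_active_users: float = 13e6,      # from the quoted assumptions
    cost_per_gpu_hour: float = 1.0,        # from the quoted assumptions
    idle_utilization: float = 0.5,         # "50% hardware utilization ... due to purely idle time"
    peak_ops_per_s: float = 624e12,        # ASSUMPTION: A100 dense int8 peak per GPU
    model_flops_utilization: float = 0.3,  # ASSUMPTION: hypothetical achieved fraction of peak
):
    """Return (tokens per day, FLOPs per day, dollars per day)."""
    tokens = tokens_per_response * responses_per_user * daily_active_users
    # ASSUMPTION: about 2 FLOPs per parameter per generated token;
    # this ignores attention cost and prompt processing.
    flops = 2 * params * tokens
    busy_gpu_hours = flops / (peak_ops_per_s * model_flops_utilization) / 3600
    # Idle time means more GPU hours are paid for than are actually used.
    billed_gpu_hours = busy_gpu_hours / idle_utilization
    return tokens, flops, billed_gpu_hours * cost_per_gpu_hour


tokens, flops, dollars = daily_inference_cost()
print(f"{tokens:.3g} tokens/day, {flops:.3g} FLOPs/day, ${dollars:,.0f}/day")
```

The dollar figure is only as good as the two assumed hardware numbers. Doubling `model_flops_utilization` halves the cost, which is why the utilization assumption matters so much.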
Himalun t1_j8593ax wrote
It’s worth noting that both MS and Google own the data centers and hardware so it is likely cheaper for them to run. But still expensive.
Downchuck t1_j8500e1 wrote
Perhaps the number of unique queries is overstated: with vector similarity search and result caching, the vast majority of lookups would be duplicates of searches that have already been materialized. OpenAI has now introduced a "premium" option, which suggests there is a market for premium search and room for more cash inflows. This may change their spend strategy, perhaps spending less on marketing and more on hardware.
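A minimal sketch of the caching idea, assuming a similarity threshold decides when two queries count as the same. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and `SemanticCache`, `answer`, and `run_model` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # ASSUMPTION: toy bag-of-words vector standing in for a learned embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def lookup(self, query: str):
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query: str, result: str) -> None:
        self.entries.append((embed(query), result))


def answer(query: str, cache: SemanticCache, run_model) -> str:
    """Serve from the cache when a similar query was seen; otherwise run the model."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    result = run_model(query)
    cache.store(query, result)
    return result
```

A real deployment would replace the linear scan with an approximate nearest-neighbor index, and would have to decide how stale a cached answer is allowed to get.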
currentscurrents t1_j86gori wrote
In the long run, I think this is something that will be solved with more specialized architectures for running neural networks. TPUs and Tensor Cores are great first steps, but the Von Neumann architecture is holding us back.
Tensor Cores are very fast. But since the Von Neumann architecture has separate compute and memory connected by a bus, the entire network has to travel through the memory bus for every step of training or inference. The overwhelming majority of time is spent waiting on this:
>200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
A specialized architecture that physically implements neurons on silicon would no longer have this bottleneck. Since each neuron would be directly connected to the memory it needs (weights, data from previous layer) the entire network could run in parallel regardless of size. You could do inference as fast as you could shovel data through the network.
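To put the cycle counts quoted above in proportion, here is a small sketch of the share each stage takes of the 235-cycle total. It uses only the numbers from the quote.

```python
# Cycle counts from the quote above.
cycles = {"global memory": 200, "shared memory": 34, "tensor core": 1}
total = sum(cycles.values())  # 235

for stage, n in cycles.items():
    print(f"{stage}: {n / total:.1%}")
# global memory: 85.1%
# shared memory: 14.5%
# tensor core: 0.4%

memory_share = (cycles["global memory"] + cycles["shared memory"]) / total
print(f"waiting on memory: {memory_share:.1%}")  # 99.6%
```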