Comments
maxToTheJ t1_ir6xkzx wrote
Where was the paper reviewer, especially given that it's a Nature article? You did a better job.
pm_me_your_pay_slips t1_ir79lnv wrote
They know deepmind papers bring in readers.
harharveryfunny t1_ir76pn9 wrote
Well, the gist of it is that they first recast the minimal-multiplications matmul problem as decomposing a 3-D tensor into a minimal number of rank-1 factors, then use RL to perform this decomposition step by step, with the reward favoring the minimum number of steps.
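Concretely, the object being decomposed for the 2x2 case is a small 3-D tensor, and a complete multiplication algorithm is a set of rank-1 factors that sum to it; the RL agent's "game" is peeling off one factor per move until nothing is left. A minimal numpy sketch (the U/V/W factors below are just Strassen's well-known rank-7 solution written in that form, with my own index conventions, not anything taken from the paper):

```python
import numpy as np

# Matmul tensor T for 2x2 matrices: T[i, j, k] = 1 iff the product A_i * B_j
# contributes to output entry C_k (A, B, C flattened row-major).
n = 2
T = np.zeros((n * n, n * n, n * n), dtype=int)
for i in range(n):
    for j in range(n):
        for k in range(n):
            # C[i, j] += A[i, k] * B[k, j]
            T[i * n + k, k * n + j, i * n + j] = 1

# Strassen as a rank-7 decomposition: column r of U/V/W is one multiplication,
# namely (U[:, r] . vec(A)) * (V[:, r] . vec(B)), scattered into C by W[:, r].
U = np.array([[1, 0, 1, 0, 1, -1, 0],
              [0, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, -1]])
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 1],
              [1, 0, -1, 0, 1, 0, 1]])
W = np.array([[1, 0, 0, 1, -1, 0, 1],
              [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [1, -1, 1, 0, 0, 1, 0]])

# The 7 rank-1 terms sum exactly to T; finding fewer such terms = a faster algorithm.
assert np.array_equal(np.einsum('ir,jr,kr->ijk', U, V, W), T)
```

A rank-R decomposition of the analogous tensor for a larger block size is exactly a recipe for multiplying those blocks with R multiplications.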
That said, I don't understand *why* they are doing it this way.
- Why solve the indirect decomposition problem rather than directly searching for factors of the matmul itself?
- Why use RL rather than some other solution-space search method, such as an evolutionary algorithm? Brute-force checking of all solutions is off the table since the search space is massive.
Mr_Smartypants t1_ir7rbxp wrote
At the end of RL training, they don't just have an efficient matrix multiplication algorithm (sequence of steps), they also have the policy they learned.
I don't know what that adds, though. Maybe it will generalize over input size?
purplebrown_updown t1_ir8zsto wrote
Sounds like more marketing than substance, which DeepMind is known for.
master3243 t1_ir9457u wrote
They claim it's provably correct and faster. Matmul is one of the most used algorithms and is heavily researched (and still has major open problems).
Would you like to step up and prove yourself in that competitive area?
purplebrown_updown t1_ir97rwy wrote
I don't think you read the above post. You should, so that you can stop drinking the Kool-Aid.
master3243 t1_ir99ne0 wrote
I have, and I always have skepticism about DL.
But the post above doesn't even raise any theoretical or practical problems with the paper. Claiming that it's dense or that it's missing a GitHub repo isn't a criticism that weakens a research paper. Sure, those are nice to have, but they're definitely not requirements.
ReginaldIII t1_ir9tsc8 wrote
You're correct, I haven't pointed out anything wrong with the paper conceptually. It appears to work. Their matmul results are legitimate and verifiable. Their JAX benchmarks do produce the expected results.
In exactly the same way AlphaZero and AlphaFold do demonstrably work well. But it's all a bit moot and useless when no one can take this seemingly powerful method and actually apply it.
If they had released the matmul code yesterday, people today would already be applying it to other problems and discussing it like we have done with Stable Diffusion in recent weeks. And with a massively simpler pipeline for getting results, because there's no dataset dependency, only compute, which can be remedied with longer training times.
master3243 t1_iraj7rp wrote
But the paper was released literally yesterday?!
How did you already conclude that "no one can [...] actually apply it"?
Nowhere else in science do we hold such scrutiny, and it's ridiculous to judge how useful a paper is without at least waiting 1-2 years to see what comes out of it.
ML is currently suffering from the fact that people expect each paper to be a huge leap on its own, but that's not how science works or has ever worked. Science is a step-by-step process, and each paper is expected to be just a single step forward, not the entire mile.
ginger_beer_m t1_irb2xdn wrote
The paper was released yesterday, but they had months between manuscript submission and acceptance to put up a usable GitHub repo. I guess they didn't bother because... DeepMind.
ReginaldIII t1_irakn7b wrote
> How did you already conclude that "no one can [...] actually apply it"?
Because I read the paper and their supplementary docs and realized there's no way anyone could actually implement this given its current description.
> ML is currently suffering from the fact that people expect each paper to be a huge leap on its own,
I don't expect every paper to be a huge leap. I expect when a peer-reviewed publication is publicly released in NATURE that it is replicable!
master3243 t1_iraqxox wrote
I will repeat the same sentiment: it was released yesterday.
> publicly released in NATURE that it is replicable
It is replicable, they literally have the code.
ReginaldIII t1_irav3ov wrote
So if the paper is ready to be made public, why not release the code publicly at the same time?
> It is replicable, they literally have the code.
Replicable by the people who have access to the code.
If you are ready to publish the method in Nature you can damn well release the code with it! Good grief, what the fuck are you even advocating for?
master3243 t1_irbx3a6 wrote
What???
I have no idea what you're talking about, their code and contribution is right here https://github.com/deepmind/alphatensor/blob/main/recombination/sota.py
Their contributions are lines 35, 80, and 88.
[deleted] t1_irbz43l wrote
[removed]
master3243 t1_irc4wwf wrote
What are you talking about? They definitely don't need to release that (it would be nice, but it's not required). By that metric, almost ALL papers in ML fail to meet that standard. Even the papers that go above and beyond and RELEASE THE FULL MODEL don't meet your arbitrary standard.
Sure, the full code would be nice, but ALL THEY NEED to show us is a PROVABLY CORRECT SOTA matrix multiplication algorithm, which proves their claim.
Even the most advanced breakthrough in DL (in my opinion), which is AlphaFold, where we have the full model, doesn't meet your standard, since (as far as I know) we don't have the code for training the model.
There are 4 levels of code release
Level 0: No code released
Level 1: Code for the output obtained (this only applies to outputs that no human/machine could otherwise obtain, such as protein folds for previously uncomputed structures, matrix factorizations, or solutions to large NP problems that can't be solved using classical techniques)
Level 2: Full final model release
Level 3: Full training code / hyperparameters / everything
On the above scale, as long as a paper achieves Level 1, it proves that the results are real and we don't need to take their word for it; thus it should be published.
If you want to talk about openness, then sure I would like Level 3 (or even 2).
But the claim that the results aren't replicable is rubbish. This is akin to a mathematician showing you the FULL, provably correct matrix multiplication algorithm he came up with that beats the SOTA, and you claiming it's "not reproducible" because you want all the steps he took to reach that algorithm.
The steps taken to reach an algorithm are NOT required to show that an algorithm is provably correct and SOTA.
EDIT: I think you're failing to see the difference between this paper (and similarly AlphaFold) and papers that claim they developed a new architecture or a new model that achieves SOTA on a dataset. In that case, I'd agree with you: showing us the results is NOT ENOUGH for me to believe that your algorithm/architecture/model actually does what you claim it does. But in this case, the result in itself (i.e. the matrix factorization) is enough to prove the claim, since that kind of result is impossible to fake. Imagine I release a groundbreaking paper that says I used deep learning to prove P≠NP and attach a pdf with a FULL PROOF that P≠NP (or any other unsolved problem) that is 100% correct. Would I also need to release my model? Would I need to release the code I used to train the model? No! All I would need to release for my publication is the pdf that contains the proof.
[deleted] t1_irc5eys wrote
[removed]
master3243 t1_irc6to3 wrote
I literally cannot tell if you're joking or not!
If I release an algorithm that beats SOTA, along with a full and complete proof, would I also need to attach all my notes and the different intuitions behind the decisions I took???
I can 100% tell you've never worked on publishing improvements to algorithms or math proofs, because NO ONE DOES THAT. All they need is (1) the theorem/algorithm and (2) proof that it's correct/beats SOTA.
ReginaldIII t1_irc7nt0 wrote
I'm done.
You only care about the contribution to matmul. Fine.
There's a much bigger contribution to RL being used to solve these types of problems (wider than just matmul). But fine.
Goodbye.
master3243 t1_irdoyzz wrote
> You only care about the contribution to matmul
False, which is why I said it would have been better if they released everything. I definitely personally care more about the model/code/training process than the matmul result.
However, people are not one-dimensional thinkers. I can simultaneously say that DeepMind should release all their resources AND say that this work is worthy of a Nature publication and isn't missing any critical requirements.
[deleted] t1_ire7cmq wrote
[removed]
ReasonablyBadass t1_ir6zjcx wrote
And since ML is a lot of matrix multiplication we get faster ML which leads to better matrix multiplication techniques...
finitearth t1_ir77jam wrote
Matmul all the way down..
ActiveLlama t1_ir8fczf wrote
It feels like we are getting closer and closer to the singularity.
ThatInternetGuy t1_ir7zmqm wrote
And a GPU is mainly matrix multiplication hardware. 3D graphics rendering is parallel matrix multiplication on the 3D model vertices and on the buffer pixels, so it's not really an unsolved problem, since all graphics cards are designed to do extremely fast matrix multiplication.
_matterny_ t1_ir8ht0g wrote
But even a GPU has a maximum matrix size it can process. More efficient algorithms could improve GPU performance if they really are new.
master3243 t1_ir94he8 wrote
It is an unsolved problem; there's no known optimal algorithm yet.
Unless you have a proof you're hiding from the rest of the world?
> The optimal number of field operations needed to multiply two square n × n matrices up to constant factors is still unknown. This is a major open question in theoretical computer science.
ThatInternetGuy t1_ir96weg wrote
Nvidia Tensor Cores implement GEMM for extremely fast matrix-matrix multiplication. This has never been figured out for ages; however, it's up for debate if the AI could improve the GEMM design to allow an even faster matrix-matrix multiplication.
Matrix-Matrix Multiplication has never been slow. If it were slow, we wouldn't have all the extremely fast computing of neural networks.
If you've been following the latest machine learning news, you'll have heard about the recent release of Meta's AITemplate, which speeds up inference by 3x to 10x. That's possible thanks to the Nvidia CUTLASS team, who have made matrix-matrix multiplication even faster.
master3243 t1_ir9a1st wrote
Absolutely nothing you said contradicts my point that the optimal algorithm is an unsolved problem, and thus you can't claim that it's impossible for an RL agent to optimize over current methods.
ReginaldIII t1_ir9uakx wrote
> however, it's up for debate if the AI could improve the GEMM design to allow an even faster matrix-matrix multiplication.
Nvidia have been applying RL for chip design and optimization: https://developer.nvidia.com/blog/designing-arithmetic-circuits-with-deep-reinforcement-learning/
So I think it's pretty clear that they think it's possible.
ThatInternetGuy t1_ir9v9aj wrote
Yes, 25% improvement.
My point is, Nvidia CUTLASS has practically improved matrix multiplication by 200% to 900%. Why do you guys think matrix multiplication is currently slow on GPUs? I don't get that. The other guy said it's an unsolved problem. There is nothing unsolved when it comes to matrix multiplication. It has been vastly optimized over the years since RTX first came out.
It's apparent that RTX Tensor Cores and CUTLASS have really solved it. It's no coincidence that the recent explosion of ML progress came when Nvidia put in more Tensor Cores, and now, with CUTLASS templates, all models benefit from a 200% to 900% performance boost.
This RL-designed GEMM is the icing on the cake. Giving that extra 25%.
ReginaldIII t1_ir9w5x1 wrote
> It's apparent that RTX Tensor Cores and CUTLASS have really solved it.
You mean more efficiency was achieved using a novel type of hardware implementing a state of the art algorithm?
So if we develop methods for searching for algorithms with even better op requirements, we can work on developing hardware that directly leverages those algorithms.
> Why do you guys think matrix multiplication is currently slow on GPUs? I don't get that.
I don't think that. I think that developing new hardware and implementing new algorithms that leverage that hardware is how it gets even faster.
And it's an absurd statement for you to make because it's entirely relative. Go back literally 4 years and you could have said the same thing, despite how much has happened since.
> This has never been figured out for ages; however, it's up for debate if the AI could improve the
> The other guy said it's an unsolved problem. There is nothing unsolved when it comes to matrix multiplication. It has been vastly optimized over the years since RTX first came out.
The "other guy" is YOU!
ThatInternetGuy t1_ir9zlmq wrote
This is not the first time RL has been used to find efficient routings on silicon wafers and circuit boards. This announcement is good but not that good: a 25% reduction in silicon area.
I thought they discovered a new Tensor Core design that gives at least 100% improvement.
Ulfgardleo t1_ir72pix wrote
Why is this a Nature paper?
1. Strassen is already known not to be the fastest algorithm in terms of floating-point multiplications: https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication
2. Strassen itself is barely used, because its implementation is inefficient except for the largest of matrices. Indeed, Strassen is often implemented using a standard matmul for the smallest blocks and only used for very large matrices.
3. Measuring the implementation complexity in floating mul is kinda meaningless if you pay for it with a multiple of floating additions. It is a meaningless metric (see 2.)
neanderthal_math t1_ir7k9jl wrote
I'm a little confused by the purpose of this paper too. If the point is to show that an RL algorithm found better bounds than Strassen, then that's cool. But are they claiming that this is something a compiler would use in practice? How does this work with fixed SIMD sizes?
Lairv t1_ir7p0xt wrote
In the article they try 2 types of reward: minimizing the rank of the tensor decomposition (i.e. minimizing the total number of multiplications), and minimizing the runtime of the algorithm on given hardware (they tried Nvidia V100 GPUs and TPUv2).
The latter could be genuinely useful, since their graphs show that the discovered algorithms reach better performance than cuBLAS (Fig. 5).
neanderthal_math t1_ir8pk44 wrote
Thank you. That’s pretty cool.
AssumptionIcy5363 t1_irbuwfh wrote
What I would do is train the model on many small and many useful combinations of matrix sizes.
Then I would use the normal algorithm for every other combination of sizes.
golfstreamer t1_ir7cdtu wrote
> Measuring the implementation complexity in floating mul is kinda meaningless if you pay for it with a multiple of floating additions. It is a meaningless metric (see 2.)
I was confused about this for a bit, but I think it makes sense. When you use the algorithm for block matrix multiplication, the multiplication operations far outweigh the addition operations. If applied recursively, this new method should give lower complexity (e.g. O(n^2.78) vs O(n^2.8)) than Strassen's.
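Right: recursing on a k x k scheme that uses R multiplications gives exponent log_k(R), so the numbers are easy to sanity-check (the 47 below is AlphaTensor's mod-2 multiplication count for 4x4 as I recall it from the paper, so treat that constant as an assumption):

```python
import math

def matmul_exponent(k, R):
    # Asymptotic exponent from recursively applying a k x k scheme with R multiplications.
    return math.log(R, k)

print(matmul_exponent(2, 8))   # naive 2x2 blocking: 3.0
print(matmul_exponent(2, 7))   # Strassen: ~2.807
print(matmul_exponent(4, 49))  # Strassen applied at two levels (4x4, 49 mults): ~2.807
print(matmul_exponent(4, 47))  # 4x4 with 47 mults (the mod-2 result): ~2.777
```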
master3243 t1_ir94pxk wrote
I don't think you're right, unless DeepMind is lying in the abstract of a Nature paper, which I highly doubt.
> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago
Ulfgardleo t1_ir95y3t wrote
Yeah, they are not right. SOTA is the laser method.
They even missed the huge improvement from 1981...
https://ieeexplore.ieee.org/document/4568320
It's all behind the wiki link above, btw.
Ulfgardleo t1_ir997hv wrote
The worst thing, however, is that they do not even cite the practically relevant memory-efficient implementation of Strassen (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887 ). One can argue that all matmul algorithms with better complexity than Strassen are irrelevant due to their constants, but not even comparing to the best memory implementation is odd, especially as they don't show an improvement in asymptotic complexity.
victotronics t1_ir9fc44 wrote
Yes yes and yes. Can half a dozen authors really be that ignorant that they don't know about all the work that's been done after Strassen? And how did this pass review?
To add to 2: numerical stability of Strassen is doubtful too.
emotionalfool123 t1_irdvwr8 wrote
Can you explain what numerical stability means in this case?
victotronics t1_ire6ha5 wrote
Behavior under roundoff. Floating-point numbers are not actually mathematical real numbers, so all algorithms are inexact. You want them to be not too inexact: small perturbations should give only small errors. The fact that Strassen (and other algorithms) sometimes subtracts quantities means you can get numerical cancellation.
Thorusss t1_ir9or2m wrote
The algorithm is plain faster on the most advanced hardware. For such an already heavily optimized area, that is very impressive.
>Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.
https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor
Ulfgardleo t1_ir9z7bx wrote
Thorusss t1_irb9hoa wrote
thanks
[deleted] t1_ir88qu5 wrote
[deleted]
Capable-Speed5915 t1_irb8idy wrote
As a way to demonstrate using AI for fundamental research? I mean, they did find algorithms with fewer multiplications.
It may be harder to apply or benefit from them (numerical stability), but it would still be a good scientific achievement, and it pushes the field forward towards exploring where else such techniques can be applied for algorithm discovery.
They also speculate about adding numerical stability as a metric, so maybe that will be future work at some point.
Either way, the paper does indeed do something new, and opens up new research avenues to explore.
Singularian2501 t1_ir696wb wrote
Blog Article from Deepmind about the paper: https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor
foreheadteeth t1_irbhdfb wrote
I'm a mathematician, one of my areas of research is linear algebra. To multiply 2x2 matrices, you do this:
[a b][e f]   [ae+bg af+bh]
[c d][g h] = [ce+dg cf+dh]
In the expression above, there are R=8 products. You can apply the expression above recursively to compute the product of N by N matrices "blockwise", where N = 2^n, but then the overall running time is O(N^3), which is the same as naive matrix multiplication. That being said, recursive blockwise multiplication is much faster than a more naive implementation on real hardware, for reasons of cache coherence (as much as a factor of 10 or more).
Strassen was able to multiply 2x2 matrices with only R=7 products. If you apply Strassen's algorithm recursively to matrices of size N = 2^n, then the running time is O(N^2.8074), where 2.8074 is really log_2(7). I think nowadays it is considered that these algorithms become useful when N > 1,000 or so.
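For concreteness, the recursive use of those R=7 products looks like this (a bare-bones sketch for power-of-two sizes only, falling back to the ordinary product below an arbitrary cutoff, which is roughly what practical implementations do):

```python
import numpy as np

def strassen(A, B, cutoff=128):
    """Strassen's 7-product recursion for square matrices with N = 2^n."""
    N = A.shape[0]
    if N <= cutoff:  # below the cutoff the naive product wins on real hardware
        return A @ B
    m = N // 2
    a, b, c, d = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    e, f, g, h = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

    # The 7 block products (additions/subtractions are what they cost extra).
    m1 = strassen(a + d, e + h, cutoff)
    m2 = strassen(c + d, e, cutoff)
    m3 = strassen(a, f - h, cutoff)
    m4 = strassen(d, g - e, cutoff)
    m5 = strassen(a + b, h, cutoff)
    m6 = strassen(c - a, e + f, cutoff)
    m7 = strassen(b - d, g + h, cutoff)

    C = np.empty_like(A)
    C[:m, :m] = m1 + m4 - m5 + m7
    C[:m, m:] = m3 + m5
    C[m:, :m] = m2 + m4
    C[m:, m:] = m1 - m2 + m3 + m6
    return C

A, B = np.random.rand(512, 512), np.random.rand(512, 512)
assert np.allclose(strassen(A, B), A @ B)
```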
I think everyone is pretty much convinced that for 2x2 matrices, R = 7 is the best possible. So then people tried k by k matrices, with k = 3, 4, etc., hoping to use far fewer multiplications than the naive algorithm, which needs R = k^3.
If you can find a matrix multiplication algorithm for small k by k matrices that takes R < k^3 multiplications, then you can use it recursively to find ever more efficient block matrix multiplication algorithms for size N = k^n, but of course if k is large, then N becomes insanely large.
For some large values of k, people have found algorithms with very small R < k^3, and as a result, at least theoretically, we know how to multiply matrices in O(N^2.3728596) time. However, this is purely theoretical, since the corresponding value of k is large, so N is astronomical.
In the present paper, the authors have investigated matrix multiplication algorithms for multiplying rectangular matrices A and B, with the following sizes:
A is p by q, and B is q by r, with p,q,r ∈ { 2,3,4,5 }.
In most cases, the best value of R that they found is what was already in the literature.
In the cases (p,q,r) ∈ { (3,4,5), (4,4,5), (4,5,5) } , this paper improved the value of R.
There are some further improved values of R if (p,q,r) ∈ { (4,4,4), (5,5,5) } but only in the case of mod-2 arithmetic.
Although the authors did not improve R in most cases, they claim they also made improvements of a different kind, which is easiest to explain by comparing Strassen and Winograd's algorithms (link is above).
As previously mentioned, Strassen's algorithm multiplies 2x2 matrices with R = 7 products, but it uses 18 additions or subtractions. Winograd's algorithm also multiplies 2x2 matrices with R=7 products, but uses a mere 15 additions and subtractions. Thus, for finite values of N, the recursive version of Winograd is slightly faster than recursive Strassen.
In the present paper, the authors also claim to have found new algorithms that require fewer additions and subtractions, even when the value of R is not improved compared to what was already known. As a result, recursive versions of these new algorithms are slightly faster for finite values of N, even if the big-O performance is not improved.
Aloekine t1_ircx86k wrote
Thanks, this was a helpful post. If I could ask a question,
Leaving aside the point about this being discovered with DRL (which is obviously astounding and cool), I'm trying to get a better sense of how widely applicable these new improvements are. There's another poster in the thread who's much more pessimistic about this being particularly applicable or the benchmarks being particularly relevant. What's your perspective?
foreheadteeth t1_irdomjk wrote
The matrix multiplication exponent currently sits at ω=2.3728596 and I don't think that's been changed with this paper. I don't think that any of the papers improving ω were in Nature, even though those are quite important papers, so I would advise people working in this field to send it to Nature in the future. That being said, I suspect Nature is not genuinely interested in improvements to ω; I suspect that the attraction for Nature is mainly in the "discovered with DRL", as you say.
In Fig 5, they claim speedups compared to standard implementations for matrix size N>8192, up to 23.9%. Also, they are faster than Strassen by 1-4%. That's not bad!
They also mentioned a curious application of their DRL, to find fast multiplication algorithms for certain classes of matrices (e.g. skew matrices). I don't think they've touched ω in that case either, but if their algorithms are genuinely fast at practical scales, that might be interesting, but again I didn't notice any benchmarks for reasonable matrix sizes.
(Edit: I had failed to notice the benchmark in Fig 5).
bigfish_in_smallpond t1_ir6gixy wrote
10-20% faster matrix multiplication algorithms are very impressive. Justifies all the money spent, haha.
ReginaldIII t1_ir6ixyl wrote
Faster, higher throughput, less energy usage... Yes it literally pays for itself.
M4mb0 t1_ir7396d wrote
Not really. There are other reasons why fast matrix multiplication algorithms like Strassen's are not used in practice and are more of theoretical than practical importance. In particular, numerical stability is often a concern.
Thorusss t1_ir9ps2x wrote
True. But numerical stability is much more important in long-running simulations like weather forecasts than in deep neural network training.
There is a reason they are often benchmarked with single or even half precision.
Ulfgardleo t1_ir7508n wrote
No, because these algorithms are terribly inefficient to implement as SIMD. They have nasty data-access patterns and need many more FLOPs when additions are also taken into account (just the final step of adding the elements into the result matrix takes more than twice the additions of a standard matmul for the results shown here).
neanderthal_math t1_ir7l0k3 wrote
In practice, do libraries like CUDA and MKL do matrix multiplication the standard way, or do they have fancy decompositions?
I remember when I was young, the ATLAS library would look at your hardware, run a bunch of matmuls, and figure out what the "optimal" configuration would be for your system.
Ulfgardleo t1_ir7lytl wrote
All standard unless very large. ATLAS just picks different kernels that "only" change the order of operations to maximize CPU utilization.
Red-Portal t1_ir7xeyo wrote
The funny thing is that the lesson of ATLAS and OpenBLAS was that matrix multiplication optimized to the assembly level by humans is still the best way to squeeze out performance.
harharveryfunny t1_ira5afy wrote
cuDNN supports Winograd on CUDA cores (not sure about Tensor cores) for convolution, but only for certain filter sizes such as 3x3.
Thorusss t1_ir9pbcd wrote
So why is matrix multiplication faster with it?
>Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.
Are you saying it would be slower if it had to multiply multiple matrices of the same dimension one after the other?
Ulfgardleo t1_ir9xy3t wrote
You seem to be confused.
1. Experiment 1 uses small 5x5 matrices, not block matrices. There they only count the number of multiplications. These are not faster than SIMD implementations of 5x5 matrix mults, otherwise they would have shown that off proudly.
2. Experiment 2 was about 4x4 block matrices. But here the "10-20% faster than the commonly used algorithms" is actually an overstatement of the results. For GPUs, their implementation is only 5% faster than their default JAX implementation of Strassen. The difference on TPU could just mean that their JAX compiler sucks for TPUs. (//Edit: by now I low-key assume that the 10-20% refers to standard cBLAS, because I don't get 20% compared to Strassen for any result in Figure 5 (and how could they, given they never even get more than a 20% improvement over cBLAS).)
3. They do not cite any of the papers concerned with efficient implementations of Strassen, especially the memory-efficient scheme from 1994: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887. It is unclear whether a GPU implementation of that would be faster, since they do not even discuss the GPU implementation of their Strassen variant. They do not claim that their algorithm is faster in complexity, so we are completely reliant on their implementation of Strassen making sense.
mgostIH t1_ir7euf7 wrote
You can apply it at the top-level call of your matrix multiplication and do everything inside the standard way; you still gain the efficiency, since these algorithms also work in block-matrix form.
Ulfgardleo t1_ir7m5md wrote
Is it? I could not see from the paper whether they assume non-commutative multiplication in their small matrix optimization.
//Edit: they do a 4x4 block matrix, but the gains are less than 5% over the existing Strassen algorithm.
[deleted] t1_ir6x684 wrote
[deleted]
DigThatData t1_ir85b9f wrote
> less energy usage.
you wish.
Lairv t1_ir7qmlc wrote
Cool paper. Worth noting that such systems require huge resources to train; they mention it briefly in the appendix: "1,600 TPUv4 actors to play games, and 64 TPUv3s to train the networks, for a week". For reference, AlphaZero for Go was trained with 5,000 TPUv1 actors to generate games and 64 TPUv2s to train the networks, for 8 hours. I still find it unfortunate that not much work has been done to reduce the resources needed to train AlphaZero-like systems, even though AlphaZero is already 5 years old.
Thorusss t1_ir9oyt3 wrote
So? They only had to train it once; the more efficient algorithm is now in humanity's toolbox for eternity. A 10-20% speed increase can probably pay that back this year with the compute DeepMind uses alone.
Lairv t1_ira9aul wrote
My point is that, considering these methods can be applied to almost any scientific field, it would be beneficial if more than just Google, Microsoft, Facebook, and OpenAI could train them.
victotronics t1_ir9f2a0 wrote
Ok, I'm no expert but
> improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago
looks very suspicious. There has been *tons* of work on improving Strassen. It would be mind-blowing if they didn't know about that research.
Then: Strassen and its further developments are theoretical curiosities. Numerically they suffer from grave instabilities.
This stuff should really be posted in r/math.
loukitkhemka t1_ir69j10 wrote
Awesome. I wonder if anyone will use this method to find good algorithms for NP-hard problems. It would be amazing if we can find good algorithms this way.
flyfrog t1_ir6b1zc wrote
Heuristic approaches like AlphaFold seem better suited to that class of problems. This model creates provable solutions, which would be amazing for NP, but that's not likely.
Soundwave_47 t1_ir7olyd wrote
Agree that some intuition will probably be required for NP-Hard problems to encode knowledge that we've learned in other fields. A wholly probabilistic model would be harder.
Gere1 t1_irj4sg3 wrote
Is it useful though? I had the impression that the Strassen algorithm is already an optimization, and yet I'm not aware of it being used in practice on GPUs. Am I wrong and it is used in Nvidia GPUs, or is it a gimmick not worth building for? Maybe it's easier to do the conventional matrix multiplication in hardware and parallelize that?
[deleted] t1_irqrzv1 wrote
[removed]
[deleted] t1_ir6dx4s wrote
[deleted]
-bb_ t1_ir73phc wrote
Last month I tried a very similar thing for fun (but on a smaller scale (3x3 @ 3x3) and with a different approach), and only when I read this paper did I realize that there are already algorithms for the exact same thing I tried :c

We can represent each entry of the 3x3 matrices with a letter, and there are only 81 possible products of two characters such that they come from different matrices; these pairs represent multiplications. So I decided to give each of them a unique slot in a vector of length 81. Then you create several pairs of sums of single characters (each sum consisting of elements from one matrix). The sums in a pair get multiplied, and the result of this multiplication can be represented as a vector in the space mentioned earlier. We can also represent each cell of the resulting matrix as such a vector.

So if we create enough pairs, we just need to solve the L0-regularized linear system AX = b, where A is the stacked vectors of the sum-pairs and b holds the vectors of the resulting matrix (I hoped I could solve it with Lasso, because L1 roughly approximates L0, and that it would pick only the suitable components). Note that without regularization this system has infinitely many solutions: A is 81 x #(pairs of sums), X is #(pairs of sums) x 9, b is 81 x 9. A solution beats Strassen's algorithm if ||X.sum(-1)||_L0 / 27 < 7/8 (because Strassen's algorithm replaces 8 multiplications with 7), and now that I've finally read the paper I also know it should beat Laderman's, i.e. ||X.sum(-1)||_L0 / 27 < 23/27.
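For anyone curious, the setup looks roughly like this (a loose sketch of what I described; the random pair generation and the Lasso alpha are arbitrary choices, and since L1 only approximates L0 there's no guarantee it lands anywhere near 23 multiplications):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, P = 3, 300  # 3x3 matrices, P candidate pairs of sums (arbitrary)

# Each candidate pair: coefficient vectors u, v in {-1, 0, 1}^9 over the entries
# of the first and second matrix, i.e. the pair of sums (u . a, v . b).
U = rng.integers(-1, 2, size=(P, n * n))
V = rng.integers(-1, 2, size=(P, n * n))

# Column p of A encodes the product (u_p . a)(v_p . b) in the 81-dim "a_i * b_j" basis.
A = np.stack([np.outer(U[p], V[p]).ravel() for p in range(P)], axis=1)  # 81 x P

# b encodes the 9 entries of C = A_mat @ B_mat in the same basis:
# C[i, j] = sum_k a[i, k] * b[k, j].
b = np.zeros((n ** 4, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            b[(i * n + k) * n * n + (k * n + j), i * n + j] = 1.0

# Lasso (L1) as a stand-in for the L0 penalty: one regression per entry of C.
X = np.zeros((P, n * n))
for col in range(n * n):
    X[:, col] = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000).fit(A, b[:, col]).coef_

mults = np.count_nonzero(np.abs(X).sum(axis=1) > 1e-6)  # rows (pairs) actually used
print(f"multiplications used: {mults} (naive: 27, Laderman: 23)")
```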
AllowFreeSpeech t1_ir79coc wrote
Is it limited to matmul or is it actually and demonstrably a generic algorithm optimizer?
undefdev t1_ir70h73 wrote
Amazing! Faster matrix multiplication for everyone!
Could someone who knows a little about reinforcement learning tell us why they didn't build on MuZero or Muesli, though?
They don't even mention them, so maybe the answer should be rather clear, but I barely know anything about reinforcement learning.
ReginaldIII t1_ir6jxgl wrote
Incredibly dense paper. The paper itself doesn't give us much to go on realistically.
The supplementary paper gives a lot of algorithm listings in pseudo-Python code, but it's significantly less readable than Python.
The GitHub repo gives us nothing to go on except some bare-bones notebook cells for loading their pre-baked results and executing them in JAX.
Honestly, the best and most concise way they could explain how they applied this to the matmul problem would be the actual code.
Neat work but science weeps.