plocco-tocco t1_jdj9is4 wrote
Reply to comment by ThirdMover in [D] I just realised: GPT-4 with image input can interpret any computer screen, any userinterface and any combination of them. by Balance-
It would be quite expensive to do, though. You'd have to run inference very fast on multiple images of your screen; I don't know if it's even feasible.
ThirdMover t1_jdjf69i wrote
I'm not sure. How exactly does inference scale with the complexity of the input? The output would be very short: just enough tokens for the "move cursor to" command.
plocco-tocco t1_jdjx7qz wrote
The complexity of the input wouldn't change in this case, since it's always just a screen grab of the display. The catch is that you'd need to run inference at a certain frame rate to track the cursor, and that isn't cheap with GPT-4. I'm not sure what the latency or cost would be; I'd need access to the API to answer that.
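As a rough sketch of what that loop might look like (assuming the OpenAI Python SDK v1.x and Pillow's ImageGrab for capture; the model name, prompt, and one-frame-per-second rate are placeholders, not anything confirmed in this thread):

```python
# Hedged sketch: screen grab -> multimodal model -> short cursor command, in a loop.
import base64
import io
import time

from openai import OpenAI
from PIL import ImageGrab  # screen capture via Pillow

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_as_data_url() -> str:
    """Grab the screen and encode it as a base64 PNG data URL."""
    frame = ImageGrab.grab()
    buf = io.BytesIO()
    frame.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/png;base64,{b64}"


def next_action(prompt: str) -> str:
    """Send one frame and ask for a short 'move cursor to (x, y)' command."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        max_tokens=20,  # the output really is tiny: just the command
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": screen_as_data_url()}},
            ],
        }],
    )
    return response.choices[0].message.content


# ~1 frame per second: per-image cost and round-trip latency are what
# make higher frame rates impractical, which is the concern above.
while True:
    print(next_action("Reply with only: move cursor to (x, y) for the OK button."))
    time.sleep(1.0)
```

Even at one frame per second, every request carries a full screenshot, so the input side dominates the cost regardless of how short the output is.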
thePaddyMK t1_jdlr6bp wrote
There is a paper that operates a website to generate data traces, sidestepping tools like Selenium: https://mediatum.ub.tum.de/doc/1701445/1701445.pdf
It's only a simple NN, though, no LLM behind it.