BitterAd9531

BitterAd9531 t1_j5idapl wrote

>trivially obvious that AI should never be open-source

Wow. Trivially obvious? I'd very much like to know how that statement is trivially obvious, because it goes against what pretty much every single expert in this field advocates.

Obviously open-source AI brings problems, but what is the alternative? A single entity controlling one of the most disruptive technologies ever? And ignoring for a second the obvious problems with that, how would you enforce it? Criminalize open-sourcing software? Can't say I'm a fan of this line of thinking.

5

BitterAd9531 t1_j5g52os wrote

>Besides that, OP stated that he wants to use a llm for this, not me.

Actually, I didn't. If you read my comment, you'd understand I would need the LLM to demonstrate the model that does the actual combining (which obviously wouldn't be an LLM). Seeing as there are currently no models that have watermarking, I'd have to write one myself to test the actual model that does the combining to circumvent the watermark. Either you didn't understand this, or you're once again taking single sentences out of context and making semi-valid points that have no relevance to the original discussion.

But honestly, I feel like this is completely beside the point. I've given you a high-level explanation of how these watermarks can be defeated, and you seem to be the only one who does not understand how they work.

4

BitterAd9531 t1_j5fcby3 wrote

>If you think, you can take two watermarked LLMs and 'trivially" combine their output as you stated, explain in detail how you do that in an automated way.

No thank you, I'm not going to write an LLM from scratch for a Reddit argument. And FWIW, I suspect that even if I did, you'd find some way to convince yourself that you're not wrong. You not understanding how this works doesn't impact me nearly enough to care that much. Have a good one.

18

BitterAd9531 t1_j5fal5s wrote

>no one seems to be even considering dealing with it in a serious way

Everyone has considered dealing with it, but everyone who understands the technology behind these models also knows that it's futile in the long term. The whole point of these LLMs is to mimic human writing as closely as possible, and the more they succeed, the more difficult detection becomes. They can be used to output text that is both more precise and more varied.

Countermeasures like watermarks will be trivial to circumvent while at the same time restricting the capabilities and performance of these models. And that's ignoring the elephant in the room, which is that once open-source models come out, it won't matter at all.

>this is the most pressing ethical issue in AI safety today

Why? It's long been known that the gap between AI and human capabilities will shrink over time. This is simply the direction we're going. Maybe it's time to adapt instead of trying to fight something inevitable. Fighting technological progress has never worked before.

People banking on being able to distinguish between AI and humans will be in for a bad time in the coming few years.

42

BitterAd9531 t1_j5f5olr wrote

I think you're misunderstanding how these watermarks work. The watermark is encoded in the choice of tokens, so combining or rewriting will weaken it to the point where it can no longer be used for accurate detection. "Robust" means a few tokens can be changed, but changing enough tokens will have an impact eventually.

The semantics don't change because in language, there are multiple ways to describe the same thing without using the same (order of) words. That's literally what "rewriting" means.
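To make that concrete, here's a minimal, hypothetical sketch of the kind of token-level watermark being discussed (loosely in the spirit of published "green-list" schemes; the hashing and vocabulary are my own toy stand-ins, not any real implementation). A detector counts how many tokens fall on a pseudorandom "green" list; rewriting swaps green tokens for off-list synonyms and drags the score back toward chance.

```python
import hashlib

# Toy "green-list" watermark detector (illustrative only; real schemes
# seed the green list from preceding tokens and use a statistical test).
def is_green(token: str) -> bool:
    # Deterministically assign ~half of all tokens to the green list.
    return int(hashlib.sha256(token.encode()).hexdigest(), 16) % 2 == 0

def green_fraction(text: str) -> float:
    tokens = text.split()
    return sum(is_green(t) for t in tokens) / len(tokens)

# A watermarking sampler prefers green tokens; a human rewrite swaps in
# arbitrary synonyms, pulling the score back toward the ~0.5 expected
# by chance, below any detection threshold.
vocab = [f"word{i}" for i in range(100)]      # stand-in vocabulary
green = [w for w in vocab if is_green(w)]
red = [w for w in vocab if not is_green(w)]

watermarked = " ".join(green[:20])            # every token on the list
rewritten = " ".join(green[:10] + red[:10])   # half the tokens replaced
```

Here `green_fraction(watermarked)` is 1.0 while the half-rewritten text scores 0.5 — in a real rewrite the semantics would be unchanged, but the statistical signal is gone.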

21

BitterAd9531 t1_j5erse4 wrote

Won't work in the long term. OpenAI might have been the first to release, but we know other companies have better LLMs and others will catch up soon. When that happens, models without watermarks will be released, and people who want output without a watermark will use those models.

And even if you somehow force all of them to implement a watermark, it would be trivial to combine outputs of different models to circumvent it. Not to mention that slight rewrites by a human would probably break most watermarks, the same way they break the current GPT detectors.
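As a hypothetical illustration of that combining attack (the function and splitting logic are my own sketch, not an actual tool): interleaving sentences from two differently watermarked outputs means neither model's statistical signature dominates the final text.

```python
import re

def combine(output_a: str, output_b: str) -> str:
    """Interleave sentences from two models' outputs so that no single
    model's watermark dominates the combined text."""
    # Split on whitespace that follows sentence-ending punctuation.
    sents_a = re.split(r"(?<=[.!?])\s+", output_a.strip())
    sents_b = re.split(r"(?<=[.!?])\s+", output_b.strip())
    merged = []
    for i in range(max(len(sents_a), len(sents_b))):
        if i < len(sents_a):
            merged.append(sents_a[i])
        if i < len(sents_b):
            merged.append(sents_b[i])
    return " ".join(merged)

# combine("A one. A two.", "B one. B two.")
# -> "A one. B one. A two. B two."
```

Each model's watermark now only covers half the tokens, which (as with a human rewrite) pushes its detection score toward chance.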

158

BitterAd9531 t1_j57z7zd wrote

The Chinese Room is once again one of those experiments that sounds really good in theory but has no practical use whatsoever. It doesn't matter whether the AI "understands" or not if you can no longer tell the difference.

It's similar to the "feeling emotions vs emulating emotions" or "being conscious vs acting conscious" discussion. As long as we don't have a proper definition for them, much less a way to test them, the difference doesn't matter in practice.

10

BitterAd9531 t1_j41gjo4 wrote

Ah, my bad. I think you could make it a bit clearer in your post, but it's definitely on me for misunderstanding. If the information about the residence is given in the document itself, then it becomes a lot more doable.

I still see quite a few problems, such as the neighbourhood and similar factors influencing the price, which means you'd need an absolutely huge dataset with very detailed features. And even then, I think the accuracy still won't be optimal. Then there's still the issue of scraping competitors' data from their websites, which I doubt is legal.

It really depends on what this will be used for. Want to use it to recommend houses to potential buyers in a certain price range? Absolutely doable, but it seems like complete overkill for an application like that. Want to use it to replace humans whose job it is to give price estimations? Probably not a good idea.

1

BitterAd9531 t1_j411ihw wrote

I'm not even convinced it's possible given the requirements. You're not going to get structured data, just pictures of the outside and inside of the house, I assume. How are you going to reliably estimate livable space, current state, or even the number of rooms when not all rooms might even be properly pictured? You're banking on extracting these features from what I assume to be suboptimal images with high accuracy (very doubtful, tbh) and then estimating the price based on those features, which is useless if the features aren't extracted properly in the first place.
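The compounding-error problem can be made concrete with some hypothetical numbers (these are illustrative, not measured): even if each individual feature is extracted correctly 90% of the time, the chance that every feature feeding the price model is right drops off fast.

```python
# Hypothetical per-feature extraction accuracy; not measured numbers.
per_feature_accuracy = 0.90
n_features = 5  # e.g. rooms, livable space, state, garden, garage

# Assuming roughly independent errors, the probability that all
# features for a given listing are extracted correctly:
p_all_correct = per_feature_accuracy ** n_features
print(round(p_all_correct, 3))  # about 0.59
```

So even a fairly strong vision pipeline would feed the price model fully correct features for only around 6 in 10 listings, and the regression inherits every one of those errors.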

Even if this were possible with high enough accuracy, the dataset you'd need for it has to be absolutely huge. I really don't believe someone can gather enough data in 6 months while simultaneously developing the NN.

And then we're not even talking about the legality of scraping competitors' websites to compare against.

I'm not convinced I could do this in 6 months and I wouldn't do it for that price.

2