hjmb t1_j1v0l07 wrote

For 1:

This seems feasible to implement but easy to circumvent: a single change to the artifact and the hash no longer matches. If you instead store an embedding in some semantic space you might at least be able to say "we generated something a lot like this for someone, once", but that's as good as you will get.
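To illustrate the difference, here is a toy sketch. The bag-of-words cosine similarity below is just a stand-in for a real learned embedding, and the names (`cosine_bow`, the sample strings) are my own invention, not anything from an actual system:

```python
import hashlib
import math
from collections import Counter

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (a crude stand-in for a semantic embedding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

original = "the quick brown fox jumps over the lazy dog"
edited   = "the quick brown fox leaps over the lazy dog"  # one word changed

# Exact hashes diverge completely after a single edit...
print(hashlib.sha256(original.encode()).hexdigest()
      == hashlib.sha256(edited.encode()).hexdigest())  # False

# ...while a similarity score in an embedding space degrades gracefully.
print(round(cosine_bow(original, edited), 2))  # 0.91
```

Of course, a real deployment would use a proper embedding model and approximate nearest-neighbour search over the stored vectors, and "a lot like this" is still only a probabilistic claim.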

A similar idea is to embed watermarks in the artifacts. Stable Diffusion does this by default (I believe using this package), but forks down the line have intentionally removed it.
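For a sense of how such a watermark works, here is a deliberately naive least-significant-bit sketch of my own (the real package, as I understand it, embeds in the frequency domain, which survives compression better; nothing below is its actual API):

```python
def embed_watermark(pixels: list[int], mark: str) -> list[int]:
    """Hide each bit of `mark` in the least significant bit of successive pixel values."""
    bits = [int(b) for ch in mark.encode() for b in f"{ch:08b}"]
    out = pixels[:]
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit
    return out

def extract_watermark(pixels: list[int], length: int) -> str:
    """Read `length` bytes back out of the least significant bits."""
    bits = [p & 1 for p in pixels[: length * 8]]
    chars = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)]
    return bytes(chars).decode()

pixels = list(range(64))             # stand-in for raw image data
marked = embed_watermark(pixels, "SD")
print(extract_watermark(marked, 2))  # SD
```

The fragility is also visible here: a fork only has to delete the call to `embed_watermark` (or re-save the image lossily, for this LSB variant) and the mark is gone.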

Unfortunately this seems about as sensible and feasible as cryptographically signing output in other fields (it serves the same purpose of attribution), and we haven't had much luck persuading people to do that either.

It also doesn't apply to models run locally.

For 2:

The page you link to points out that the researchers themselves do not consider the model fit for purpose (at least on its own), and that they expect this problem to only get more challenging as model sizes increase.

As for someone trying to circumvent the discriminator: if they have access to it, they can iteratively adjust their artifacts until it no longer flags them.

I don't believe either solution is robust against bad actors. I also don't think attribution itself solves the problems caused by human-plausible content generation, but that is almost certainly a perspective thing.

Finally: this is not my field, so if you have corrections to any of the above, please let me know.

hjmb t1_isgaxow wrote

I would be wary - AI approaches tend to give you plausible answers, not true answers. It may also be worth updating your post to make it clear that you're looking for AI solutions to your problem, rather than data-cleaning advice for a dataset you're going to feed into a machine learning system (which is what I inferred).

hjmb t1_isg9v1h wrote

Fuzzy matching will help with the typos, but in my experience the nickname mappings had to be crafted by hand.

If your jurisdiction(s) have accessible company records then you could match on those names to determine which rows are official names. This solves half your problem, as you then just need to match the remaining rows to an accepted official name.

You could also modify Levenshtein distance so that dropping characters is free in an attempt to match full names with shorter names, but this will be computationally expensive.
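Concretely, making deletions from the longer string free turns the question into "how far is the short name from being a subsequence of the full name". A sketch of that modified distance (function name and sample company names are my own, purely for illustration):

```python
def lev_free_drops(full: str, short: str) -> int:
    """Levenshtein distance where deleting characters from `full` costs nothing,
    so an abbreviation can match its full name cheaply."""
    m, n = len(full), len(short)
    # dp[i][j] = cheapest way to align full[:i] with short[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        dp[0][j] = j                      # insertions still cost 1 each
    for i in range(1, m + 1):
        dp[i][0] = 0                      # dropping from `full` is free
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j],                                      # drop full[i-1]: free
                dp[i][j - 1] + 1,                                  # insert short[j-1]
                dp[i - 1][j - 1] + (full[i - 1] != short[j - 1]),  # match / substitute
            )
    return dp[m][n]

print(lev_free_drops("international business machines", "ibm"))  # 0: "ibm" is a subsequence
print(lev_free_drops("acme corporation", "acme co."))            # 1: only the "." is unmatched
```

This is O(m*n) per pair, so comparing every row against every official name is where the expense bites; you would likely need blocking (e.g. on a normalised first token) to keep the number of pairwise comparisons down.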
