Submitted by rajatarya t3_znfgap in MachineLearning

Thanks to everyone who replied to our earlier post requesting pre-launch product feedback! We’re excited to announce that we’ve now publicly launched XetHub, a collaborative storage platform for data management.

I’ve been in the MLOps space for ~10 years, and data is still the hardest open problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a third place like Slack or GDocs.

This is why we built XetHub, a platform that enables teams to treat data like code, using Git.

Unlike Git LFS, XetHub doesn’t just store the files. It uses content-defined chunking and Merkle Trees to dedupe against everything in history, allowing small changes in large files to be stored compactly. Here’s how it works: https://xethub.com/assets/docs/how-xet-deduplication-works
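
To make the chunking idea concrete, here is a minimal, illustrative sketch of content-defined chunking in Rust - not our actual implementation; the window size, mask, and rolling hash are placeholder choices:

```rust
// Illustrative content-defined chunking (placeholder parameters, not XetHub's code).
// Boundaries come from the bytes themselves, so an edit early in a file only
// shifts nearby chunks; identical chunks elsewhere hash the same and dedupe.

const WINDOW: usize = 48;           // rolling-hash window in bytes (assumed)
const MASK: u64 = (1 << 13) - 1;    // ~8 KB average chunk size (assumed)
const MAX_CHUNK: usize = 1 << 16;   // hard cap so chunks never grow unbounded
const BASE: u64 = 257;

/// Split `data` into content-defined chunks, returned as byte ranges.
fn chunk_boundaries(data: &[u8]) -> Vec<std::ops::Range<usize>> {
    // Precompute BASE^WINDOW for removing the byte that falls out of the window.
    let pow = (0..WINDOW).fold(1u64, |acc, _| acc.wrapping_mul(BASE));

    let mut chunks = Vec::new();
    let (mut start, mut hash) = (0usize, 0u64);

    for (i, &b) in data.iter().enumerate() {
        hash = hash.wrapping_mul(BASE).wrapping_add(b as u64);
        if i >= WINDOW {
            // Slide the window: drop the contribution of data[i - WINDOW].
            hash = hash.wrapping_sub((data[i - WINDOW] as u64).wrapping_mul(pow));
        }
        let len = i + 1 - start;
        // Cut a chunk when the hash hits a fixed bit pattern (or at the size cap).
        if len >= WINDOW && ((hash & MASK) == 0 || len >= MAX_CHUNK) {
            chunks.push(start..i + 1);
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(start..data.len());
    }
    chunks
}
```

Because boundaries are derived from content rather than fixed offsets, a small change to a large file produces mostly identical chunks, which is what lets us dedupe against everything in history.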

XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. And we know how painful downloading a huge repository can get, so we built Git-Xet mount—which, in seconds, provides a user-mode filesystem view over the repo.

Today, XetHub works for repositories up to 1 TB, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client, cache, and storage) and our web application is written in Go.

XetHub is available today for Linux & Mac (Windows coming soon) and we’d love for you to try it out!


260

Comments


Retarded_Rhino t1_j0gn93r wrote

Wow, this is stupidly good, congrats! You should share this over on r/rust as well; they'll love it.

48

kkngs t1_j0gv20y wrote

How big can a single file be?

20

rajatarya OP t1_j0gvprn wrote

Great question! There isn't a limit on single file size. Since we chunk each file into blocks, the total size of the file isn't a limiting factor. Right now overall repo sizes can be up to 1TB, but we plan to scale that to 100TB in the next year.

Would love it if you could try out XetHub and share your thoughts after using it a bit.

25

kkngs t1_j0gx2rj wrote

What's your model here? Is this just offered as a cloud-based service? Can we host our own? The case I'm interested in has some challenges with client data, data residency, security, etc.

10

Keepclamand- t1_j0gxj2o wrote

Agree, data is the biggest challenge. I haven't tried XetHub yet but will check it out. Some questions: do you support all data types - txt, image, etc.?

Also, can you track versioning of data?
One big need is to map which model was trained on which version of the data.

Do you have APIs?

7

rajatarya OP t1_j0gxo3b wrote

We are still early in our thinking on the business model - so we would love to hear your thoughts on this.

In general, we are thinking about usage-based pricing based on compute, storage, and transfer.

Right now we offer a cloud-based multi-tenant service. We can also deploy into your cloud environment (VPC) as a single-tenant offering.

I would love to hear more about the use case you are thinking about - please DM me to talk more about it (and to hear more details on single-tenant offering).

13

rajatarya OP t1_j0gyotd wrote

Great questions. Definitely check us out - within 15 minutes of getting started you'll have answers to your questions :)

  1. Do you support all data types?
    Yes, all file types are supported. The level of deduplication we can achieve varies by file type (some formats are already compressed), but nothing is excluded. We have some great example repos with images, text, and other data types.

  2. Can you track versioning of data?
    Yes - since you are just using Git, each commit captures the version of the data (the data is just files in the repo). This way you get Git's full collaboration features along with full reproducibility, plus the added benefit of confidence that the code will work with the data at each commit.

  3. Do you have APIs?
    Not today. Can you tell me what sort of APIs would be interesting to you? We built Xet Mount specifically for use cases where you don't want to download the entire repo - instead you mount it, get a filesystem view over the repo, and stream in the files you want to explore/examine/analyze.

Do check out XetHub - I would love to hear your feedback!

7

rajatarya OP t1_j0gzh2r wrote

Oh, I forgot to mention - yes! Mapping a model to its training data is a key part of reproducibility. 100% agree!

Using XetHub you can _finally_ commit the data, features, models, and metadata all in one place (along with the code), and have full confidence that everything is aligned and working.

4

Liorithiel t1_j0h19at wrote

> finally

I was doing this with git annex for a long time, so it's a bit of a stretch to say it wasn't possible in the past. Kind of a Schmidhuber moment…

Still, nice work with the merkle tree!

2

jakethesnake_ t1_j0h1m7l wrote

Can I put all my data in S3, then use XetHub to manage it?

1

hughperman t1_j0h1m8t wrote

I'd have big concerns around data protection laws, governance, and storage. No on-prem/self-managed option? Data regions and localization? Anyone outside the US region will be doing cross-region transfers, which have GDPR and other considerations.
I'd hoped this was "lakeFS, but with actual diffing". It sounds conceptually very cool, but as far as I can tell, the implementation is not ready for any serious data storage that relates to people.

3

tlklk t1_j0h2eak wrote

Interesting.

Can I store data exclusively remotely? Or do I have to have a local copy of the entire repo?

How does it compare to DVC?

8

rajatarya OP t1_j0h397b wrote

Thank you for sharing your concerns. We offer on-prem/private-cloud as a deployment option, specifically to help address some of these concerns. Meaning, we can deploy a single-tenant instance of XetHub into your cloud environment (VPC) today. That should help teams located in a region that isn't near our current deployment.

For teams that are globally distributed we offer cache clusters to allow for scale out and improved throughput, while minimizing costs.

I would love to hear more about your concerns - we are just getting started, so there's lots more to come in the months ahead!

5

rajatarya OP t1_j0h40s0 wrote

Tell me more about this. Are you looking to push your data to S3 and then have XetHub ingest it automatically from S3? Or would you like to keep your data in S3 and have XetHub work with it in-place?

We are planning on building the first one (automatic ingestion from S3) - it is on our roadmap for 2023.

Since XetHub builds a Merkle tree over the entire repo, we don't actually store the files themselves - instead we store data blocks that are ~16MB chunks of the files. This allows us to transfer data efficiently while still providing fine-grained diffs. It also means the files you store in S3 aren't represented the same way in XetHub, so we cannot manage S3 files in-place; we need to chunk them and build the Merkle tree so we can deduplicate the repo and store it efficiently.
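
For intuition, here's a rough, illustrative sketch of folding block hashes into a Merkle root - assumed details only, not our actual format, and std's DefaultHasher just stands in for a real cryptographic hash:

```rust
// Illustrative Merkle-tree construction over chunk/block hashes.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Combine two child hashes into a parent hash.
fn hash_pair(left: u64, right: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (left, right).hash(&mut h);
    h.finish()
}

/// Fold leaf hashes (one per chunk/block) into a single Merkle root.
fn merkle_root(mut level: Vec<u64>) -> u64 {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => hash_pair(*a, *b),
                [a] => *a, // an odd node is promoted to the next level
                _ => unreachable!(),
            })
            .collect();
    }
    level[0]
}
```

Two repo versions that share most blocks share most subtrees, so only the differing subtrees need to be stored or transferred.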

Why would you want to be responsible for your own S3 buckets and files and then have XetHub manage things from there?

1

rajatarya OP t1_j0h532a wrote

Yes, you can keep the data entirely remote. We built Xet Mount specifically for this - just mount the repo to get a virtual filesystem view over it; we stream the files in the background and on demand. Or you can clone the repo with --no-smudge to get just the pointer files, then choose which files to hydrate (smudge) yourself.

Comparing to DVC, we have a handy feature comparison available here: https://xetdata.com/why-xethub. The short answer is that DVC requires you to register which files it should track, and it deduplicates at the file level by simply storing whole files in a remote location. This means that if 1MB of a 500MB file changed daily, with DVC/Git LFS all 500MB would have to be uploaded/downloaded every day. With XetHub only around ~1MB would move daily.
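
A quick back-of-the-envelope version of that example, using only the numbers above:

```rust
// Monthly transfer for a 500MB file with ~1MB changing per day.
fn main() {
    let file_mb = 500.0;     // whole file size
    let changed_mb = 1.0;    // bytes that actually change each day
    let days = 30.0;

    // File-level dedup (DVC / Git LFS style): any change moves the whole file.
    let file_level = file_mb * days;      // 15000 MB, i.e. ~15 GB per month

    // Chunk-level dedup: only the changed chunks (~the changed bytes) move.
    let chunk_level = changed_mb * days;  // ~30 MB per month

    println!("file-level:  {file_level} MB/month");
    println!("chunk-level: {chunk_level} MB/month");
}
```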

Are you using DVC currently? Would love to hear more about your experience using it and have you try XetHub instead.

9

jakethesnake_ t1_j0h769x wrote

To be honest, I very much doubt we'd ever let a 3rd party manage our data. We have non-sensitive data on S3, and some more sensitive data on-prem. My ideal would be a VCS which either leaves the data in place or works with a dedicated on-prem deployment. For commercial sensitivity and data governance reasons, transferring data to a 3rd party is a non-starter.

I doubt a 3rd party storing a Merkle tree of the data would be acceptable to our partners either. We work with sensitive information.

That being said, XetHub looks useful for me and my team. I particularly like the mounting feature. Our distributed computing system uses Docker images to run jobs, and I currently download the data as needed inside the image... which works but is not efficient. I'd much prefer to mount a data repo. I think this would solve some pain points in our experiments.

I'm off work for the next two weeks, but I'll probably experiment with XetHub in the new year - cool stuff!

3

rajatarya OP t1_j0h7npz wrote

True :) I haven't used `git annex` myself so for me it felt like _finally_ when I could put all parts of the project in one place with XetHub.

How do you like using git annex? Are you working with others on your projects - does git annex help support team collaboration?

Again, appreciate the comment!

3

rajatarya OP t1_j0h8m7z wrote

Great, can't wait to hear your feedback once you've gotten back to work in the new year!

We definitely can do a dedicated (single-tenant) deployment of XetHub. That way your data stays entirely in your environment. It also means you can scale the caching/storage nodes up or down to meet the throughput needs of your workloads.

Yes, we built mount with the data center use case in mind. We have seen distributed GPU clusters sit at 3-5% utilization because they are idle while downloading data. With mount those GPUs get busy right away; we have seen 25% reductions in first-epoch training time.

Small clarification - we store the Merkle tree in the Git repo, in a Git notes database, so that lives with the repo. The only things we store outside the repo are the ~16MB data blocks that represent the files managed by XetHub.

I would also love to hear about the data governance requirements for your company. Those can help us plan what features we need to add to our roadmap. Can you DM me your work email so I can follow up in January?

3

jakethesnake_ t1_j0hch5f wrote

Sounds great, I'll scout out XetHub in more detail when I'm back and DM you. Thanks for the helpful answers :)

re: data governance, we have signed very strict agreements with our clients. They specify where the data resides, who has access to it, and a bunch of other stuff. I'm not involved in those types of talks with clients, but the negotiations took months. A lot of care has been taken to meet these requirements, and adding another site and an unvetted company into the mix is likely going to be tricky. This seems pretty standard for enterprise clients in my experience.

3

Liorithiel t1_j0hehga wrote

> How do you like using git annex? Are you working with others on your projects - does git annex help support team collaboration?

Right now I've got one large 5 TB repository with general media and archives, and some smaller project-specific repos. It's slow with many small files (like, over 1 million), but very easy to set up. Haven't tried collaboration; I've mostly worked on projects where my collaborators were rather less technical. My main use case was working with the same dataset on different computers, and for that it was more than enough.

2

ZorbaTHut t1_j0hofzs wrote

If you can set up a Git client that's user-friendly enough that an artist can use it, you might be able to steal the entire game industry away from Perforce.

You've got half the work done, and, god, we would love to stop using Perforce.

Happy to talk to you if you're interested in learning more about this - there's an entire largely-untapped business opportunity out there :)

28

BossOfTheGame t1_j0i2v7d wrote

Why doesn't it scale past 1TB currently? I have DVC repos that are indexing on the order of 1-2 TB of image data. The data hardly ever changes, and when it does there isn't a big problem in just storing both copies of the image (it would probably be the same in XetHub, because most of the image pixels would be different, depending on the processing level). All we really care about is that it's content-addressable, it has access controls (otherwise we would use IPFS), and you can distribute subsets of the data.

If I tried XetHub on a 2TB dataset would it simply fail?

2

rajatarya OP t1_j0iemk4 wrote

There isn’t a hard limit at 1TB currently; the main thing is that the experience/performance may degrade. The size of the Merkle tree is roughly 1% of the total repo size, so at 1TB even downloading that (~10GB) can take some time. You can definitely use XetHub on a repo past 1TB today - but your mileage may vary (in terms of perf/experience).

To avoid downloading the entire repo you can use Xet Mount today to get a read-only filesystem view of the repo. Or use the --no-smudge flag on clone to simply get pointer files, then call git xet checkout for the files you want to hydrate.

I would love to talk more about the 2TB DVC repos you are using today - and believe they would be well served by XetHub. Something I would be eager to explore. DM me your email if interested and I will follow up.

Thanks for the question!

5

rajatarya OP t1_j0ij9if wrote

Would love to talk more. DM me your work email and I will follow up to set up time. We have heard of this use case, and some of us (myself included) used Perforce ~20 years ago.

One thing I would love to learn more about is the expected overall workflow. Meaning, what do game development teams expect their workflow to be? How does XetHub (or any other tool for code & asset management) fit into that workflow?

8

rajatarya OP t1_j0jx9cw wrote

No specific file limit. Because we scan and chunk the files, the number of files in the repo doesn't matter. But for each file in the repo we leave a pointer file that references the Merkle tree entry for that file.
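
For intuition, a pointer file stays tiny no matter how large the real file is. A hypothetical sketch of the kind of information it carries (not the actual git-xet pointer format):

```rust
// Hypothetical pointer-file contents (illustrative; not git-xet's real format).
// The pointer is what gets committed to Git in place of the file's bytes.
struct PointerFile {
    /// Hash identifying this file's entry in the repo's Merkle tree.
    merkle_hash: String,
    /// Original file size in bytes, so tools can report it without hydrating.
    size_bytes: u64,
}

impl PointerFile {
    /// Render the pointer as the small text blob stored in the repo.
    fn to_text(&self) -> String {
        format!("xet pointer\nhash: {}\nsize: {}\n", self.merkle_hash, self.size_bytes)
    }
}
```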

2

ZorbaTHut t1_j0kosbx wrote

DM sent!

I'm actually not sure how much of game-industry workflow is because that's the workflow we'd want to use, and how much of it is because we're stuck with Perforce and that's the workflow Perforce supports. But in general, it's:

  • You make changes and then check them in to the main branch. Everyone else is also doing the same thing.

If you're doing a release (assuming you're working on a game-as-service that's expected to have rolling releases), then often that gets branched into a Test branch and eventually a Stable branch. Besides that, branches are extremely rare and everyone is basically just working on the main branch all the time.

The reason for a lot of this is artists and designers. I love artists and designers, but complicated tools are just a non-starter for a lot of 'em; hell, there's programmers who have trouble with Git. Perforce's GUI is actually pretty clean and easy to use even for non-technical people and turns out to be a pretty good fit here.

The other thing to note is that any replacement needs to have file locking and Unreal Engine/Unity integration. File locking already exists for Git as a plugin, and Git integration exists on those two platforms as well. But if you were building something that merely kinda looked like Git but wasn't actually Git-backed, you might need to do some work there. Unity's source control plugins can be distributed as .dlls; Unreal Engine probably requires that you submit a pull request to Epic (and contact the company first so you can get it through in less than half a year; they're pretty backed up).

The reason we can't use Git is pure size. I've found it to be a surprisingly accurate rule of thumb that the final build of a game is 10% the size of a clean checkout. So, if your game is 10gb, the raw checkout is 100gb. The X-COM 2 full install was 78gb and it would not surprise me if this means the clean checkout is 780gb.

But that's just the clean checkout of the last version! I'm not sure what all the incremental updates look like, but a lot of that size consists of binary files that traditionally do not diff very well. I'm not sure if your system will be able to make this better, since many of these files weren't designed for it - it might be one of those deals where you move a single vertex on a model and the entire file gets changed because the designers didn't bother with binary stability.

Anyway, if you're including the entire history, it would not surprise me if this regularly gets another order of magnitude, and now the repo as a whole is passing 10tb.

Git in theory now supports doing partial syncs, so it can retrieve data in-flight from the server. Git in theory also supports partial tree syncs so you can choose a set of directories that you have in your filesystem, which would also be necessary.

In practice, I think this entire product looks like:

  • Take a very recent version of Git, make sure all of that works
  • Grab an open-source Git client, make sure it's usable by artists (including those partial sync features listed above!) and doesn't bog down when you're dealing with 10tb+ repos
  • Bundle this together (respecting licenses of course)
  • Go to a few mid-size-but-reputable companies, offer them your product for free if they'll test it for you
  • Fix all the issues they report
  • Start advertising it to studios as a less expensive and less painful Perforce replacement
  • If you've done the work right, you'll have a slow but steady takeoff as people gradually gain trust in this

I have no idea if this ends up being worth the effort for your company, but god, I wish someone would do it :V

2