Josh: Get the advantages of a monorepo with multirepo setups (github.com/esrlabs)
246 points by oftenwrong on July 15, 2021 | 135 comments


IANAMU (I Am Not A Monorepo User), but as far as I understand monorepos and their advantages, I'm not sure what the use-case for this tooling is.

Most of the time, when an org chooses to move to having a monorepo (rather than just being left with one by accident of history), the key advantage they're striving to attain is the ability to make changes to cross-cutting concerns across many distinct applications/libraries with single commits/PRs: to change an API, and all of its internal callers, atomically, without having to worry about symbolically binding the two together with dependency version constraint resolution.

Which is to say, the key advantage of a monorepo comes from having the whole monorepo checked out locally.


When I've used a monorepo, that was one of the explicit goals.

Avoiding "here's my new library version, go see if it breaks your shit" was the goal - you make a change, you run the tests, you see if the whole company's code can still build or not. Having fully-separate projects in directories in the monorepo using published dependencies was considered an antipattern (though it was very hard to keep some teams from doing that).

The disadvantages of the resulting monorepo weren't "these directories are so big to keep checked out when I'm just working on one specific project"; they were "our old build times and build tools are dying under the strain, and even trying to move to a 'monorepo friendly' build tool might be an intractable problem because our dependency graph has become such a mess of spaghetti."

A monorepo that was done well from the start so you don't have the slow-spaghetti-build problem from months or years of "oh it's easy to depend directly on this full other module, let's just do that" sounds very appealing. We just didn't pull it off in practice, and this project would ... maybe... help in the early stages by letting people have more restricted checkouts? But only if you already know what you're doing anyway.


You seem to mix two different goals here. Atomic commits across everything surely are a feature of a monorepo which a manyrepo setup cannot replicate. However, the "test the whole company's code" part could also be achieved with a central package manager and many repos. That is what Amazon does: http://beza1e1.tuxen.de/amazon_manyrepo_builds.html


Yeah, there's two different aspects in there (though without the full-blown testing making broad atomic commits isn't super attractive to me anyway. ;) )

Amazon's Brazil system doesn't sound like it does the "test the whole company's code" aspect in the same way.

> If you create a package at Amazon, you specify an interface version like "1.1". As long as changes are backwards-compatible, the interface version is not changed. When Brazil builds a package, it appends an additional number to turn it into a build version like "1.1.3847523". You can only specify dependencies on interface versions.

So if I'm working on a non-backwards-compatible version, my users aren't going to pull it down immediately in the build (if I properly call it version 1.2.x), so the build won't show which of them it broke.

Even if you depend on the latest version instead of pinning to a major version, how would I force a company-wide build of everything that depended on me using my not-yet-landed branch? I can run those tests in a monorepo on my branch directly, I don't see how I'd do it in a manyrepo setup without a very specific set of "every project in the company specifies its dependencies as depending on specific, easily-overrideable, commit SHA versions."


Actually, to do that, one way is to just mark 1.1 as 1.0 in a test branch, do the test build, and build it against everything. There are other ways to do this as well; it depends on what you're doing.

But it's important to note that Amazon doesn't do monorepos the way you are used to thinking about them, so it is actually on everyone to take the updates on the schedule they've set via automation, or, if it's an emergency, via ticketing/policy. You are merely seeing if it's going to work; the automation does the rest, possibly over weeks.

And I am simplifying things here to make it easier to understand. Understanding Brazil without having used it is about as difficult as understanding git rebase coming from CVS; e.g. there are concepts you didn't even know existed that you have to understand how to use to operate at large scale.


The basic problem with trying to use monorepos to solve the problems of versioned releases is that the code in the repo and the code in production are not the same. If you change your service’s API, just updating all of the other clients of your service in the repo is not sufficient to skirt the problem of different clients expecting different APIs.

Re the tooling issue, I’ve frankly never seen it solved. I presume Google’s done it, because they’re a bit messianic about the whole thing, but outside of them, I suspect all monorepo implementations fit into the “Communism” model of working great in theory.


> If you change your service’s API, just updating all of the other clients of your service in the repo is not sufficient to skirt the problem of different clients expecting different APIs.

Most Google APIs are written in protobuf, which is backwards compatible over the wire. This makes most changes painless, although caveat emptor as things can still get messy as they do in the real world. With something like a Java interface, it doesn't really matter if those APIs change because all the code is built and released together at HEAD.


Backwards compatible from a parsing-this-message-won't-blow-up standpoint but not necessarily from a semantic meaning-of-each-field standpoint if your devs are ... clever ... enough. ;)

That discipline to commit to an API and meaning for each field and not keep tinkering with it in messy ways is an under-valued and under-evangelized aspect.


> With something like a Java interface, it doesn't really matter if those APIs change because all the code is built and released together at HEAD.

Right, this is the thing that winds up following from the monorepo - the monorepo problem goes away if we just release everything, the whole fuckin’ system, into production every time we make any update to master. If that’s your system, hats off to you, I’m not nearly that good of an engineer.


It's actually easier, in many ways -- if you can manage to release continuously. It forces you to stage your code incrementally and you should be able to roll back things easily if you see breakage.


If your services are backwards compatible to talk to each other, then you don't need to release instantly. Making breaking changes is tricky, but you don't need to release everything.


I don't understand the problem. Code structure that listens to dev/prod mode can handle the different APIs. Branches to mark prod vs. dev.

Can you give a more concrete example of the problem? I don't think I'm understanding.


The pitch I often hear with monorepos is “When you change your service/library/whatever, you can simultaneously change all of the clients of that library/service/whatever in one PR”. The problem is a PR is not a release - just because the code in the repo is up to date doesn’t mean the code in production is. So, for example, if your service changes to require an extra parameter in the API, sure, you can update all of the clients of that API to provide that extra parameter, but that still requires them to be built & released as well - just updating the code doesn’t actually fix the deploy/release coordination problem.

Short version, you’re right, coordinating releases with feature flags, branches, etc. solves this problem, but it solves the problem in a multi-repo world too, and a monorepo doesn’t obviate the need for those solutions as many of its proponents seem to suggest.

There are probably legitimate uses for monorepos, but an awful lot of people seem to position them as a silver bullet that lets developers stop having to worry about coordination in a multi-service world, and that’s just not how it works out.


I agree that it doesn't improve things at the service interface level. What I have found is that I often have several internal libraries, and a monorepo ensures that I know if a change to my library breaks another service because they're all in one build. Keeping things consistent across services is very useful in my experience.


The issue comes if you are communicating between different binaries (for example, RPC requests or message handlers). Just because you can change the code atomically doesn't mean you can deploy the client and server atomically.

As someone pointed out though, proto mitigates a lot of this.


IATAOJ (I Am The Author Of Josh) ;)

You are absolutely right about the main motivation of using a monorepo: allowing upstream library maintainers to see downstream usage of their code and make the required downstream changes themselves at the same time they change their libraries.

Also, like you say, the easiest way to get those advantages is to just check out the monorepo locally, so if there are no other reasons preventing you from doing just that, go for it.

However there are a few reasons why this is not always sufficient:

Size: The repo might be so large that cloning it all will make local tools (git CLI, GUIs, ...) slow to use, or in the most extreme case require too much disk space for your machine. To address this there are some git-native tools like partial clone and sparse checkout, so size alone is not really the main issue for us.

History "pollution": Having a lot of somewhat loosely related projects in one tree means a history that shows all the changes. Yes git can filter them, but once again that might be a performance concern, but once again not really the biggest motivation to create a new approach/tool.

Permissions: In some organisations (like the one I work for) it is not possible to give all developers access to all the code, and thus the advantages of a monorepo get lost just by trying to comply with data protection standards. The only solution with native git is to split the repo at legal (not necessarily technical) boundaries and try to coordinate the changes across those, losing most of the benefits described. Josh does not have a full-blown permissions system yet, but the concept certainly allows for it and implementation is work in progress.

Sharing with others (aka, distributed VCS): This is the biggest motivation for using something like Josh. The partial repos are repos in their own right, and all the distributed features of git can be used with them. In a monorepo setup as you describe, the distributed workflow is sacrificed for monorepo advantages. Only developers in the same monorepo see the same sha1s and can easily exchange changes. With Josh, the same library can be part of different monorepos at different organisations, and while the monorepos have different history and therefore sha1s, the “projected” or “partial” library subrepos will have compatible history with identical sha1s. In this way Josh can serve as a bridge between organisations using different repo structures.


This is a really interesting approach. How would you layer CI logic on top of this? Given your example workspace josh file,

    dependencies = :/modules:[
        ::tools/
        ::library1/
    ]
how are the canonical build artifacts for, say, ::library1/ determined, and how are they presented to the workspace?

I understand that the partial repo layering is the key innovation that exists a layer below what I'm talking about, but I'm trying to understand how you can ergonomically layer never-build-twice logic on top of it.


For CI, Josh is only used to determine whether a given commit affects a given workspace. This can be done server side using the Josh GraphQL API. Having the dependencies of workspaces understood by the VCS server (in this case Josh) means that such a query can be executed server side before any git clone/fetch needs to happen, and in our case also before the CI allocates a machine to do the clone/checkout.

What artifacts are to be built inside a given workspace is totally up to the build system(s) and tools that work after the files have been checked out to a working copy, at which point Josh is not involved at all.


The ability to make some changes like that is not the same as doing it always. Most commits are quite localized, and those should not be penalized for the sake of a few cross-cutting ones.


I think there are some potential applications.

For example, you want to have a project in the monorepo, but also share it outside of your organisation, and even have external contributions brought back in.

Or, you are working in a project in the monorepo, but you don't want to (or can't) check out the entire repo. You can still checkout just that project and its deps.

Or, you are working in a polyrepo-using organisation, and you want to experiment with using a monorepo while continuing to allow devs to work in the polyrepos that it is composed from. No big-bang cutover necessary.


Yes, that is exactly the idea: translating between both mono and poly repo, partial sharing with others (distributed development), and gradual adoption without a big bang.


They are very useful when you've got hundreds of engineers working on distinct projects that are loosely connected. Imagine you have a team working on core libraries, a few product platform teams, and finally a team working on a customer-facing feature. The feature team commits to master and a build kicks off (build 1); that build fails for an unrelated issue (say the build node dies) and the build kicks off again (build 2). With multiple layers of libraries supporting the feature team, it's entirely possible that a dependency could have changed and the end result of build 2 would be different from build 1.

When each commit shares a common timeline it is really easy to rebuild build 1 with the exact same dependencies.


Git submodules already solve that problem, though. As does publishing all the core libraries etc. as language-ecosystem packages on a private package namespace or internal corporate package repository, and then resolving/locking the language-package dependencies to specific tags/refs with a lockfile that gets committed to the downstream repo.

These are the "obvious" solutions to this problem, the first ones the average software architect would reach for. What would lead them to ignore these options and choose a monorepo instead, if not for what I mentioned above — the ability to make atomic changes to cross-cutting concerns?


Git submodules have an annoying attribute: double commit

First you commit/push to one repository

Then you update the submodule pointer in the parent repository
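For concreteness, a rough sketch of that two-step dance, assuming a submodule living at a hypothetical path libs/foo:

    # 1) commit and push inside the submodule's own repository
    cd libs/foo
    git commit -am "fix the thing"
    git push

    # 2) then record the new submodule commit in the parent repository
    cd ../..
    git add libs/foo
    git commit -m "bump libs/foo submodule pointer"
    git push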


I've resisted - for too long? - submodules, as I've always heard they can be problematic; that the actual implementation isn't quite what it should be.


>> These are the "obvious" solutions to this problem, the first ones the average software architect would reach for.

They would have been, except the experience working with them was somewhat problematic for those who have been down this path, and the message that many devs have received is to avoid them. I kind of agree with going to submodules, but confidence doesn't seem to have been rebuilt.


From the title I expected it to be a tool for treating multiple separate repos as though they were all just one single monorepo. But from the description in the README, it seems to be for treating subsets of a monorepo as though they were separate repositories.

PS: The title, in case it is changed, is currently “josh: Get the advantages of a monorepo with multirepo setups”


That was what I expected from the description as well. After reading the readme it's not clear to me what problem this is trying to solve, and why this is the solution.


The problem is that large codebases tend to have a huge footprint if you need to clone the whole repo. Git as-is does not allow you to only pull a subset, i.e. specific paths representing a subproject. That's what josh is trying to solve: a "virtual" repo that behaves like a real git repo but behind the scenes seamlessly integrates with the big monorepo.


I believe git provides that functionality through sparse-checkout. You can clone a repository without checking it out, then use sparse-checkout to only pull the paths you want.


sparse-checkout only reduces the number of files copied from the local repo to the working directory. It doesn't affect the amount of downloaded data. For that you need shallow and partial clones (shallow clones give you a subset of history, partial clones give you a subset of the files within that history). Partial clones especially are a relatively new and not heavily used git feature.
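A rough sketch of the difference, with a placeholder URL and illustrative paths:

    # shallow clone: only the latest commit's history is downloaded
    git clone --depth 1 https://example.com/big.git

    # partial (blobless) clone: full history, but file contents are
    # fetched lazily, only for what you actually check out
    git clone --filter=blob:none https://example.com/big.git
    cd big

    # sparse checkout: limit the working tree to a few directories
    git sparse-checkout init --cone
    git sparse-checkout set services/payments libs/common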


Partial clone with the --filter feature seems complicated to use. You need to use a bunch of commands to set it up, and then it looks like you still need to be careful while using it.

I'd dream of something as simple as

   git clone --partial foo/bar https://example.com/some.repo.git
And then everything would work normally.


Your dream came true ;) This is what Josh does:

  git clone https://example.com/some.repo.git:/foo/bar.git
And then everything works normally.


disclaimer: I haven't seen for myself what benefits monorepos actually provide so I don't fully grok them.

This is the kind of talk about monorepos that makes me think they are a bad idea. Why would someone want to maintain a monorepo and then pretend it's not a monorepo? Not only just pretend it isn't, but invest not-insignificant time on the problem of pretending it's not a monorepo?

I am immediately thinking of the horribleness of how some of the (older) javascript frameworks re-invented the back button (and browser history in general) instead of.. ya know, using the browser.


One big advantage of a monorepo is that when you check out the tree you automatically get the versions of all the files that work together (assuming there's some CI!). If you want to refactor an API you can refactor its callers easily and check the whole thing in. Etc.

With each project having its own repo, then you have to track the fact that Foobar 2.2 works with baizo 1.6-1.8 but not more recent versions.

Also conceptually it's easier when you are working with the client and the server at the same time, or the two mobile apps, and so on.

Of course people manage without this when the project has stuff that doesn't fit in a software repo (CAD designs, artwork, etc...there's a reason why that POS Perforce survives, for example). Solidworks has its own proprietary RCS that doesn't work with anything else.

IMHO if the project is relatively small (say <500K LoC) a monorepo is almost always the way to go. But with a big project it breaks down.


It’s basic git that breaks down, not the monorepo model.

And it’s less about LoC, and more about the number of files and how much binary stuff you put in your repo (and how often it changes). Git is really bad when binary data is involved.

Git has the facilities to keep monorepos clicking along (shallow clones and sparse checkout) but they aren’t along the “happy path”


That's why there's git-lfs[1] (large file storage)

Keeps the binaries out of your repo, replacing them with pointers

[1]https://git-lfs.github.com
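A minimal sketch of that workflow (the file patterns are just illustrative):

    git lfs install                  # one-time setup per machine
    git lfs track "*.psd" "*.bin"    # writes the patterns to .gitattributes
    git add .gitattributes assets/mock.psd
    git commit -m "store large assets via LFS pointers"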


And many git repository hosting services, like bitbucket, have limits on how large your repository can be. Sometimes you can upgrade these limits. This often leads to fear of exceeding this limit and can lead to one repository per module.


That’s an interesting claim considering Google, Facebook and Microsoft run monorepos. Heck Apple does too internally although just for the build team (snapshots of each project submitted to them, but it all goes into a mono repo)


I think the parent comment meant that multi-repo breaks down for large project, but made a typo.


Google’s monorepo isn’t complete, excluding various things like Android. Also it isn’t git


>IMHO if the project is relatively small (say <500K LoC) a monorepo is almost always the way to go

A typo? You seem to mean that multi-repo is the way to go :)


It's simple. Monorepos allow (but do not force, nor guarantee) you to make bigger atomic changes to many projects at once.

You can update a library and all the downstream projects in a single commit. There's no race condition or caching problem of pulling an update without pulling/seeing the dependency update. You don't need to wait for dependency artifacts to build and propagate.

You can create a turnkey build script that will build the world from source. You can skip any local artifact storage like Artifactory. You don't need to pull multiple repos in a serial fashion; there are no dependent pulls. You can structure your codebase such that if you pull one commit it has no other dependencies.

The drawback is that Git happens to not make it easy to pull just one folder. Other tools like Perforce make it trivial.


When you have shared resources between two services (such as React components), you have two choices: have a separate shared repo that needs its own versioning (and keeping it in sync with development is a pain if multiple people are working on features that touch both shared components and individual services at once), or have a monorepo where the service and shared components can just be worked on through the same repo.

The same story is true with things like APIs or types where two services need to stay in sync.


> I am immediately thinking of the horribleness of how some of the (older) javascript frameworks re-invented the back button (and browser history in general) instead of.. ya know, using the browser.

They did that because the browser didn't support adding to the history via JavaScript.

But even now that the browser does support adding to the history via Javascript ... is that really just "using the browser"? At some level in many modern web apps back button history is not just the browser. This isn't an ancient thing left behind with old frameworks.


Monorepo is an alternative to having binary dependencies with a registry scheme such as npm or maven (at the organization level). It's essentially only workable with tooling support none of us has (unless you work for Google or one of the few other shops that have said in-house tooling). It isn't a workable approach using stock git or github (but that won't stop people trying nor claiming to the contrary).


I wouldn't want to do it at Google's scale without Google's tooling. But my experience has been that, at the scales I've worked at (no more than a couple million LOC across the organization), the limiting factor isn't source control, it's the build system. Maven, for example, doesn't really understand monorepos, so it can be a bit difficult to figure out how to implement a policy for deciding what needs to be built when, one that's less heavy-handed than "build everything always."


Whereas what I was kind of hoping for was something that works like svn externals.

(No, git submodules are not it.)


I apologise for the title. HN has a short limit for title length, so I came up with my own title. I thought this title did a decent job of presenting, using short language, the main application that the authors gave top-billing in the README.

I am not affiliated with the project.

JOSH claims to be reversible, so it could be used in either direction, which is where the multiple use cases come in. Treating subsets of a repo as their own repo, or treating multiple repos as one. I would say there is some application overlap between this and git submodule/subtree/subrepo and also tools like copybara.


You might want to check out https://github.com/asottile/all-repos :)


Google's tool repo kind of does what you thought this does


Is that for the Android build system only? West is a better one, I think: https://github.com/zephyrproject-rtos/west#basic-usage


No, repo is not tied to AOSP or the Android build system.

As an example, Chromium [0] is a non-AOSP project that also uses repo.

[0] https://chromium.googlesource.com/chromiumos/docs/+/HEAD/dev...


Like some people, I was expecting to find a way to have the advantages of a monorepo while having projects in separate repos. This is something Bloomberg is doing, and it's very cool. Each project is a separate repo, but they have a central integrated "repo" with all the repos, which is the "source of truth", and where code is built and deployed from. You can commit changes in your repo, and then you "release" the code into the integrated repo, which will rebuild all the transitive dependencies and run their tests to make sure everything still works. If anything fails, your release of the code is not merged in the repo. I'm now working with a monorepo, and I much prefer the Bloomberg approach. Cross-repo changes can be made atomically (you update the reference in the integration repo for multiple individual repositories at once), and that is usually the big selling point of a monorepo. And it doesn't have the downsides of the monorepo. The only issue is that it's not very ergonomic, and there isn't a tool to make that easy. But building such a tool is definitely easier than implementing a virtual FS, as has been done at multiple companies.

I'd love it if someone still working there wrote a nice post about that system; it was the first of its kind I saw.


> Each project is a separate repo, but they have a central integrated "repo" with all the repos, which is the "source of truth", and where code is built and deployed from.

That's exactly one of the things josh can already do for you :) Josh's concept of workspaces is precisely this: define your dependencies (no matter where else they originate from in the monorepo) and then check out only those dependencies, along with any code that solely exists in the workspace. Your workspace checkout is effectively the "bloomberg"-style repo setup you described, as you only see your code and the code of your dependencies, but when you push, your changes get added back to the hidden, backing monorepo (the source of truth) where all related and pertinent tests are run, and your change can only be committed if those tests all pass.

Thus, your commit is your "release". Sure, it's not exactly the same workflow, since there's then no difference between a commit and a release (by your definition you don't release after every single commit), but the whole "release this change to everything else in the monorepo" step is touted as one of the benefits: there's no massive integration headache if you have multiple breaking changes which you then need to resolve for everyone else.

source: I work directly with "chrschilling" - who wrote josh


But that only works in one direction, no? So it works if you can develop your single repo, but it doesn't help you when you depend on other projects.

I think the best approach would be to have bidirectional links between the projects (if A needs B, then A has the stable version of B and vice versa). The point in that setup would be that "upstream" projects can notice when they are about to break tests in "downstream" repos and act accordingly.


They do, the repositories define dependencies, so when you change something everything that depends on you is rebuilt and tested. This prevents breaking changes, both for your dependencies and your reverse dependencies, identically to a monorepo.

It's a bit complicated to explain, but it works. That's why I hope they'll make a blog post :)


> I'd love if someone still working there were to write a nice post about that system, it was the first of such a kind I saw.

I don't know how far back you saw the Bloomberg system, but at this point it's basically the same as the Debian system (as in, debian/ subdirectories, .deb files, etc.). Versions of git projects are published as tarballs (source packages). Then sets of published projects are "promoted" and all projects that transitively depend on them are rebuilt and unit tested in a sandbox environment. If that process fails, the promotion fails.

Each source package can use any number of build systems, implementation languages, or project structures.

There's also a legacy subversion monorepo with a monolithic build system that builds on top of that, but it's slowly being phased out.

All that is an integration build including thousands of discrete projects. Those projects typically have additional CI/CD enrollments outside of the integration build system too.


This is what I saw. But I think it would be good to share the experience of a big, mature company with it to show that it's a system that can work, and can be easily added to companies which currently have a multi repo approach.

Also, there are quite a few tools to manage the distributions, and it would be great if they were open sourced. Basically Bloomberg championing their approach, to gain the usual advantages of open source (developer familiarity, cooperating across companies for improvements, and so on)

> Those projects typically have additional CI/CD enrollments outside of the integration build system too.

Another thing to call out is that you can simulate the "promotion", so you can check in your PRs whether your change is going to break any dependency or dependant.


I've come to learn that git monorepos are becoming quite popular. What I don't understand is why people are using git for this kind of workflow. It forces you to actively work against git's design goals and implementation. Which then compels the use of several odd workarounds and kludges to kind of seemingly reassemble a half-baked flavor of subversion. Why not just use a tool designed around monorepos and subtrees? I'm genuinely curious. I assume I'm missing something.


The problem is that the alternatives are much, much worse.

There's nothing about the git/mercurial object models that makes them intrinsically inefficient with monorepos.

What's inefficient is materializing the object database (when cloning) and the working copy (when checking out), when you're only going to need tiny portions of them.

Subversion doesn't have the first problem (but comes with extremely slow history operations), but sparse checkouts don't really solve the second because you have to statically know what to filter.

A better direction is instead to virtualize the filesystem, so you get the semantics of a real monorepo with git/mercurial, and you fetch only what's actually needed without any change to your tooling (it just needs to interact with the filesystem). It's also very easy to transparently implement caching and prefetching this way.

This is the approach that Facebook and Microsoft took, and I believe Google too.


Though it is interesting that Microsoft's own efforts have been moving away from the virtual filesystem approach and back towards making sparse checkouts better/more reliable and better/more reliable support for partial checkouts (of history especially) and better memoization and caching of history reachability information (git commit-graph).
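For the commit-graph piece specifically, the relevant stock-git knobs look roughly like this (all upstream git features; exact defaults vary by git version):

    # precompute and use the commit-graph to speed up history traversal
    git config core.commitGraph true
    git config fetch.writeCommitGraph true   # keep it updated on fetch
    git commit-graph write --reachable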


> It forces you to actively work against git's design goals and implementation.

It works against GitHub's design goals. You could work on multiple, logically distinct projects in the same git repository easily, if you so choose. git was designed for Linux's workflow.


> What I don't understand is why people are using git for this kind of workflow.

People like to use distributed workflows even with monorepos. E.g. chains of commits, branches, rewriting local history, etc.

It's clear that people want a mixture of monorepos with distributed workflows. There are two ways to get there: add distributed flows to a monorepos, or build a monorepo layer over a distributed tool.

Both seem like valid approaches. The market will decide which approach it prefers.


A big chunk of the industry joining in past decade has never seen/used anything but git, so it's the one hammer they have available.


Is that even a bad thing? If people are familiar with git --- its command syntax, its commit model, its collaboration setup --- why not let them keep using this model? Isn't it better for everyone if we put a few hundred people on targeted scalability fixes for git instead of making a few million people drop productive work and learn a new tool?

I mean, most of the people in industry today have known no character encoding except ASCII and its supersets --- and that's a good thing!


Git, while better now, still allows people who are only passingly familiar with it to shoot themselves in the foot.


So? It's the standard. Anything else starts with -1000 points. Something like hg might be better, but not better enough to be worth the cost of breaking uniformity in the industry.


I never get the footgun argument. If you need a gun, you're liable to shoot yourself. Programming in general is more and more accessible to wider audiences, but it hasn't become intrinsically easier.


What are these tools designed for monorepos and subtrees?


Probably the most popular one would be the "new hotness" that everyone used, or aspired to use, before git became popular, Subversion (https://subversion.apache.org/).

There are others that aren't free (e.g. Perforce) and some that aren't quite dedicated to the monorepos & subtree workflow, but which handle it better by design (e.g. Darcs, http://darcs.net/).

But, mostly I'm thinking of Subversion.


haha yeah, no one's going to go back to subversion. The sum of the pain of the SVN problems solved by git is much larger than the slight pain of using git for a monorepo.


The reason I would use a Git monorepo over Subversion is that Subversion is really painful to use. I would rather workarounds and kludges than that tire fire.


That was never my experience with Subversion, but YMMV.


Subversion on the server used with git-svn is bearable. Most of the advantages of git (local commits, rebasing, easy branching and merging) manifest themselves on the client-side.

The problem with using svn on the server is the lack of good tooling for things like code review. There's no SvnHub or SvnLab.


Perforce?


Perforce Helix Core


No other DVCS is as popular. Git has a vast ecosystem of tools. No one wants to drop that.

It's certainly not against the design goals, as official tooling supports shallow and sparse checkouts; they're just experimental features.

We should just strive forward and continue to make these features good.


Isn't git designed around the linux kernel repository, which is a monorepo?

I get the impression that things like LFS, submodules, and subtrees are hacks added onto git to try to make it behave like something it isn't.


I think this is quite exciting, as it solves a major unsolved problem for large git monorepos: enabling development or CI/CD inside a git monorepo without requiring a large checkout.

As monorepos grow huge, this becomes very costly or even prohibitive, and companies like Google simply don't use Git.

Here are some problems with alternative approaches that have been mentioned:

* VFS for Git: I believe abandoned by MSFT in favor of improved client-side tooling: https://github.com/microsoft/VFSForGit/blob/master/docs/faq.... .

* Sparse checkout: limits ability to use a build system to dynamically find any dependencies and rebuild them

* Submodules: can't atomically update both the parent and the child repo, have to manually update the referenced commit of the child repo in the parent repo, and each collaborator must manually update their child repo when the commit changes
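To illustrate that last point, this is roughly what every collaborator has to remember to run after someone bumps a submodule pointer (or they can set the config once so ordinary pulls recurse):

    git pull
    git submodule update --init --recursive

    # or, once per clone, make ordinary pulls update submodules too
    git config submodule.recurse true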


VFS is being replaced in favor of https://github.com/microsoft/scalar


Scalar is "mostly" just a config tool for git sparse checkout of git partial clones with git commit-graph support turned on. All of that is stuff contributed directly into the git client.

Beyond that "mostly", it also configures git lfs, which will likely always be a git plugin and not directly in the client; the rest of it seems like stuff Microsoft is testing before upstreaming directly into the git client.


> CI/CD inside a git monorepo without requiring a large checkout.

this can also be solved by using a git mirror.


Doesn't this just provide another (perhaps more nearby) remote?

You still end up doing some kind of large checkout.


Nope, this brings all the git files and puts them on nearby storage. The remote (as in "git remote") remains the same. You still need to get lots of files, but most of those files are nearby.

(We got our CI checkout time from 40+ minutes to well under half a minute this way.)
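One way to set that up, with illustrative paths and URL, is a bare mirror kept on storage close to the CI runners, plus --reference so new clones reuse its objects:

    # kept up to date by a cron job or similar
    git clone --mirror https://example.com/big.git /srv/git-cache/big.git

    # CI checkout: fetches almost everything from the local cache
    git clone --reference /srv/git-cache/big.git \
        https://example.com/big.git workdir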


How does this compare to partial clones and sparse-checkout? This question was raised as an issue in the project but was closed as "not really an issue", which I guess is true.

https://github.com/esrlabs/josh/issues/23


Technically correct, but so unhelpful. No way I'm using a project that has this kind of community engagement.


Generally, Github Issues are used as a bug tracker, not a community FAQ. Asking a maintainer to compare and contrast their project with another project or git feature seems a bit demanding.


Keyword: generally. Plenty of projects do allow community questions, especially small ones or ones in early stages. Is there anywhere else to ask that question? If there is, it isn't prominently signposted; answering at least "ask this >here<" would be common sense. At a minimum, this issue evidences the need for documentation and should be addressed in some way with more than "I don't have to answer this".


GitHub Discussions is a much better place to handle Q&A: https://docs.github.com/en/discussions


Which didn't exist when that issue was created.


And none of the people in this thread provided even a hint of an answer either...


It's not a trivial question, so it's not that surprising that nobody answers it without prior knowledge of the thing?


There is now an answer to the question in https://github.com/esrlabs/josh/issues/23


I wouldn't even agree that it's not an issue. It's something the README doesn't cover.


A lot of the comments here are surprisingly dismissive. I think having the ability to project parts of your git repo (still with a normal git api!) is an incredibly useful feature. Take the example of DefinitelyTyped: the maintainers can just do all the things they want in one repository which _vastly_ reduces the development overhead, but consumers of that code can use it however they please. If I understand correctly, you could have a submodule reference work out of the box but to a subset of that repo that you care about - seems pretty damn cool to me!


What you are describing is one of the main use cases at ESR Labs (where Josh was created): For developers it is very convenient to work in a single tree. For reviewers and CI it is useful to look at the changes in a larger context. For consumers/integrators, however, it is useful to only look at the parts of the code that have to be shipped to particular customers, as submodules (or the like) in their repos. Plus, a lot of package managers assume library == repo as a default, so it is also easy to integrate with those while keeping monorepo processes for development.


This can be kinda achieved with git submodules.

I’m currently using a monorepo with submodules and it works really well.

Dependency management is not a huge issue either, at least if you don’t have hundreds of submodules.


It cannot - josh can do many arbitrary kinds of (reversible) transformations on the repo, which allows you to have different external "projections" of your monorepo.

Imagine a company that develops strongly interdependent software in a monorepo, but needs to publish different subsets of this software to external entities which also expect a coherent version history.


git filter-branch would fit this use case?


On a basic level, yes, both Josh and git filter-branch do essentially the same thing. The difference being that Josh is much faster not just compared to git filter-branch but also compared to all the other similar tools out there, especially when run repeatedly in the same repo.

Also, being a server, it does not require any installation or resources on the developer's machine.

In addition to that, over time more features were added that git filter-branch does not have, most notably "josh workspaces", which is a DSL for repo transformations.
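For comparison, a one-shot, client-side stock-git rough equivalent of a single Josh projection (subdirectory name is illustrative):

    # rewrite history so that only library1/ remains, once, on the client
    git filter-branch --subdirectory-filter library1 -- --all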


The problem with trying to force one single history is that it ignores deployments. And user onboarding/behavior change. All of which can be relevant when working on a project.

With multi repository projects, this helps some thinking, as it is clear that the changes were not atomic between systems. They are literally separate at all layers, including the commit.

I sympathize with wanting a simpler view. I'm just worried about an inflated value proposition.


> The problem with trying to force one single history, is that it ignores deployments.

I'm not sure what, exactly, the problem is that you're talking about. When you deploy something, it's built from a specific commit.

Deployment is not atomic, it's a gradual process that takes some amount of time. Between when you (or your automation) chooses to deploy a system and when the deployment finishes is some window of time. The system will often spend much of that time in a partially updated state. You may also choose to canary changes, so you will have a mix of different versions in production at any given time. At companies where I've worked, the time from deployment start to finish for backend systems ranges anywhere from hours to weeks.

I don't understand how this relates to multi-repo or mono-repo concerns, however. The repo is a history of the source code (intentional changes by humans), it's not a history of the state of your production systems (which are the results of automation).


Many of the monorepo guides I see are of the "all projects in one repository" kind. If the project doesn't build and deploy as an atomic unit, then having atomic code changes is a foot gun.

It is amazing how many times I've seen folks think that just because they can get build time tests happy with changes in two projects, that they can safely send out the two changes.

Does a multi repo "solve" this? Of course not. But it is easier to reason that two projects clearly need two deploys. Versus having to remember that one commit could be N project deployments.


> ...then having atomic code changes is a foot gun.

Why is it a foot gun? I don't know what the negative consequences are here.

Atomic commits are just to make development easier. It means that you can refactor downstream dependencies in the same commit that you make a change to an upstream library. This way, you either build and deploy the new library + the refactored downstream dependency, or you build and deploy the old version, but never some mix. It reduces the number of possible configurations that can be built & deployed, since you can only pick from a point in one repo's history, and you can't mix and match various points in the history of different repos. With multi-repo, it is harder to discover down-stream dependencies.

> It is amazing how many times I've seen folks think that just because they can get build time tests happy with changes in two projects, that they can safely send out the two changes.

Build-time tests don't catch 100% of errors. Some errors will still make it into production. I don't see how this problem is related to the multi-repo vs mono-repo problem at all. At places where I've worked, if you change project X and project Y, and your commits pass the build test, you still have to submit the commits in some linear order. The pre-commit tests for X will include Y or vice versa. X or Y will be rebased or merged on top of the other one, and the result will have to go through pre-commit tests.

I am just trying to understand what problem multi-repo is supposed to be addressing here, and I don't have the slightest clue what you're getting at.


I call it a foot gun because of how many times I've seen people shoot themselves in the foot.

My assertion is that the "build and deploy new code with updated downstream users" only works in a minority of cases. Now, I grant that this could be due to the micro service nature of where I'm at. And I also grant that it is nice when this can work. However, the times it goes wrong are usually not at all worth the risk.


> I call it a foot gun because of how many times I've seen people shoot themselves in the foot.

I understand what "foot gun" means. Explaining what "foot gun" means is not helpful. What I don't understand is the problem that "shooting yourself in the foot" is a metaphor for.

> My assertion is that the "build and deploy new code with updated downstream users" only works in a minority of cases.

There are two main cases here: libraries and services.

Libraries can be atomically updated no problem, most of the time. You change a function and fix the call sites.

For services, you have to add to the API, deprecate the old thing, refactor the clients, and then remove the old thing after all the clients have been rebuilt and redeployed. At least two steps.

What I don't understand is how this would be different for multirepo or monorepo setups. In either case, removing some old piece of functionality requires waiting until the clients have been redeployed. Using new functionality requires waiting until the server has been redeployed.

> Now, I grant that this could be due to the micro service nature of where I'm at.

The teams where I've used monorepos are also the teams with the most buy-in to microservice architecture. One team I was on ran a service that consisted of something like twenty different microservices. These interacted with services run by other teams. Everything was in the same monorepo. It worked fairly smoothly, as I recall--we spent most of our time solving domain problems and working on our team's core mission, and I don't remember any problems arising from the monorepo setup.

It sounds like your experience is different, and I was hoping that you would share some of that experience.


I have seen people review and ship code that was all consistent in a single commit, that had to be deployed in separate deployments. In the mixed fleet scenario that this leads to, I have seen failures that require gymnastics to fix.

Yes, you can do similar with reviews that span multiple projects. However, I have seen it happen far more times in single repositories than I have in multiple ones.

Again, I am not in any way offering a panacea. I'm just saying that seeing things as atomic at one level leads people to think they are atomic at the next level. And this is a mistake I've seen many many times.

Yes, I have seen people manage it somewhat well. But every effort I have been involved with that tried to merge code history between projects has had more faults of this kind than the other projects I have been on.


Ok, that's a good example. I've just never seen that stuff make it past code review. Teams I've been on, development & deployment are either done by the same team, or they're done by two teams who sit next to each other. Reviewers are also very skeptical of larger changes.

I still don't see advantages to multi-repo here. Even within a single project, or single service, deployment is often not atomic. If you make a change to service X and deploy it, you end up with minimum two versions of service X in deployment until the deployment finishes rolling out.

So we have automated tests that cover that case... each service has to work correctly when combined with not only the repo HEAD, but also with the currently deployed versions (perhaps more than one! in the case I'm thinking of, it was only ever the two previous versions, but other teams had longer horizons) Testing against multiple versions is a bit more expensive, so it's only done as a pre-deployment check, rather than a pre-commit check. You cut a branch for deployment, and when one commit doesn't play nicely with previous versions of the service, you cherry-pick a commit to revert the faulty commit, and run the tests again.


Same situation here. And we have always had tests, too. Hard to force mixed-fleet behavior in most tooling, though. And teams grow, such that relying on code review to catch things is really just best intentions.

Putting it in your face that you are changing two projects is about the best I can offer here. I would cede that this, too, is mainly best intentions. But having as close to 1:1 between code changes, deployment changes, and code reviews at least puts things on mainly equal footing.

That is, when the explanation of why some code can't deploy together is that they are in two projects, and you can see that by them being separate repositories, that feels easier than knowing that two parts of a single repository have to deploy separately.


It just seems like such a miniscule benefit, and you have to deal with the massive headache of a multi-repo setup.

If you don't want people making changes to multiple projects at the same time, it would be trivial to add a pre-submit check to enforce that anyway. There's no need to switch your entire repository layout just to remind people that the code in different folders belongs to different projects.


I will fully cede questionable ROI on this whole exercise. But, I will argue that cuts both directions.

That is, I would not push to move from one form to the other. I do like multiple projects being as independent as I can make them, though.


I might be reading this wrong, but it seems like this creates a second place where a project would need to express dependencies between modules.

You would need to do it for JOSH and whatever your build tool is. You'd probably need some additional git commit hooks to ensure your build tool of choice config is in sync with josh.


> ... this creates a second place where a project would need to express dependencies between modules.

You're right, but from my experience with using josh it really hasn't been a pain point. This, however, is coming from the context of having a build system which (appropriately) checks out all relevant josh workspaces as part of the monorepo verification build, and builds all of those workspaces along with building the relevant parts of the monorepo. Thus, if someone adds a new dependency in a workspace project build, the main monorepo verifier build will stop them from committing that change without also making sure the dependency is added to the workspace file, because the associated workspace build part of the verifier will fail (due to the missing dependency).

Additionally, if you're working in a workspace and you add a new dependency, your own local checkout build will fail due to that missing dependency until you add it to the workspace file (and then get that new dependency). As long as you have builds and tests covering your workspace, it's pretty easy to figure out if you forget to add the dependency to the workspace file.

Lastly, at least from my experience, it's overall been a pretty inconsequential price to pay, as new dependencies aren't added frequently after a project finishes its initial start-up phase.


There are a few (re)solutions in the wind. The latest one that I've known is `west` (part of Zephyr-RTOS project), but I haven't tried yet.

The descriptions there may be wrong (FIXME), but a short list is found here: https://github.com/icy/git_xy#why


This seems to provide the same functionality as using Git Sub Modules[1].

Am I getting the correct impression?

[1] https://git-scm.com/book/en/v2/Git-Tools-Submodules


Assuming I understand what this is doing correctly, it does the reverse of a submodule.

git submodules let you tack a second git repo onto an existing one. For example repoA tracks `repoB@version1234` at path `/foo/bar/baz`.

This, on the other hand, lets you take a monorepo and check out `/go/mysubservice` as a "repo" and treat it as its own repo. Then when you do git pushes etc., it translates the changes into the larger monorepo.


I don’t see what problems this solves, but I do see plenty of problems it could introduce. There are tremendous benefits to sticking with the tried and tested approach supported by git rather than introducing yet more tooling.


They definitely need a long FAQ. I'm sure it's different than submodules somehow, but in what ways/circumstances/purposes, I have no idea.


I have yet to hear anyone give me a coherent explanation of why a monorepo is better.


Before the traditional HN comments "isn't this just ...?", the README already does that for you:

> a blazingly-fast, incremental, and reversible implementation of git history filtering


That looks nice. It must have some major drawbacks? Sounds too good to be true…


It’s definitely more complex than just using a monorepo. This tool that all your code runs through is young and not well supported.


Having been on both sides of it, I can say there exist exactly zero advantages to a monorepo setup.


At our startup, we chose to start with a monorepo. Our team is small, but one of the big advantages we’ve had so far is avoiding the n*m (for n services and m tools) problem with dev tooling - which leads to a very smooth developer experience.

For example, to run one or more services locally, we use a single script that sits at the repo base - ‘dev.sh service1,service2,...’. This avoids a lot of headaches for our developers, as we enforce compliance when adding a project to the repo. Lint config? One to rule them all. Test coverage thresholds? Single one. This consistency is the biggest win in my opinion.
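A hypothetical sketch of such a repo-root runner (the script name, directory layout, and per-service run-local.sh are assumptions, not the actual tooling described above):

    #!/usr/bin/env bash
    # usage: ./dev.sh service1,service2
    set -euo pipefail
    IFS=',' read -ra services <<< "${1:?usage: dev.sh svc1[,svc2,...]}"
    for svc in "${services[@]}"; do
      (cd "services/$svc" && ./run-local.sh) &   # start each listed service locally
    done
    wait   # stay in the foreground until all services exit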

Similarly, our integration tests are very easy to write without commit skew.

Finally, sharing libraries has been painless - since we have common/ and common/third_party/ directories at the monorepo root.


Another benefit (imo one of the biggest ones) is not having to constantly create releases across all your repos and manage their states when jumping around during development.

My previous job used a collection of about 6 repos for different services and such, and it was a constant struggle to ensure the correct versions were used in development - especially if you were working on a bigger feature that wasn't yet released but required "future" versions from other repositories.


Exactly this.

Why would one want to do n pull/merge requests, n separate reviews (of related code), and n deliveries for a single evolution, is something I can't understand.


I have also been on both sides and I'll say that each side has different sets of advantages highly dependent on a host of factors a non exhaustive list of which is:

* Culture

* CI/CD tooling support

* Codebase sizes


Simplicity is always an advantage.


You can have your simplicity one of two ways:

* Devops - one repo, one deploy, let the developers figure out which repo is which.

* Developer - one codebase, let the devops people sort it out and write lots of tooling to make my monorepo work.

If this stuff was easy, everyone would be a developer.


Most of you don’t need a monorepo, the same way most of you don’t need, well, half the shit peddled in the tech Instagram (conferences, meetups, mediums, blogs, hn).

You just don’t need that stuff, there’s like 20 of you on a team and at best your app probably sucks and barely has users, and if it does have users, it’s probably some trivial bullshit.

You’re all a bunch of ordinary folks, so stop fucking up the workplace with your identity crisis. No, you are not an elite engineer, you are Bob, the guy who goes home every day and watches Netflix/plays video games.


Surprisingly, a monorepo is much easier for smaller teams and individuals to work with.

As someone who has been maintaining a React and Django app solo for the past three years, two repositories or more is too much cognitive overhead to work with.

Never doing that again. Monorepos are easier for small apps.


Putting stuff in the same repo is a pragmatic idea. Going all-in on the monorepo of isolated, self-contained, publishable apps/packages is a whole ‘nother thing, along with all the tooling necessary to make it work seamlessly.

You want to put stuff in the same repo, that’s fine. What’s with all the other bullshit?


The nice thing is, you can replace "monorepo" with "multiple repos" in your comment and it's just as believable a statement arguing for avoiding trouble with coordination/packaging etc.


Well, it’s something that needs to be said about a lot of things. Life is a balance, and nowadays in tech I see the pendulum swinging way too far to the other side.



