
Re spacy-transformers: I really wouldn't recommend it. I tried using it, but it was a nightmare. It had a dependency on a previous major version of Thinc (spaCy's NN backend), but the documentation for that version had been removed. I wasted a week trying to deal with it before I gave up and went pure PyTorch.

spaCy v3 seems to have integrated that package's functionality, so I'd go for the nightly release instead of this.



Sorry you lost time on this!

We took a long time to get Thinc documented and stable, because there was a long period where I wasn't sure where I wanted the library to go. The deep learning ecosystem in 2018 was pretty hard to predict, and we didn't want to encourage spaCy users to adopt Thinc for their machine learning code if we weren't sure what its status would be. So we actually never really got Thinc v7 stabilised and documented.

This became a real issue in the previous version of spacy-transformers. It pushed us into a design that really didn't work well: the library wasn't flexible enough, because there was no good way to interact with the transformers at the modelling level.

Pretrained transformers are interesting from an API perspective because you really don't want to put the neural network in a box behind a higher-level API. You can use the intermediate representations in many different ways, so long as you can backprop to them. So you want to expose the neural networking.
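
For illustration, here's a rough PyTorch sketch of that point (the model name and the loss are placeholders, and it assumes a recent transformers version): a downstream head consumes the raw hidden states, and the backward pass reaches the encoder.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Placeholder encoder and task head; "bert-base-uncased" is just an example.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    head = torch.nn.Linear(encoder.config.hidden_size, 2)

    batch = tokenizer(["An example sentence."], return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (batch, seq, width)
    logits = head(hidden[:, 0])                  # e.g. classify from the first token
    loss = logits.sum()                          # stand-in for a real loss
    loss.backward()                              # gradients flow into the encoder too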

Thinc v8 was redesigned and finally documented earlier this year: https://thinc.ai . We now have a clear vision for the library: you can write your models in the library of your choice and easily wrap them in Thinc, so spaCy isn't limited to one particular library. For spaCy's own models, we try to implement them in "pure Thinc" rather than a library like PyTorch or Tensorflow, to keep spaCy itself lightweight (and to stop you from having to juggle competing libraries at the same time).
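
A rough sketch of the wrapping idea, following the pattern in the Thinc docs (the layer sizes here are arbitrary):

    import torch
    from thinc.api import Linear, PyTorchWrapper, chain

    # Any PyTorch module can be dropped into a Thinc pipeline via the wrapper.
    torch_block = torch.nn.Sequential(
        torch.nn.Linear(300, 128),
        torch.nn.ReLU(),
    )
    model = chain(PyTorchWrapper(torch_block), Linear(nO=10))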

So it's not quite true that we removed the docs for Thinc v7: they were never really finished in the first place. And we didn't have a good solution for the things you needed to do in the previous spacy-transformers, which is what prompted the big redesign.


Hey thanks for the super detailed response!

Yeah, I was trying to do something that didn't quite fit the spacy-transformers API at the time. I also got a bit of a headache trying to use Thinc, which I think was just when you did the redesign, so the docs didn't match what I was seeing. I might not have searched enough, though.

I haven't tried it yet, but it seems transformers support got added to spaCy v3 as a first-class feature.

I did gain something from rummaging through the spaCy source, though! NN layers were composed into module-like pieces, then added to a REGISTRY variable through a decorator, so some things could be defined at runtime. It was super elegant.

I nicked that concept for my data preprocessing pipeline. It saved me a lot of time when trying new things.
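
Something like this, for anyone curious. (A toy version of the idea; the names are mine, and spaCy's actual registry is more sophisticated.)

    REGISTRY = {}

    def register(name):
        """Decorator that files a function in the registry under a name."""
        def decorator(func):
            REGISTRY[name] = func
            return func
        return decorator

    @register("strip")
    def strip(texts):
        return [t.strip() for t in texts]

    @register("lowercase")
    def lowercase(texts):
        return [t.lower() for t in texts]

    # The pipeline is now plain data (e.g. loaded from a config file)
    # and gets resolved to functions at runtime.
    steps = ["strip", "lowercase"]
    texts = ["  Hello World  "]
    for step in steps:
        texts = REGISTRY[step](texts)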


No worries, and glad it wasn't a total loss! Yeah the registry solution is something we've been very happy with.


What would I miss if I went all transformers without spaCy? I don't get the idea of a wrapper API through spaCy.

I'd like to be as close as possible to the core transformers API without any intermediate layers. Nothing against spaCy, but looking at Hugging Face's side and all the pre-trained models... it feels like nobody talks about or uses spaCy if they're already using transformers.


I think spaCy offers a lot of things to connect the models to the rest of your application.

spaCy's Doc object is pretty helpful for working with the outputs. For instance, you can iterate over the sentences, then over the entities within each sentence, look at the tokens within them, or get the dependency children of the words in each entity. The Doc object is backed by Cython data structures, so it's more memory-efficient and faster than the Python equivalents you'd likely write yourself.
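
A quick sketch (assuming the small English model has been downloaded):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    for sent in doc.sents:
        for ent in sent.ents:
            print(ent.text, ent.label_)
            for token in ent:
                # Dependency children of each word in the entity
                print(" ", token.text, [child.text for child in token.children])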

I also think our pipeline functionality is a bit more mature than what's in transformers. The transformers pipeline class is relatively new, so I do think our Language object offers a better developer experience.
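
For example, nlp.pipe streams texts through every pipeline component in batches (again assuming a model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    texts = ["First document.", "Second document."]
    for doc in nlp.pipe(texts, batch_size=32):
        print([(ent.text, ent.label_) for ent in doc.ents])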

I think the new training config and improved train command will also be appealing to people, especially with the projects workflow.

The improved transformers support in v3 is very new; it's only just been released in beta form. I do hope people find it useful, but of course no library or solution is ideal for every use-case, so I definitely encourage people to pick the mix of libraries that seems right to them.


Missed this news, thanks! OP, if you want to use spaCy, try v3.

https://explosion.ai/blog/spacy-v3-nightly



