
The problem is that if you train an ML model on a bunch of data that happened to be available in the past, the system will perpetuate the same biases that were inherent in the training data. This leads to real incidents like Google's image classifier categorizing an image of a black man as a "gorilla", etc.
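To make that concrete, here's a minimal sketch (entirely synthetic data and made-up groups, using scikit-learn) of how under-representation in the training set shows up as a per-group accuracy gap:

    # Synthetic sketch only: a classifier trained on data that under-represents
    # one group tends to do worse on that group.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_group(n, shift):
        # Two-feature toy data; 'shift' moves the class boundary per group.
        X = rng.normal(shift, 1.0, size=(n, 2))
        y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
        return X, y

    # Group A dominates the training set; group B is barely represented.
    Xa, ya = make_group(5000, shift=0.0)
    Xb, yb = make_group(50, shift=1.5)
    model = LogisticRegression(max_iter=1000).fit(
        np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

    # Held-out evaluation per group: group B's accuracy ends up close to chance.
    for name, shift in [("group A", 0.0), ("group B", 1.5)]:
        X_test, y_test = make_group(2000, shift)
        print(name, "accuracy:", round(model.score(X_test, y_test), 2))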

Certain words are heavily loaded and are worth just skipping to avoid all the hassle for now.



Btw, the gorilla incident was overblown. Overblown in the sense that people of other races (including whites) were also classified as various hilarious animals.

The gorilla/black-person one was just the most politically charged of the bunch.

(The other potentially politically charged one was a tendency to misclassify people of various levels of body fat as various animals.)

> Certain words are heavily loaded and are worth just skipping to avoid all the hassle for now.

If memory serves right, that was Google's pragmatic solution: if they detected a human in the picture, they 'manually' suppressed the animal classification.

So they lost being able to classify 'Bob and his dog' in return for not accidentally classifying a picture of just Alice as a picture of a seal.
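Roughly what that kind of workaround could look like as a post-processing step (a sketch with made-up label names; Google's actual pipeline isn't public):

    # Hypothetical sketch of the workaround described above, not Google's code:
    # if any human label is detected, drop every animal label from the output.
    HUMAN_LABELS = {"person", "face", "man", "woman", "child"}
    ANIMAL_LABELS = {"dog", "cat", "horse", "gorilla", "monkey", "seal"}

    def filter_labels(labels):
        """labels: list of (name, confidence) pairs from the image classifier."""
        names = {name for name, _ in labels}
        if names & HUMAN_LABELS:
            # A human is in the picture: suppress animal labels, right or wrong.
            return [(n, c) for n, c in labels if n not in ANIMAL_LABELS]
        return labels

    # 'Bob and his dog' loses the dog label; 'just Alice' can no longer be a seal.
    print(filter_labels([("person", 0.98), ("dog", 0.91)]))   # [('person', 0.98)]
    print(filter_labels([("person", 0.97), ("seal", 0.40)]))  # [('person', 0.97)]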


[flagged]


No, not at all.

I see it as little more than GPT-3 having a list of words like "cunt", "fuck" and "shit" and realizing that there is little to be gained by including these words right now, so skipping them makes sense until we figure out some more urgent things.
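In spirit, something like the following sketch (the blocklist entries and the generator interface are placeholders; the real filtering isn't public and is surely more involved):

    # Hypothetical sketch of a crude output blocklist, not the actual
    # GPT-3 / Azure OpenAI filtering.
    import re

    BLOCKED_WORDS = {"cunt", "fuck", "shit"}  # tiny placeholder list

    def contains_blocked_word(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return any(tok in BLOCKED_WORDS for tok in tokens)

    def safe_completion(generate, prompt):
        """'generate' is any callable prompt -> completion (assumed interface)."""
        completion = generate(prompt)
        if contains_blocked_word(completion):
            return "[omitted]"
        return completion

    # Example with a stand-in generator:
    print(safe_completion(lambda p: "well, shit happens", "How did it go?"))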


It’s not censorship; it isn’t muzzling you. Microsoft is choosing not to emit speech on this topic.

It is a deliberate and voluntary omission, not censorship.


Microsoft is censoring itself. Which they are allowed to do.

I am censoring myself, too.


If you insist? I suppose.


[flagged]


>That's just reality.

Except it really isn't. If the data sets used truly represented everyone in the world, that would be a reasonable argument. The point is that right now, the most cheaply available and voluminous data sets online tend to have a whole bunch of examples from western nations and far fewer from other parts of the world, for the simple reason that historically most of the people taking photographs and sticking them on webservers were from those places.

"Reality" doesn't have the same statistical anomalies as these data sets (e.g. there are a hell of a lot more people with brown skin in the world than are included in common training data), so "that's just reality" really isn't a strong argument.

This is a very, very common problem in ML and isn't limited to politically charged words. For example, in some of the earliest attempts at using computer vision to detect tanks in images for military purposes, the photographs with tanks in them all had different lighting than the ones without tanks, so the (super simplistic) model overfit to that lighting bias in the data rather than to the tanks themselves. Unless the data set is truly representative, you'll often get biases in the resultant model.
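A toy version of that failure mode (synthetic features only; 'brightness' stands in for the confounded lighting and 'tank_signal' for the genuine but weaker cue):

    # The training set confounds lighting with the label, so the model keys on
    # brightness; on test data where lighting is unrelated, accuracy collapses.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    def make_data(n, confounded):
        has_tank = rng.integers(0, 2, size=n)
        tank_signal = has_tank + rng.normal(0, 2.0, size=n)  # weak real cue
        if confounded:
            # Training photos: the tank shots all happened to be taken in bright light.
            brightness = has_tank + rng.normal(0, 0.1, size=n)
        else:
            # Deployment: lighting has nothing to do with tanks.
            brightness = rng.normal(0.5, 0.5, size=n)
        return np.column_stack([tank_signal, brightness]), has_tank

    X_train, y_train = make_data(2000, confounded=True)
    X_test, y_test = make_data(2000, confounded=False)

    model = LogisticRegression().fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))  # looks great
    print("test accuracy: ", model.score(X_test, y_test))    # much worse, near chance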

> If you have your own set of politically correct answers hardcoded by a team of blue haired people you're not doing machine learning.

Well, this is just silly. We all know Deepmind has a policy of only allowing green hair dye on campus.



