Mon 13 Jan 2020 10:38PM

Mastodon scraping incident

Nick Sellen

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic, some consider the content public, others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)
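For the robots.txt option, a minimal sketch of what an instance might serve is below. This is hypothetical — the exact paths would need the research mentioned above, checked against the instance's actual routes — but it illustrates the idea of asking crawlers to stay off profile pages and the public timeline:

```
# Hypothetical robots.txt for a Mastodon instance (paths assumed, not verified)
User-agent: *
Disallow: /@*       # profile and toot pages (note: * wildcards are a common
                    # extension, not part of the original robots.txt spec)
Disallow: /users/   # alternate profile URLs
Disallow: /public   # the unauthenticated local/federated timeline
```

Worth noting that robots.txt is purely advisory: well-behaved crawlers (like the researchers in this case, who say they respected it) will comply, but it doesn't technically stop anyone.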



Creature Of The Hill Tue 14 Jan 2020 8:40AM

I post publicly, and therefore expect anyone to be able to read it. If not, there are other tools, other ways of tooting.

However, this is in the spirit of engaging with individuals. Publicly posting means that others can find and engage with me in the same way I have done with them. It feels safe enough, because of the tools afforded to deal with the rare negative interaction. Positives far outweigh the negatives in my n=1 case.

However, scraping feels very different, and quite negative. It actually has me thinking about what I post at the moment. I wrote a tool (pre-backups) to grab all of my toots so I didn't lose media etc... It would feel extremely intrusive/abusive if I used such a tool against another account to grab all their public toots. I know I could scroll and read them, but automation brings a level of potential abuse that makes it feel more uncomfortable.

So I guess I am in favour of dealing with it somehow.

Terms of service seems right, because those signing up should know where the instance stands. But I personally think that should be backed up by disabling public timeline access (v3 dependent). Stopping someone determined will never be possible, but this way we can make it eminently provable that they acted deliberately, and remove the defence of ignorance.

My two-penneth. Interested to see what others think.


mike_hales Tue 14 Jan 2020 10:03AM

Thanks for flagging this. I'm opposed to any actor scraping the entire sphere, for purposes of an analysis that will not be fully returned, mirror-fashion, to the communities whose behaviour traces have been systematically syphoned off . . by an industrial-strength (military-strength?) machine which is not in any way equivalent to the ordinary 'public' access of actual persons to actions of other persons-in-public. In a world with bots (and other asymmetrical real-world surveillance by un-public agents), some defence against this kind of violation of social norms is needed.

I may be missing something here but how do terms of service actually inhibit this kind of practice? Who's gonna sue? Is the fediverse or social.coop really going to take a violator to court? This is something I don't understand in general - so for example, I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

So seems to me, some act-of-fiat is required, a machine-level fix: disable "Allow unauthenticated access" or modify <robots.txt> or whatever. I guess it's robot wars? Outside the law?


Nick Sellen Tue 14 Jan 2020 6:49PM

In this case the researchers have made efforts to comply with terms of service, from the paper:

In the terms of service and privacy policy the gathering and the usage of public available data is never explicitly mentioned, consequently our data collection seems to be complaint with the policy of the instance.

they also said they complied with robots.txt:

we have also respected the limitations imposed by the robots.txt files of the different instances

This type of case seems preventable, if that is desired.

If there was a truly hostile person doing the scraping I would imagine having those things in place would be a better starting position from a legal perspective, not that I know much about that.

I agree with the distinction between ordinary public access by actual people and machine enabled public access, especially when you include the ability to analyse the data with current and future algorithms, which is an explicit aim of theirs:

The usage of this dataset empowers researchers to develop new applications as well as to evaluate different machine learning algorithms and methods on different tasks


Nick Sellen Wed 15 Jan 2020 1:39PM

@mike_hales my comment above this one partly answers your real-world query: in this real-world case, having these things in place would have prevented it (and for hostile cases it would increase the effort required to scrape the content).


mike_hales Wed 15 Jan 2020 2:17PM

I don't get it Nick. Aren't these just documents, protocols? Protocol observers will . . observe them. What effort does it take to not-observe them? And if a document has quote-unquote legal force . . legal force costs a lot of money to mobilise. Freedom under law is very skewed. I truly don't see how such things can be seen as practical defences, for distributed or digital commons, against determined abusers.


Nick Sellen Wed 15 Jan 2020 2:20PM

I agree for determined users, but for these particular ones, they were doing it in good faith that it was permitted and acceptable, and presumably would not have done otherwise.


Bob Haugen Wed 15 Jan 2020 3:02PM


I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

This is not directly about scraping data, it's about open source software licenses. Large companies with legal departments do not knowingly violate licenses. Which is why lots of companies will not use GPL code. I would expect universities might also want to avoid legal liabilities even if nobody is going to sue them.

Won't deter malicious actors... but FB and Goog getting sued by European government agencies for lots of money might put a crimp in their plans...


mike_hales Tue 21 Jan 2020 10:11AM

Just a thought on ‘good faith’. From the analysis in the letter of protest that has been written in the fediverse, it seems clear that the researchers were not acting in good faith at all. Rather, they seemingly acted in a pretty crass, ignorant way, didn’t do what they said they did, and weren’t aware of half the things they should have been, if they were fully literate users. So expectations of good faith were no protection in this case.

In something that’s quite technically complex like this, I might expect dumb ignorance to be a pretty widespread possibility (including in fields of casualised, precarious employment in academia), and expectations of good faith to be no defence against harm. Legal action and compensation after the harm is done isn’t a substitute for defence?

Scholar.social seems to be the act to follow on this?


Nick Sellen Tue 21 Jan 2020 10:41AM

So expectations of good faith were no protection in this case.

Indeed, I was too optimistic about that I think, but I still feel it was perhaps just badly implemented good faith ;)

... but the legal side seemed more successful, in that the dataset got removed from where it was hosted due to the legal basis.


Nathan Schneider Tue 14 Jan 2020 4:48PM

I don't see any problem with scraping of public posts. That's one of the wonderful things about public microblogging; it's a resource for public research that helps us better understand ourselves. If you don't want to be scraped yourself, you can set your posts to private. I think it's within the values of Social.coop to welcome our public data to be available for study.

A colleague of mine has worked on this issue of user perceptions of research quite a bit. Some resources:


