Loomio
Mon 13 Jan 2020 10:38PM

Mastodon scraping incident

NS Nick Sellen Public Seen by 61

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic: some consider the content public, others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)

Thoughts?

NS

Nick Sellen Thu 16 Jan 2020 11:15AM

robots.txt is a static file included in the repo (see https://github.com/tootsuite/mastodon/blob/master/public/robots.txt, or the equivalent for our current version), so it is not configurable within the instance or per user, but we could choose to have our own one to override the default. I didn't manage to find an instance that has customized it though, so it would need some research; maybe a question to #mastoadmins would come up with something.
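As a rough sketch of what an overriding robots.txt might contain, and how a well-behaved crawler would read it, the snippet below uses Python's standard `urllib.robotparser`. The rules, the `ResearchBot` user agent and the example paths are illustrative assumptions, not Mastodon's shipped defaults or anything social.coop has deployed. It only shows what a crawler that chooses to honour robots.txt would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stricter robots.txt (an assumption for illustration,
# not the file shipped with Mastodon).
CUSTOM_ROBOTS_TXT = """\
User-agent: *
Disallow: /users/
Disallow: /api/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://social.coop/about",
        "https://social.coop/users/alice",
        "https://social.coop/api/v1/timelines/public",
    ):
        verdict = "allowed" if is_allowed(CUSTOM_ROBOTS_TXT, "ResearchBot", url) else "disallowed"
        print(f"{url}: {verdict}")
```

A check like this could be run against any candidate file before deploying it, to confirm which paths it really covers.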

D

Django Wed 15 Jan 2020 6:12PM

Just to be clear: opting out of search engine indexing is insufficient to prevent scraping.

NS

Poll Created Wed 15 Jan 2020 7:21PM

Put a Peer Production License on Social.coop tweets Closed Sat 18 Jan 2020 7:02PM

Alongside any technical provisions we add about mass scraping of our data, I propose that we should place a peer production license on our content, restricting reuse to nonprofit and cooperative entities. (Of course, we can offer separate licensing to other entities on an ad hoc basis.)

Using the PPL would also be a way of extending solidarity to the broader co-op movement.

https://wiki.p2pfoundation.net/PeerProductionLicense

Results

Option      % of points   Voters
Yes         90            9    LS N JB NS M DM M D COT
No          10            1    AW
Undecided   0             86   DS ST JD CZ BH F NS SH KT C G AM MSC CCC MC SC PA RB MN JG

10 of 96 votes cast (10% participation)

N

Noah Wed 15 Jan 2020 7:42PM

Yes

Without getting into the broader questions about licensing that Aaron has raised, I think a reasonable amendment here might be something along the lines of, "All toots covered by PPL unless specified otherwise by the user - check their profile"

NS

Nick Sellen Thu 16 Jan 2020 10:44AM

Yes

Sounds like a good experiment with this license. The link above is broken; hopefully this one will work: Peer Production License. I tried reading that page, but it's a bit long and full of dense walls of text :/

AW

Aaron Wolf Fri 17 Jan 2020 6:36PM

No

I have mixed feelings and am open to changing my mind, but I'm skeptical of the PPL. I support co-op solidarity and the intention of the PPL 100%. But I'm critical of discriminatory licenses. I prefer PPL over CC-NC because blanket anti-commerce is even worse. But plain copyleft, CC-BY-SA, would accomplish what I see everyone talking about here: getting anyone doing research to publish the research under free terms we could all access.

FWIW, I would like to mark my posts CC-BY-SA

NS

Nick Sellen Thu 16 Jan 2020 10:59AM

I wanted to explore more what the authorized fetches option is about. The Mastodon 3.0 in-depth blog post gives this explanation (for Secure mode, which I presume is the setting that the toot I read before was referring to):

Secure mode

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.
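To make "authentication via HTTP signatures" a little more concrete, here is a minimal sketch of the string a fetching server would sign and the header it would then send, following the general shape of the draft HTTP Signatures scheme. The host, path, key ID and placeholder signature are made up for illustration. The RSA signing step itself is left out, since it needs a crypto library. The exact set of headers Mastodon expects is an assumption here, not something the blog post states.

```python
from email.utils import formatdate


def build_signing_string(method: str, path: str, host: str, date: str) -> str:
    """Assemble the newline-joined text that the requesting server signs."""
    return "\n".join([
        f"(request-target): {method.lower()} {path}",
        f"host: {host}",
        f"date: {date}",
    ])


def build_signature_header(key_id: str, signature_b64: str) -> str:
    """Format the Signature header value sent along with the request."""
    return (
        f'keyId="{key_id}",'
        'algorithm="rsa-sha256",'
        'headers="(request-target) host date",'
        f'signature="{signature_b64}"'
    )


if __name__ == "__main__":
    # Hypothetical request from one server to another.
    date = formatdate(usegmt=True)
    signing_string = build_signing_string("GET", "/users/alice", "example.social", date)
    print(signing_string)
    # In a real request the signature would be the base64-encoded RSA-SHA256
    # signature of `signing_string`; a placeholder stands in for it here.
    print(build_signature_header("https://other.example/actor#main-key", "PLACEHOLDER"))
```

The key ID points back at an actor on the requesting server, which is what lets the receiving side tell who is asking and apply domain blocks.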

Given we are not on v3.0 yet, maybe we can just wait until then to decide. It might be possible to assess which servers we will not be able to communicate with were the setting on...

M

mike_hales Thu 16 Jan 2020 1:16PM

@Nick Sellen great to have that greater depth, thanku. On that basis I'm happy to switch to a YES vote. Roll on v3.0!

D

Django Mon 20 Jan 2020 3:31PM

Thanks for expanding on this, I had made an assumption about this based on some toots. And as @mike_hales pointed out, more info was needed for the informed decision.

NS

Nathan Schneider Thu 16 Jan 2020 4:57PM

@Nick Sellen sorry about the bad link. Here's a nice article on the PPL.

M

mike_hales Sun 19 Jan 2020 9:55AM

Clear positive vote on Put a Peer Production License on Social.coop tweets. But only 5% turnout. This needs tooting? A second vote? Its own thread? Authorized fetches is heading the same way. These need much more participation?

D

Django Mon 20 Jan 2020 3:32PM

Agreed!

Maybe the official @SocialCoop@social.coop account could announce the polls/discussions to the instance users.

Should we re-roll the 2 polls into 1?

M

mike_hales Mon 20 Jan 2020 11:09PM

@Matthew Cropp or @Matt Noyes or @emi do Would you announce? But @Nathan Schneider @Django need to float the polls again - new thread to lessen confusion?

MN

Matt Noyes Tue 21 Jan 2020 2:29AM

How about this? @Nathan Schneider and @Django combine the polls in one then announce it together, with a toot from the social.coop account as back up. I am happy to encourage people to participate.

NS

Nick Sellen Sun 19 Jan 2020 10:43PM

There is an Open Letter from the Mastodon Community, via https://sunbeam.city/@GwenfarsGarden/103507032332626576 which says they are asking if people want to co-sign.

Interestingly it points out they did not abide by the terms of service, and did not sufficiently anonymize the data. The dataset has been pulled from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R1HKVS due to legal issues.

M

mike_hales Mon 20 Jan 2020 11:11PM

Interesting thorough letter, worth filing. Closed now for signatures.

W

Wooster Wed 22 Jan 2020 8:00AM

Any solution short of a technical measure preventing the actual scraping of posts (such as only permitting friended authenticated users to read your toots) will not stop your toots from being scraped and harvested, along with any identifiable information that is available.

Put succinctly, if you make information on the internet available to people without authentication, it can and will be scraped. Regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else. The researchers and scholars who make their intentions public may make some effort to abide by these guidelines and attempt to redact personal information, but the actors who may be really doing things you'd rather them not with your data will not be so obligated.

Don't post stuff on the internet if you don't want it to be public information. There's no social mechanism that has enough force to prevent others from accessing it in an automated fashion.


If you want to post things on Mastodon that others can read, Secure mode will likely break that capability. Secure mode does not prevent scraping; it merely allows you to see who is doing the scraping, which can be an anonymous user. Either your toots are public or they aren't; authorized fetch doesn't do anything to prevent scraping. Anyone can set up a new Mastodon instance and create an HTTP signature to make authorized fetches. The Fediverse, like Twitter, is not a place for posting anything you wish to be private in some fashion. Either people and machines can read your content, or they can't. No open letters or policies will change that.

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.

NS

Nick Sellen Mon 3 Feb 2020 10:10AM

> The researchers and scholars who make their intentions public may make some effort to abide by these guidelines and attempt to redact personal information, but the actors who may be really doing things you'd rather them not with your data will not be so obligated.

Yup, that's my feeling too. I think we are mostly limited to that first category, but that is still very useful to me (e.g. it was enough to have the dataset pulled from this public Harvard database https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R1HKVS). There is a clear difference to me between having all this data/content in official and public datasets, compared to secret and illegal datasets.

For the second category, the bad actors, we can just make it marginally more difficult (they would actually have to have an account or instance; the details are unclear to me). It seems worth considering the pros and cons and letting the membership decide. But because of the compatibility issues with authorized fetch mode, I voted no on the check, as I'd like that more informed discussion first, plus an assessment of how many of the servers we federate with would be impacted. Maybe over time enough instances will be upgraded that it's not an issue.

> Don't post stuff on the internet if you don't want it to be public information. There's no social mechanism that has enough force to prevent others from accessing it in an automated fashion.

I think it would also be worth making this clearer to people: Mastodon is not a private/secure messaging platform.

M

mike_hales Wed 22 Jan 2020 10:13AM

It's good to have it stated as plainly as this @Wooster thanks.

> if you make information on the internet available to people without authentication, it can and will be scraped. Regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else

As I said earlier my concern in this is not privacy: I personally operate in the fediverse with an awareness that careless talk is as unwise here as it is anywhere. My concern is with the uses that are made of material in the commons. I'm concerned that there should be concerted efforts to build real commons of digital media, and that any analysis that is made on materials in commons should be returned to those who created the materials, and notified to them - as I wrote here. This is a big ask and how to do it is unclear. But potentially this gives us the means of a much more embracing awareness of who 'we' are and what 'we' do. Silicon Valley oligarchs and State security agencies are not the only people with an interest in knowing the shape and dynamics of our behaviour 'in the large'. This is a kind of literacy that's become possible in the past generation, and it's time it was seriously attended to.

One of the things that's most difficult in getting started on this, is that the ethos of commoning is different from, and tangential to, the basic ethos of the web and free software. These latter are built within an anarcho-libertarian culture of autonomism and complete privacy of and control over individual property. This orientation has brought some very powerful tools and technologies, and there are more in the pipeline - open data, mesh networks, open app ecosystems, whatever. But commons are post-propertarian. They're built within a culture of stewarding, curating and enjoying in which all participants have the same access to the same means, under the governance and policing and common aesthetic of them all. Commoning is associationist rather than libertarian and individualist, and the peer-to-peer culture of free software production - a world of protocol-commons - is a space where the two cultures have an awkward coexistence, which is far from resolved.

From the standpoint of building commons, the persistent concern with privacy is a sideshow and maybe a distraction, and the main game is finding ways of policing and ending extraction from commons, and facilitating and mandating return of value to the commons. It's no less urgent (though less of a life-and-death matter) to start focusing this in digital commons, than it is in the wild commons of air, water, energetics and biosphere. Digital data is one of the 'new wildernesses'; cowboys, frontiersmen, gunslingers and homesteaders are out there (where are the posses and deputies? who shot the sheriff?); and so are industrial-scale, robber-baron, clear-felling, cash-cropping, land-grabbing, financial-capital giants. It's the kind of steampunk world Neal Stephenson might write, but it just so happens that we're in it?

PS: I think the open letter - and the stance of scholar.social - is still interesting. The slack, extractive ethos of academia certainly needs attending to. They (we - I used to be one) need to learn new ways of being in, and serving, communities that are not basically running on academic-elite, publish-or-perish, knowledge-commodity rules.

DS

Danyl Strype Sat 4 Apr 2020 10:55PM

@mike_hales

> the basic ethos of the web and free software ... are built within an anarcho-libertarian culture of autonomism and complete privacy of and control over individual property.

I'm sorry Mike but this is a myth, one that sows confusion and division within the digital commons movements (and IMHO was crafted to do so). People like Adam Curtis and Fred Turner who propagate this just-so story about the origins of personal computing and the net are either confused about the history, or being knowingly deceptive.

If you read the founding documents of the GNU Project and the FSF, it's very clear that the motive is to create a software commons, to protect people's ability to share and cooperate in their use of computers. TBL's earliest descriptions of the web were about the benefits of bringing documents out of the individual silos on people's computers, and similarly sharing them as a commons where everyone can build on each other's work. Same with other early web media projects like Indymedia and Wikipedia. Even EFF founder JPB's 'Declaration of the Independence of Cyberspace', often referenced as the canonical example of this perceived Silicon Valley Randianism, says (emphasis mine):

"It is an act of nature and it grows itself through our collective actions."

"Your legal concepts of property, expression, identity, movement, and context do not apply to us."

The rugged individualist "libertarian" discourse came later, after the invention of HTTPS allowed for "e-commerce", making the net interesting to capitalists. This accompanied (perhaps even led to) the privatization of much of the internet's infrastructure, such as the commercialization of the DNS system, and the rise of silos like Farcebook that use web browsers as a universal UI, but don't respect web standards as an open, shared platform.

Fudging together the hacker ethos represented by Stallman, Berners-Lee, and Barlow, with the corporate apologism of Silicon Valley, is not only wrongheaded, but quite frankly it's deeply insulting to those of us who carry the torch of the former, and utterly repudiate the latter.

AW

Aaron Wolf Sun 5 Apr 2020 12:07AM

Spot on Danyl. I was at LibrePlanet 2015 where an audience member was questioning Richard Stallman about how we should trust the "government take over of the internet" with the net-neutrality FCC stuff. Stallman answered like this (my recollection, video is available if I want it perfect, but it's not important):

> Entities that I trust have told me this approach to net-neutrality through Title II is overall positive. And pointing out the problems with one regulation to oppose all regulation is nonsense. It's like saying "there was a BAD law, so therefore we should not have laws." Oh, and maybe there's some misunderstanding, that people think I'm an anarchist. But I'm not, I think we need governments for many important things. In fact, I have a PRO-STATE gland!

A primary motivation that RMS had in founding the free software movement was to have the sort of community collective that he experienced at the MIT AI lab. He saw proprietary software as undermining a sharing, collaborative society. And his politics are basically Green Party views, quite different from libertarians. And many others in the movement share those views, though not exclusively.

Here's what RMS says about so-called-libertarians: https://stallman.org/glossary.html#anti

M

mike_hales Sun 5 Apr 2020 7:13AM

Thanks @Aaron Wolf @Danyl Strype. It’s good to get this affirmation. Regarding the myth: it’s not that I’ve read this in malicious narratives, which Strypey identifies. It’s something I’ve observed. As a relative outsider to hacker culture, and latecomer, what I do see is a whole lot of libertarian ethos. But yes, the framing as cultural commons is utterly - well, sufficiently - different, and entirely welcome. The commitment of commoning is a deeply transformative one, beyond state and market, beyond consumerist individualism and supremacy of any kind. So it’s good to see these affirmed as also being deep threads running in the FOSS (or should I say opensource?) world.

I say ‘also’ because both are de facto presences in the now massive forces of code production and use, and it’s a struggle. Origin myths - “back in the day, in the unwalled garden, it was like this” - are comforting, and it’s important to have them as counter-stories. But they don’t change the present reality, which is that it truly is a struggle to claim the ground for the commons (and not reclaim, since this developed ground of internet and platforms and data oligarchy that exists today never has yet been in the commons?)

DS

Danyl Strype Sun 5 Apr 2020 7:56AM

@mikeh8

> FOSS (or should I say opensource?)

Up to you, but FWIW Stallman prefers "free software" or "software freedom". I like to use "open source" to describe the development methodology, and say "free code" to describe the outputs.

> The commitment of commoning is a deeply transformative one, beyond state and market, beyond consumerist individualism and supremacy of any kind.

I agree. Free code, and open source practice, were designed from first principles to be a commons approach to software development. Proprietary software is the market approach. I'm not aware of a state-driven approach.

> “back in the day, in the unwalled garden, it was like this”

My point is that the digital commons never went away. It's grown continuously since the GNU Project was founded. If it hadn't, neither Loomio nor the fediverse would exist. The Silicon Valley anti-socialist ideology is parasitic on that commons. It's an artifact of the VC parasitism on the goodwill associated with "open source", as is the promotion of "source available" proprietary licenses, see: https://mjg59.dreamwidth.org/52907.html

AW

Aaron Wolf Sun 5 Apr 2020 4:03PM

I use the term FLO as in Free/Libre/Open because it's all those aspects (and more). See https://wiki.snowdrift.coop/about/free-libre-open

The fact is that people have gotten enough real-world experience to see that the vision is possible. Wikipedia is probably the best example in being uncompromising, completely FLO, community-run, public-facing. It has its problems, nothing is perfect. But it is a proof-of-concept.

The antisocialists (to use Stallman's term, which I like) certainly exist and many are indeed drawn to FLO tech because it doesn't directly conflict with their ideology. There are then many heated debates within FLO between pro-social parts of the movement and the "libertarian" and pro-corporate parts. It's not just FUD, these issues are real, and I've encountered them too.

My platform co-op (still working toward launch) exists specifically to address these things. The most supported FLO is that which serves corporate ends. We need to solve coordination problems and cooperate in order to fund public-focused, downstream public goods. That's the mission of Snowdrift.coop and I would greatly welcome and appreciate your participation, feedback, questions etc. We have a thorough wiki and our own forum etc. And we have done the research on the whole space.

FLO really did start from pro-social foundations, but once Open Source process showed enough dramatic success, it got co-opted. I think this is the best overview: https://mako.cc/copyrighteous/libreplanet-2018-keynote

These are deeply serious political challenges. Snowdrift.coop aims to address coordination around funding but many other elements are needed for the movement to succeed. It is dire right now, not a success yet.

DS

Danyl Strype Sat 4 Apr 2020 9:28AM

Last year, the journal 'Information, Communication & Society' published a special issue called 'Locked Out', critically examining the ways the walled garden nature of corporate social media platforms has accelerated the problems of online mis/disinformation:

https://www.tandfonline.com/toc/rics20/22/11

One of the issues they raise is the way those platforms are protecting themselves from accountability by preventing researchers from accessing data, using "privacy" as an excuse. It seems to me that the fediverse community is getting sucked into a manufactured moral panic around this, mistaking privatization for privacy.

Scraping of every piece of public-facing work on the web is totally normal. It's how all search engines work. It's how the Wayback Machine works. What's the difference between scraping the public discussions on public-facing Discourse forums for a search engine index, and scraping the public-facing discussions on the fediverse (or any social media platform) for discursive research? I'm reminded of the debates in the EU about 'Freedom of panorama'.

If you don't want your statements to be recorded for posterity, say them privately. AFAICT it's as simple as that. Lots of people seem to think every conversation is a nail because they like their Mastodon hammer so much. But there are plenty of free code tools that are much better suited to private discussions, even within the fediverse: Diaspora, the Zot apps (Hubzilla and Zap) etc.

AU

Ana Ulin Sun 5 Apr 2020 8:53PM

The comparison with search engines is a good one: One expects a search engine to respect robots.txt directives. Crawlers that systematically disregard robots.txt directives to disallow and noindex typically get their IPs and User-Agents blacklisted. It is not a matter of what is technically possible (one can't technically enforce robots.txt directives, if the pages are still accessible on the web), but what is accepted as good etiquette and good-will behavior.

Similarly, the fediverse has developed an etiquette around respecting #nobots, and thus it is a reasonable expectation on a Mastodon instance to have that be respected.
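As a rough illustration of what honouring that etiquette could look like in a research script, the sketch below drops any account whose profile text carries the #nobots tag before anything is collected. The account records, their `handle` and `bio` keys, and the idea that the tag sits in the profile bio are assumptions made for the example, not a description of any real tool or of the Mastodon API.

```python
import re

# Matches "#nobots" as a whole hashtag, case-insensitively.
NOBOTS_TAG = re.compile(r"#nobots(?!\w)", re.IGNORECASE)


def has_opted_out(bio: str) -> bool:
    """Return True if the profile text contains a #nobots tag."""
    return bool(NOBOTS_TAG.search(bio))


def accounts_to_collect(accounts):
    """Keep only the handles of accounts that have not opted out."""
    return [a["handle"] for a in accounts if not has_opted_out(a["bio"])]


if __name__ == "__main__":
    # Hypothetical account records.
    sample = [
        {"handle": "alice@example.social", "bio": "Gardener and co-op member. #nobots"},
        {"handle": "bob@example.social", "bio": "I toot about bread."},
    ]
    print(accounts_to_collect(sample))
```

Like robots.txt, a filter of this kind only works when the person running the script chooses to apply it.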

Yes, everyone posting publicly should be aware that anyone can see their toots and those could get scraped, screenshotted or whathaveyou. But that does not mean that posting publicly gives anyone the license to aggregate and re-use my content without my consent, anymore than the fact that postcards are open gives the postman permission to make copies of all the ones I receive and post them in the local paper.

DS

Danyl Strype Mon 6 Apr 2020 11:24PM

> The comparison with search engines is a good one: One expects a search engine to respect robots.txt directives.

The researchers concerned did that.

> But that does not mean that posting publicly gives anyone the license to aggregate and re-use my content without my consent, anymore than the fact that postcards are open gives the postman permission to make copies of all the ones I receive and post them in the local paper.

A DM is equivalent to a postcard. A public post is equivalent to a poster on a public wall. You can send a letter to the editor and get grumpy with people for archiving copies of the newspaper it's published in, or using them for research purposes. If you don't want your posts treated as published works, you can post them as DMs, even in group conversations where everyone else is posting publicly.

There are a plethora of tools for private social messaging, even within the fediverse. Diaspora was one of the earlier examples, allowing users to give access permission to only one person (like a DM), some people (like group DMs), a group of people defined by the posting user ("aspects"), or everyone (public). Friendica does private messages with DFRN and Diaspora protocols, and maybe now ActivityPub? Hubzilla and now Zap have been doing federation of private content with Zot, later AP.

You can argue with technical reality all you want, but there's no changing the fact that if you walk around with no clothes on, everyone can see your junk. The solution is to get dressed, not waste your time and other people's trying to control how other people's eyes work.