Policy on scraping and automation?
I recently came across this exchange:
I feel like fastfinge has a point here. I don't want people to start thinking they need to defederate from social.coop, so a timely response from the official account seems warranted, even if no decision can be made immediately on how to handle this situation. Has something like this come up before?
If not, I think setting some sort of policy in our CoC would be important to do.
Joseph Andriano Mon 21 Nov 2022 12:47PM
I don’t like scraping. I don’t think we should encourage it. But it doesn’t appear to be against our CoC (so long as the bot didn’t follow anyone, which is what #nobot indicates). In addition, the “I would have to suggest that instances defederate you” is a threat, and an empty one at that. I think this opens the question of whether we should create a scraping policy, but I don’t think it requires an official response beyond evaluating whether to adopt an anti-scraping policy to address this in the future.
jonny Wed 23 Nov 2022 11:18PM
I think we should consider an opt-in scraping policy; here are some subtleties to open discussion:
what would count as an opt-in? one pole would be explicit affirmative consent, but is there an argument to be made that allowing another account to follow you is an implicit opt-in, maybe just for recent posts, since that's basically what the home feed is?
bots (continuous scraping) vs. single-purpose scraping: is there a distinction to be made between the nobot tag for opting out of continuous scraping vs. a single-time scrape like this?
stored vs. ephemeral: arguably every app/client does some level of scraping if it caches posts. it seems like caching posts temporarily vs. storing permanently should matter
personal use vs. redistribution: I think personal use is really different from making an archive of posts available publicly - like if OP in this case released the training data that would be messed up, but if they just keep it for themselves and destroy it after the project is done it seems more fine to me.
I think there is some value in being able to do some limited scraping-like things, for example designing new interfaces to the fediverse, which would require downloading and at least temporarily storing some posts, potentially en masse (and hopefully only from ppl you follow). That also makes me think a blanket ban wouldn't really work, because we'd all technically be in violation of it all the time. But I do think we should have some language prohibiting the "worst" kind of scraping that indiscriminately scrapes, stores, distributes, etc. from a large number of people, wherever that delineation lies across these axes.
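As an illustration of the opt-out side of this discussion, the #nobot convention can be honored with a simple pre-scrape check. This is a minimal sketch, assuming the account's bio text has already been fetched (e.g. via Mastodon's public account API); the helper name is hypothetical:

```python
import re

def respects_nobot(bio_text: str) -> bool:
    """Return True if scraping this account is permissible under the
    #nobot convention, i.e. the profile bio does NOT contain #nobot.

    Matching is case-insensitive and requires a word boundary, so an
    unrelated tag like #nobotany is not treated as an opt-out.
    """
    return re.search(r"#nobot\b", bio_text, re.IGNORECASE) is None
```

A scraper would call this once per account before collecting any posts, and skip the account entirely when it returns False.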
Mica Fisher Thu 1 Dec 2022 5:57AM
Hi @Ana Ulin, I'm the mod on duty, thanks for flagging this on Mastodon! I don't think any action needs to be taken right away, but I do think this is worth addressing in our CoC. @jonny would you be willing to write a proposal? It sounds like you're most of the way there!
Billy Smith Thu 1 Dec 2022 10:44AM
From a technical standpoint, it's an interesting project, but from a community/social standpoint, it's a fail.
Scraping is a tool used by spammers.
Ban it in the T&C's of the instance.
And @jsit needs a refresh about Informed Consent in Research Ethics.
J. Nathan Matias Thu 1 Dec 2022 1:17PM
Hello! This is an important issue, and I'm glad to see it discussed here. As a co-founder of the Coalition for Independent Tech Research, I work on the right to ethically study the role of technology in society, including research on issues of consent and refusal of data collection. The challenge is to prevent efforts that are careless, risky, and exploitative, while also enabling data access that improves the health of the fediverse and this instance. Setting aside the pros and cons of this specific case, a policy on scraping/consent would need to work for:
situations where online harassers refuse consent to data analysis that helps with moderation (in the UK, harassers have used GDPR and related UK regulations to try to evade moderation and restraining orders)
situations where asking for individual consent would be burdensome to people by creating too much spam
people who aren't academics being able to contribute to the fediverse through data analysis
I can imagine a number of options working for the fediverse, including community consent (which my lab does with reddit and Wikipedia communities), or a network of instances that review and approve research studies (something the Wikimedia Foundation has discussed in the past).
Nathan TeBlunthuis Mon 19 Dec 2022 5:33AM
Maybe one way to handle the policy considerations @J. Nathan Matias raises would be to prohibit scraping via #nobot or some other mechanism and to provide periodic data dumps instead. Such dumps could offer data of higher quality and ease of use than scraping, and we could control them to ensure regulatory compliance and that deleted toots are removed. We could possibly exclude users who wish to opt out from public datasets, but not from those released to trusted researchers with a research interest that could help us. If we build a Mastodon extension for this purpose, that could help our carefully considered practices become broadly adopted.
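A minimal sketch of the dump-filtering idea above: given posts from the instance, drop those from opted-out users and those that were deleted before publishing a dump. The field names, the opt-out set, and the function name are all hypothetical; a real implementation would read from the instance database:

```python
def build_public_dump(posts, opted_out, deleted_ids):
    """Filter posts for a public data dump: exclude posts by users who
    opted out and posts that have been deleted, so the dump honors
    both opt-outs and deletions."""
    return [
        p for p in posts
        if p["author"] not in opted_out and p["id"] not in deleted_ids
    ]
```

The same filter could run with a different (smaller) opt-out set for dumps released only to vetted researchers, matching the two-tier idea above.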
Ed Summers Mon 21 Nov 2022 10:55AM
There has been some discussion and work on this before: https://www.loomio.com/p/spnFIuOg/add-a-nobot-provision-to-the-federation-abuse-policy which resulted in a clause being added to the Federation Abuse Policy. If this user ignored #nobot when training the model, I think that would be grounds for a block? But it's hard to know what they did :-( I wonder if there are log analysis tools to monitor traffic from particular IP addresses that we should consider using, to be more proactive about this?
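As a sketch of the kind of log analysis mentioned above, this counts requests per client IP in standard web-server access logs (where the client IP is the first whitespace-delimited field, as in nginx/Apache combined format) and flags the heaviest clients. The threshold is an arbitrary assumption; real scraper detection would also look at paths and timing:

```python
from collections import Counter

def heavy_clients(log_lines, threshold=100):
    """Count requests per client IP and return (ip, count) pairs for
    IPs whose request count meets the threshold, most active first.
    Assumes the client IP is the first field of each log line."""
    counts = Counter(
        line.split(None, 1)[0] for line in log_lines if line.strip()
    )
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

Run periodically over recent logs, this would surface candidate scrapers for a moderator to review rather than blocking anything automatically.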