Add pull to Diaspora's push model in federation
I had some ideas a while ago about improving communication between pods in instances where it currently falls down, but didn't know enough about how federation works to be able to flesh them out. Now, having read Fla's blog post about federation and gained a better understanding of how it works, I've refined those ideas.
Just for clarity, this is only a speculative concept. I understand the technical issues only poorly, and so my suggestions as I've presented them may not be workable. However, I hope that, even if this proves to be the case, my suggestions will spark ideas in those of you who understand the technical side of Diaspora which might help to improve Diaspora's federation protocols.
At the moment Diaspora relies solely, or almost solely, on pushing data from one pod to another. This means that if a pod does not receive data when it is pushed, there is no way for that pod to retrieve these data at a later time. I suggest that if we're going to keep Diaspora working on a push model, we supplement this by enabling pods to pull data under certain circumstances.
New pods
Pods only receive data from pods with which they have an established connection. Currently, these connections are built up by users making connections with users on other pods, and this takes time. I suggest putting in place an automatic means of establishing connections with other pods as soon as a pod goes online, so that by the time users start using the pod, these connections are already in place.
I suggest putting in place a sort of 'handshake' system.
The process would work something like this:
1. Podmin sets up Pod Z and puts it online. Pod Z knows about Pod A.
2. Pod Z contacts Pod A and asks: 'Hi, which pods do you know about?'
3. Pod A gives Pod Z a list of the pods it knows about.
4. Pod Z adds each of these pods to its knowledge base.
5. Pod Z contacts each of these pods and asks the same question as in step 2.
6. This process is repeated until Pod Z is no longer learning about any new pods.
This way the new pod would very quickly build connections with the whole network (a rough sketch of this crawl follows below).
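To make those steps concrete, here is a minimal sketch in Python of the crawl a new pod could perform. Everything specific here is an assumption for illustration only: the `/api/known_pods` endpoint, the `get_known_pods()` helper and the JSON shape do not exist in Diaspora today.

```python
import requests
from collections import deque

def get_known_pods(pod_url: str) -> list[str]:
    """Ask a pod which other pods it knows about.
    The /api/known_pods endpoint is hypothetical -- Diaspora has no such API."""
    resp = requests.get(f"{pod_url}/api/known_pods", timeout=10)
    resp.raise_for_status()
    return resp.json()["pods"]

def discover_pods(seed_pod: str) -> set[str]:
    """Breadth-first crawl: ask every pod we learn about which pods it knows,
    until no new pods turn up."""
    known = {seed_pod}
    to_visit = deque([seed_pod])
    while to_visit:
        pod = to_visit.popleft()
        try:
            neighbours = get_known_pods(pod)   # 'Hi, which pods do you know about?'
        except requests.RequestException:
            continue                           # pod unreachable; skip it for now
        for neighbour in neighbours:
            if neighbour not in known:
                known.add(neighbour)
                to_visit.append(neighbour)
    return known

# e.g. known_pods = discover_pods("https://joindiaspora.com")
```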
Of course, there needs to be some means of establishing the first pod to contact (Pod A). This could be prompted by going to the pod of whichever account new accounts are set to auto-follow on that pod (currently the Diaspora HQ account, which is located on joindiaspora.com). Alternatively a list of a few key pods could be kept on diasporafoundation.org (not as a web page visible to visitors, but somewhere from which pods can FTP the data), or the pod could get the information from a site such as podupti.me, which is frequently updated.
One possible way of doing this would be to automatically create 'bot' accounts on each pod which communicate with each other via the above protocol. I'm calling them 'pod-spiders'. If Pod Z knows about Pod A, Pod Z's pod-spider adds Pod A's pod-spider to its aspects in order to contact it, and so on. I'm sure the inter-pod communication could be done without setting up bot accounts, and that might be a better way to do it. As much as anything, the 'pod-spider' concept is a visual aid.
Tags
As tags are not federated, you could also have each pod-spider account follow all the tags that users on its pod follow or search for. (This could involve only tags that have been searched more than 5 times or are followed by more than 5 people, to filter out spelling mistakes.) When Pod Z goes online, its pod-spider can also ask each pod it contacts 'which tags do you know about?' and can then follow those tags itself. In this way, it might be possible to populate tag searches from the time the pod goes online.
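As a rough illustration of how a pod might pick the tags its pod-spider follows and then swap tag lists with other pods, here is a sketch under the same caveats as above: the `/api/followed_tags` endpoint and the helper names are invented, and the 5-search/5-follower thresholds are simply the ones suggested in the previous paragraph.

```python
import requests

def get_followed_tags(pod_url: str) -> list[str]:
    """'Which tags do you know about?' -- again a hypothetical endpoint."""
    resp = requests.get(f"{pod_url}/api/followed_tags", timeout=10)
    resp.raise_for_status()
    return resp.json()["tags"]

def tags_worth_following(tag_stats: dict[str, dict[str, int]],
                         min_searches: int = 5,
                         min_followers: int = 5) -> set[str]:
    """Keep only tags searched more than min_searches times or followed by more
    than min_followers local users, to filter out one-off spelling mistakes."""
    return {
        tag for tag, stats in tag_stats.items()
        if stats.get("searches", 0) > min_searches
        or stats.get("followers", 0) > min_followers
    }

def merge_remote_tags(local_tags: set[str], remote_pods: set[str]) -> set[str]:
    """Let the pod-spider pick up every tag the pods it contacts already follow."""
    for pod in remote_pods:
        try:
            local_tags |= set(get_followed_tags(pod))
        except requests.RequestException:
            continue   # unreachable pod; skip it
    return local_tags
```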
Alternatively, when a user searches for a tag which is not currently in that pod's database, the pod can pull the data on that tag from all the pods it is connected to. That way, the first time a tag search is done on that pod, it is done by a pull, which would take longer but at least would get the data. After that, data relating to that tag can be pushed to the pod in the usual way.
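A sketch of that pull-on-first-search behaviour might look like the following; the `/api/tags/<tag>/posts` endpoint and the `TagIndex` class are purely illustrative stand-ins for the pod's real database and federation code.

```python
import requests

def pull_tag_posts(pod_url: str, tag: str) -> list[dict]:
    """Pull every post a pod has for one tag -- hypothetical endpoint."""
    resp = requests.get(f"{pod_url}/api/tags/{tag}/posts", timeout=10)
    resp.raise_for_status()
    return resp.json()["posts"]

class TagIndex:
    """Toy in-memory stand-in for a pod's tag database."""
    def __init__(self, known_pods: set[str]):
        self.known_pods = known_pods
        self.posts_by_tag: dict[str, list[dict]] = {}

    def search(self, tag: str) -> list[dict]:
        """The first search for an unknown tag is a slower pull from every known
        pod; after that the tag is cached and new posts arrive by the usual push."""
        if tag not in self.posts_by_tag:
            posts: list[dict] = []
            for pod in self.known_pods:
                try:
                    posts.extend(pull_tag_posts(pod, tag))
                except requests.RequestException:
                    continue   # unreachable pod; just skip it
            self.posts_by_tag[tag] = posts
        return self.posts_by_tag[tag]
```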
Non-communication
There are also some circumstances in which an established pod doesn't receive data that are pushed – for example, if a pod goes offline for a while or is temporarily over capacity. In these circumstances, it would be helpful if the pod can pull data when it goes back online.
At the moment, when Pod A can't push data to another pod (Pod B), it puts the data back into its send queue and retries a number of times at intervals. After the last of these retries, Pod A stops trying, whether or not it has succeeded; and if it hasn't, there is no way for those data to get from Pod A to Pod B.
For my suggestion to work, if the data still cannot be pushed at the end of this process of retries, Pod A should write all data destined for Pod B to a log rather than placing them back in its queue. Pod B is placed on a list of 'pods incommunicado, do not attempt to communicate', and from then on any new data destined for Pod B are added to this log instead of being pushed, which would save network resources. (Pod A could perhaps continue to attempt communication with Pod B, say once a day, and if successful then push the logged data.)
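Sketched in Python, the retry-then-log behaviour could look roughly like this. The `push()` helper, the `/receive` path and the retry count are stand-ins for Diaspora's real delivery code, and the intervals between retries are elided.

```python
import requests
from collections import defaultdict

def push(pod_url: str, payload: dict) -> bool:
    """Attempt a normal federation push; True on success.
    The /receive path merely stands in for Diaspora's real delivery code."""
    try:
        resp = requests.post(f"{pod_url}/receive", json=payload, timeout=10)
        return resp.ok
    except requests.RequestException:
        return False

class OutboundQueue:
    """Toy model of the retry-then-log behaviour described above."""
    MAX_RETRIES = 3   # assumed; real retry counts and intervals are a separate question

    def __init__(self) -> None:
        self.incommunicado: set[str] = set()                   # pods we've given up pushing to
        self.logs: dict[str, list[dict]] = defaultdict(list)   # data waiting for each such pod

    def deliver(self, pod: str, payload: dict) -> None:
        if pod in self.incommunicado:
            self.logs[pod].append(payload)    # don't even try; just log it for later
            return
        for _ in range(self.MAX_RETRIES):     # intervals between retries elided
            if push(pod, payload):
                return
        # every retry failed: mark the pod and start logging for it
        self.incommunicado.add(pod)
        self.logs[pod].append(payload)
```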
When Pod B is back online, it immediately communicates with all pods known to it and says: 'I'm back. What have I missed?' When Pod A receives this communication, it refers to its log for Pod B, retrieves the data and sends them to Pod B, and once it receives confirmation that this transfer has been successful, deletes the log and removes Pod B from the 'do not communicate' list.
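And the catch-up exchange, continuing the `OutboundQueue` sketch above (the `/catch_up` endpoint is again hypothetical):

```python
import requests

def announce_return(my_url: str, known_pods: set[str]) -> None:
    """Pod B's side: tell every pod it knows 'I'm back. What have I missed?'."""
    for pod in known_pods:
        try:
            requests.post(f"{pod}/catch_up", json={"pod": my_url}, timeout=10)
        except requests.RequestException:
            continue

def handle_catch_up(queue: OutboundQueue, pod: str) -> None:
    """Pod A's side: replay everything logged for the returning pod, and only once
    every item is confirmed delivered, drop the log and take the pod off the
    'do not communicate' list."""
    pending = queue.logs.get(pod, [])
    if all(push(pod, payload) for payload in pending):
        queue.logs.pop(pod, None)
        queue.incommunicado.discard(pod)
```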
This should (a) allow pods to receive data pushed when they were unavailable, and (b) save network resources currently wasted by pods trying to communicate many times with pods which are unavailable.
There may be other circumstances in which it would be good for a pod to be able to make a pull request – perhaps if it hasn't heard from a pod for a set period of time. However, this would involve pods keeping logs of data destined for other pods even when they haven't detected a communication problem, so it may be a waste of resources.
Nick Sat 17 Aug 2013 2:33PM
This suggestion sounds generally good to me as a non techie. I particularly like the tag-searching aspect of it.
Maciek Łoziński Mon 18 Nov 2013 1:51PM
I also like this solution more than some central-hub and tag-aggregator ideas. P2P is the way for us to go, I think. Diaspora has “Decentralization” as one of its key philosophies ( https://diasporafoundation.org/ ). I don't want it to lose that.
There are many working examples of decentralized networks (eDonkey, FreeNet), so it’s not impossible to do.
Jason Robinson Mon 18 Nov 2013 2:38PM
I hadn't read this properly when it was posted. While there are definitely some good things here and the whole idea might work, it sounds to me like a kind of "every pod knows every pod" thing. So while that would certainly solve some problems, it's not a solution that would scale. It's not realistic to have public posts, for example, federated in this way, unless we allow diaspora* as a network to stay small. I don't know about eDonkey and FreeNet, but afaik P2P is not what "everybody can follow anyone" is about. Diaspora works very well if you know who to follow. But if you just want to follow posts tagged with something, it simply will not scale; the work required to pass those messages around needs to be outsourced from the diaspora server code.
Maciek Łoziński Tue 19 Nov 2013 7:33PM
I think there are solutions to that - it could be enough for a certain pod to know a few other pods, which could just pass its query on. Another possibility - there could be some sort of shared list saying which pod knows which tags. Some kind of routing algorithm. Another thing that I find bad when we try to centralize something is that we have to maintain the code of this central hub, or relays, or tag aggregator.
Maciek Łoziński Tue 19 Nov 2013 7:34PM
oh, and "everybody knows everybody" is certainly a bad idea in my opinion :-)
Jason Robinson Tue 19 Nov 2013 7:50PM
@macieklozinski however something is done, there is always code to maintain. It's usually better to split features into separate components and not build one big product that does everything. The server code is already being cleaned up into separate repositories with the federation code being split out thanks to the amazing work by @florianstaudacher :)
Maciek Łoziński Tue 19 Nov 2013 8:16PM
There are some benefits to this, but I'd rather see a network of similar nodes connected to each other than a group of different services which need to be installed separately and depend on each other.
Jason Robinson Tue 19 Nov 2013 8:26PM
Sure, the pod should be something that just works, I agree. You're missing the point that the relay/hub/taggregator ideas are things that podmins don't need to install - the project with community volunteers would maintain those.
Maciek Łoziński Tue 19 Nov 2013 9:25PM
And you probably need time/money/discussion to keep those volunteers going. But, on the other hand, you also need those to maintain extra pod code...
Flaburgan Thu 8 Aug 2013 10:01AM
Well, we don't need to do that. To save resources (network, CPU, database), we try to talk only to the pods we need to, and as little as possible.
If I set up a pod and all my contacts are on my pod, or on only one external pod, why should I know the whole network?
So instead of a new pod, I would talk about a new user, meaning a user not known by my pod (no existing relation). Being able to pull the bio, old posts etc. the first time a user is reached by a pod is a good improvement. But that's only for the first relation: if someone else on my pod adds the contact after me, there is no need to pull; we will receive the data (pushed) because of the other sharing relation.
So knowing the whole network is useful only for the very special case of tag searching, and we definitely need to find a more global solution for that.