Social.coop - Tech working group

Fri 17 Aug 2018 10:34AM

Heads up on social.coop server space

Nick S

Just to report on the last server outage whilst I'm thinking about it....

This last one was resolved by @fardog :raised_hands: who happened to be awake at the right time. He discovered that the cause was the server running low on disk space on the main (root) partition.

He pruned some docker image cruft, but it's still currently at 93% full.

Now I can't explain exactly why it's so full, but obviously we need to do something about it or our server will start dying all the time. It's not the Mastodon database; that's on another disk (which is itself 64% full).
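To keep an eye on this before it causes another outage, a small check along these lines could be run from cron. This is only a minimal sketch: the function names and the 90% threshold are assumptions for illustration, not anything currently deployed on our servers.

```python
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_partitions(paths, threshold=90.0):
    """Return (path, percent) pairs for partitions at or above the threshold.

    The 90% default is an assumed alert level, chosen for illustration only.
    """
    report = [(p, percent_used(p)) for p in paths]
    return [(p, pct) for p, pct in report if pct >= threshold]


if __name__ == "__main__":
    # "/" is the root partition; add the database disk's mount point as needed.
    for path, pct in check_partitions(["/"]):
        print(f"WARNING: {path} is {pct:.0f}% full")
```

Note that the figure can differ slightly from what `df` reports, because `df` accounts for blocks reserved for root.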

(Perhaps the growth was related to the recent Mastodon influx I've been hearing about, but either way we should expect more users and more tooting...)

Sorting this out may require taking the server down for a while, I suspect.

There's also the question of backups. I notice the database backup file has quadrupled in size since about June (2G -> 8.4G), which probably needs investigation. I say 'backup' because we currently just have a manual backup of the database, and it's only run when someone remembers to. In order to protect ourselves from the various trousers-falling-down scenarios we might encounter, we need an automated backup, ideally generational, which also means another 10s to 100s of gigabytes of (off-server) disk space.
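By "generational" I mean something like keeping a few recent daily backups, plus one per week and one per month going further back. As a rough sketch of the retention logic only (the function name and the 7/4/6 counts are assumptions for illustration, and it says nothing about how the dumps are made or where they are stored):

```python
from datetime import date, timedelta


def select_generations(backup_dates, daily=7, weekly=4, monthly=6):
    """Return the set of backup dates to keep under a daily/weekly/monthly scheme.

    Keeps the newest `daily` backups, the newest backup in each of the most
    recent `weekly` ISO weeks, and the newest backup in each of the most
    recent `monthly` calendar months. Everything else may be pruned.
    """
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])

    def newest_per_bucket(bucket_of, limit):
        seen, chosen = set(), []
        for d in newest_first:
            if len(chosen) >= limit:
                break
            bucket = bucket_of(d)
            if bucket not in seen:
                seen.add(bucket)
                chosen.append(d)
        return chosen

    # Bucket by (ISO year, ISO week) and by (year, month) respectively.
    keep.update(newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly))
    keep.update(newest_per_bucket(lambda d: (d.year, d.month), monthly))
    return keep


if __name__ == "__main__":
    # Hypothetical history: one backup per day for the last 60 days.
    history = [date(2018, 8, 17) - timedelta(days=i) for i in range(60)]
    kept = select_generations(history)
    print(f"keeping {len(kept)} of {len(history)} backups")
```

The point of the scheme is that the space needed grows with the number of generations kept rather than with the number of days elapsed, which makes the off-server disk budget easier to estimate.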

Does anyone here know much about analysing Mastodon instances, or know someone who does?

And this touches on the issue of spending funds, which is a different issue but I'll mention here: perhaps we should allocate a budget to working groups, which they can spend at their discretion without the need to go back to the main / finance group?

For those with git.coop accounts, you can see the tickets I created on the recent outage and the disk space question here. I suggest we keep the technical discussion there as much as possible to spare those here who have been overwhelmed by Loomio chat. :) Anyone who wants an account can sign up following the instructions here https://git.coop/social.coop/

Victor Matekole Sat 18 Aug 2018 9:08PM

Sorry, "root servers" implies dedicated hardware/servers (not virtual); as far as I understood, Scaleway is a cloud service? You are correct regarding SATA vs SSD. However, Hetzner will allow mixed setups: we can have SSD for Postgres and SATA for less demanding parts of the stack. Either way, I am sure we'd pay less per GB than on Scaleway. But as I suggested earlier there is always a trade-off: having a dedicated server means we look after the hardware, so if a disk breaks we have to call Hetzner to replace it. From experience, they are reasonably fast in this case.

Nonetheless, I've always felt that 100GB was never going to be enough for our growth rate and long-term requirements. Hetzner was just an example, as I know them, but I have no bias. I just wanted to start a conversation where growth rate, performance and cost are carefully considered.

E.g. piece of hardware:

Intel Core i7-2600
2x HDD SATA 1,5 TB
1x SSD 240 GB
RAM 32GB DDR3
€45.38 / mth

Gil Scott Fitzgerald Sat 18 Aug 2018 9:11PM

I wonder if we could just throw postgres in RAM?

Mayel de Borniol Sat 18 Aug 2018 9:12PM

As indicated in the docs 'trunk' is a dedicated server, and 'toot' is VPS:
https://git.coop/social.coop/tech/operations/wikis/infrastructure-overview

Victor Matekole Sat 18 Aug 2018 9:15PM

I see ... Do they support upgrades of the disk and perhaps memory?

Victor Matekole Sat 18 Aug 2018 9:19PM

BTW — how do I get an account to git.coop? Just tried to register under my email address but was denied.

Fabián Heredia Montiel Sat 18 Aug 2018 9:33PM

Hi @victormatekole, check out this guide on the steps to get your git.coop account: https://git.coop/social.coop/general/wikis/getting-an-account

Nick S Sat 18 Aug 2018 11:14PM

I think one of our milestones should be the capability (duplicated amongst several people) to rebuild the server in the event it dies or gets hacked.

In order to learn how to do this, we need a server (or servers) to practice on.

I'd call this a "staging server".

Gil Scott Fitzgerald Sat 18 Aug 2018 7:27PM

IMO spend the money for a good experience and fewer headaches later

Victor Matekole Sat 18 Aug 2018 9:10PM

Disk consumption is now 80%, by the way, but there is more that can be trimmed from the media cache. I think someone restarted the Ruby app, and so the job I started got killed.

Nick S Sat 18 Aug 2018 11:21PM

Wasn't me, honest!

In general I aim to go to the riot.im channels to check if anything's going on on our servers, or to announce it on the public channel if I'm there doing something. I suggest this'd be a good policy for everyone to follow, to help avoid tripping each other up by mistake.

open channel: https://riot.im/app/#/room/#SocialCoop:matrix.org
encrypted private channel: https://riot.im/app/#/room/#tech.social.coop:matrix.org

Nick S Sun 19 Aug 2018 9:27AM

Also, I should add, if this was running in a docker container, I have been noticing a lot of 'dying and restarting' events when browsing the datadog account Mayel (I think) set up to monitor our servers. (Maybe it was you originally, however you did say it was unused and should be removed, and it seems to be a new free account).

If you have any experience interpreting these, I'd be interested what you think...

And anyone else on the tech team who's interested, go and have a look, it's quite impressive. I can either paste the credentials into the tech group's private channel, or maybe I'll get time to get keyringer set up.

Victor Matekole Sun 19 Aug 2018 3:23PM

Glad you are finding Datadog useful, it is a pretty amazing tool! I thought it should be killed as I understood they were removing their free option, or at least limiting it to 30 days... I may have got that wrong; last time I checked I could not gain access with my current credentials for social.coop. If you send me the credentials I'd be happy to give my 2 cents...

Ian Smith Mon 20 Aug 2018 11:49PM

Social.coop returning 502 bad gateway. @victormatekole @wulee

Nick S Tue 21 Aug 2018 9:55AM

Thanks. As I mentioned in the chat channel, it seems to have resolved itself...

There've been a bunch of outages like this, in which there's a 502 or similar, and a pingometer/pingdom notification, which mysteriously resolves itself. I'm a bit of a newbie with docker, but it looks like one of the containers will die and then restart. I'd like to know why this happens, I'm still researching that. Maybe @mayel or @victormatekole or one of the other admins will be able to shed some light on that, but at least it isn't currently a critical problem (and I don't think it's a disk related problem).

Nick S Tue 21 Aug 2018 9:58AM

Timezones: I think we have admins who can fix server issues in the EU and US timezones (assuming they're not indisposed for some reason). Do we have anyone in the Asian timezones in between who could do this?

Victor Matekole Fri 24 Aug 2018 7:10AM

When I have a chance to look at Datadog I will check to see what may be the root cause. When I look at memory consumption there is only 200MB free. I wonder if we are hitting some memory limits, which is common with Rails apps as they tend to be resource-heavy and leak memory, especially from poorly written 3rd-party packages.

Nick S Fri 24 Aug 2018 7:42AM

I was trying to get the memory/CPU load overlaid with docker events, to see if they correlate. I think I managed it and concluded that the memory grows and then gets reset when there's an event, but this is across the whole system, and it doesn't imply that memory causes the events rather than vice versa.
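For anyone who wants to repeat the comparison outside the dashboard, the idea is roughly the following. It is a sketch only: the function name is made up, and the sample data is hypothetical, not exported from our Datadog account.

```python
from bisect import bisect_left


def memory_around_events(samples, events):
    """Pair each event with the memory readings just before and just after it.

    `samples` is a list of (timestamp, used_mb) tuples sorted by timestamp;
    `events` is a list of event timestamps (e.g. container restarts).
    Events with no sample on both sides are skipped.
    """
    times = [t for t, _ in samples]
    result = []
    for ev in events:
        i = bisect_left(times, ev)  # index of first sample at or after the event
        if 0 < i < len(samples):
            result.append((ev, samples[i - 1][1], samples[i][1]))
    return result


if __name__ == "__main__":
    # Hypothetical readings: seconds since start, system-wide memory used in MB.
    samples = [(0, 500), (60, 700), (120, 900), (180, 300), (240, 450)]
    for ev, before, after in memory_around_events(samples, events=[150]):
        print(f"event at t={ev}: {before} MB before, {after} MB after")
```

A drop across an event only shows that memory was released when the container restarted. It can't tell us by itself whether memory pressure triggered the restart.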

Chris Croome (Webarchitects Co-operative) Tue 21 Aug 2018 9:50AM

Hi, one option you could consider for hosting is buying your own hardware. If you can raise the capital, you could get a 1U server with a lot of RAM, SSDs and HDDs which could run everything (assuming you run a hypervisor on it and multiple virtual servers), with space for development servers and backups (though you would probably also want backups elsewhere), and then colocate it with a hosting co-operative. Most new servers come with a three-year warranty; it would make sense to budget for renewing the hardware after 3 years, however at that point the old machine could be used as a backup as, in my experience, servers can generally be run for about ten years.

Victor Matekole Fri 24 Aug 2018 7:02AM

I like the idea of owning bare metal! Some cost-benefit analysis would have to be performed, but I suspect it would be cheaper in the long run as the network grows in numbers.
