Intermittent media server load problems

We’ve been seeing some general slowdowns in our image and media file serving recently, including some instances in the last couple of days where the sites as a whole were affected to the point of extreme slowness or temporary inaccessibility.
Domas believes this is related to a reported problem with NFS performance when ZFS snapshots are active. Things have improved so far after we dropped older snapshots (possibly helped by restarting NFS and temporarily disabling the image scaler servers to give the file server a little breathing room to reset).
We’ve been planning for some time to redo the way we access our media files internally, which should help reduce the impact on the rest of the site when load problems occur on the file servers. We might also be able to spread the load among multiple servers to improve things even more.
Updates will come as we get things back on track…
Update 2009-07-15: We’re temporarily shutting off uploads while we apply the ZFS fix patch and reboot the main file server. You may see some missing images or funky error messages for a little bit, but the sites should otherwise continue working normally until the file server is back up.
Update 2: Server is patched and uploads are back online. This should resolve our performance problems while we continue rearranging the upload servers to be more future-proof.
Brion Vibber, Lead Software Architect

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

13 Comments

Please consider posting a “We are aware of the problem and are working on it” type message as a watchlist message or other prominent place with a pointer to this or other status page. It is very helpful to alleviate user frustration by letting them know that they are not the only one having problems and that the issue is being worked on. Thanks.

Yeah, trying to see what’s the best way we can get a decent status message going that’s easy to find and doesn’t annoy folks. 🙂

and that does not make the server go down even more 😀

Yeah, I was going to suggest CentralNotice… but maybe not? 😀

good post… i can get many information on this site… thank’s….thank’s….

Have been having trouble uploading larger files recently; higher than normal error rate. Could this be related?

Quite likely; the problems we’re seeing cause the file server to ssslllloooowwww down, which can cause delays or failures when you need to touch files — that’ll certainly include uploading them, and the bigger it is the more likely you are to hit it at a bad time or get delayed extra long.
We’re prepping a patch to the Solaris kernel now which should resolve it, as well as setting up to split off the thumbnail load to another machine.

I’m noticing it

Guess I picked a terrible time to start uploading some photos. The first five or six went through (some gave an error message at the end, but the photo was uploaded correctly), but I finally hit a wall with one that hasn’t uploaded in a half-dozen (long) attempts over the past 20 minutes. Guess I’ll take a break.

For crying out loud!

Ok, we’ve patched the server, which should resolve the super-slowness bug for now while we continue rearranging servers to keep more breathing room on the disks.

Out of curiosity, what patch was required to fix the issue?

The patch changes the way the ZFS block allocator searches for a spot to save things in high-fragmentation cases. (It didn’t fix it entirely for us, but it should at least push back the threshold at which we start seeing the pathological behavior.) We tracked down one of the ZFS devs at OSCON, so we should be able to get some more detailed poking if we can replicate the conditions. 🙂
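
For anyone wondering why the search strategy matters on a fragmented disk, here is a toy sketch in Python. It is not the ZFS code, and nothing above says which strategy the patch actually uses; the function names and the free-space layout are hypothetical, chosen only to show how a linear scan degrades when free space is chopped into many tiny holes, while a lookup indexed by size does not.

```python
from bisect import bisect_left

def make_fragmented_map(small_holes, hole_size, big_hole):
    """Build a hypothetical free-space map: many tiny holes, then one big one.

    Each entry is (start_offset, length), listed in address order.
    """
    segments = [(i * 2 * hole_size, hole_size) for i in range(small_holes)]
    segments.append((small_holes * 2 * hole_size, big_hole))
    return segments

def first_fit(segments, size):
    """Scan in address order; return (start or None, segments examined)."""
    examined = 0
    for start, length in segments:
        examined += 1
        if length >= size:
            return start, examined
    return None, examined

def size_indexed_fit(by_size, size):
    """Binary-search a list of (length, start) pairs sorted by length.

    Returns the start of the smallest segment that fits, or None.
    """
    i = bisect_left(by_size, (size, 0))
    return by_size[i][1] if i < len(by_size) else None

if __name__ == "__main__":
    segments = make_fragmented_map(small_holes=1000, hole_size=1, big_hole=64)
    by_size = sorted((length, start) for start, length in segments)

    start, examined = first_fit(segments, 8)
    print("first fit found offset", start, "after examining", examined, "segments")
    print("size-indexed lookup found offset", size_indexed_fit(by_size, 8))
```

In this made-up layout the linear scan has to walk past every tiny hole before it reaches one that fits, which is the flavor of slowdown that gets worse as fragmentation grows.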