catalog-sig@python.org

Re: [Catalog-sig] Rewrite PyPI for App Engine?

Subject: Re: [Catalog-sig] Rewrite PyPI for App Engine?
From: Ian Bicking
Date: Fri, 25 Jun 2010 11:49:16 -0500
On Fri, Jun 25, 2010 at 3:39 AM, M.-A. Lemburg <mal@xxxxxxxxxx> wrote:
Ian Bicking wrote:
> On Thu, Jun 24, 2010 at 5:16 PM, M.-A. Lemburg <mal@xxxxxxxxxx> wrote:
>
>> Almir Karic wrote:
>>> i would like to help out with the move.
>>>
>>> is anyone actually opposed to moving to GAE (either moving the current
>>> code base or re-write, whichever seems more appropriate)?
>>
>> I don't think people are opposed to having a PyPI clone on GAE,
>> but moving the existing installation to GAE is something we would
>> have to discuss separately.
>>
>> I for one would not welcome such a change, since we then completely
>> lose control over service availability.
>>
>
> I don't really understand what this means. Services become unavailable
> sometimes. A computer breaks, a company shuts down, an agreement ends. We
> don't necessarily have "control" over these situations, but we can respond
> to them. If App Engine goes down and the App Engine team is all like
> "whatever, we'll get around to fixing stuff sometime" then sure it's a
> problem. But it's not a plausible problem. The plausible problem is that
> App Engine goes down, as it has from time to time, and we have to wait for
> them to figure out what's wrong and fix it. *We* don't have to fix it, we
> only have to *wait for someone else to do it*. I don't see any reason why
> *we* are any better at fixing issues than the App Engine team would be.
> Also presumably when there is a failure we want for the failure to be
> understood and avoided in the future. The App Engine team does that. And
> they do that *for us*.

I hear you, but don't agree that putting the runtime into the
hands of the GAE would get us an overall better service :-)

The point is that with GAE you only have control over the code
that you post there. Everything else is under control of the GAE
team (and their automatic administration systems), i.e. whether
your data is available and whether there are
proper backups, whether the site is reachable or not, whether
the performance is available and meets your requirements, whether
the service is accessible, fast enough and has low latency, etc.

So if something breaks, you can only fix it, if the problem
is caused by a bug in the code. For all other situations, you
have to wait for the GAE team to go in and do whatever is needed.

I'm not saying that the GAE team would be doing a poor job,
but just sitting there waiting for them to fix it in any
of the typical problem situations (apart from a bug in the
code), is asking a bit much, IMHO.

If GAE was just another hosting system, then sure -- but it's not. For instance, Noah mentioned that if Apache went down (or the equivalent) there's someone with a pager who will respond to it. Except GAE isn't actually like that; application instances can be automatically killed, and machines are monitored automatically and taken out of the pool as necessary. We wouldn't be replacing our diligence with Google employees; we'd be replacing it with machines.

Of course there might be network problems or Google's own problems growing the service. But a substantial class of problems (problems that I believe have actually caused downtime) are simply eliminated from the system. GAE has fewer serviceable parts; that looks like losing control but it's really the normal progression away from manual interactions. I would really like it if there was an open source alternative that provided that kind of infrastructure, but there isn't.

Another advantage to GAE is that if there are application errors, it would be much easier for anyone to work on them -- anyone can sign up and receive a free GAE account and deploy the code with almost no effort, and they will have hosting that is completely equivalent to anyone else's hosting. The only difference would be the data set, and it is possible (maybe even likely) that some class of problems will only be noticeable with a full dataset. That's true now as well, like for some UI problems where pages have become unwieldy, and I think it would be really helpful (regardless of GAE) if PyPI had a cleaned-up export built into it.

Other cloud service providers provide something very different from GAE, and I don't think they would give a lot of benefit. The one advantage I see is that we (well, anyone) could spin up a new instance in a consistent state. Everything else is basically the same, including all the same management issues -- there's no one to kick Apache except us, for instance. Honestly if I have any skin in the game it's actually for a system like this, as I've been working on this sort of infrastructure (http://cloudsilverlining.org) -- I only propose GAE because I genuinely think it will work best for a volunteer-run piece of infrastructure like PyPI.

We have to find a middle ground, where we can still apply the
necessary hand holding ourselves, if we like to, while leaving
most of the day-to-day tasks to automatic tools or other service
providers to deal with.

Since PyPI is becoming a central piece of Python community
infrastructure, we need to make sure that we can provide a very
good uptime of the service and fast access to the data,
esp. for the automatic download tools.

Fortunately, those tools only use static data, so focusing on
making that highly available will get us a much better service
uptime with little extra effort.

> In some catastrophic case we could move the site to another server, use
> TyphoonAE to move the code over (or simply require that there is a
> sufficient abstraction layer to allow for a more normal environment) and
> bring the site up. We control the domain, we can ultimately control where
> it is hosted. This kind of failure seems like it would be far more likely
> given our current situation than on App Engine, but moving to App Engine
> would not somehow make this kind of move impossible.

True, but do you really want to go through all that trouble
just because GAE is down or too slow to be usable again ?

That's the catastrophic case, where Google decides they don't care about App Engine or something like that. Right now we'd have to do the same thing if the server's hard disk dies, which is obviously far more likely.

If we were to go for a cloud service to deploy the PyPI runtime, I'd
much rather like to see a standard virtualized server approach
being used.

With that approach, moving (virtual) servers would take
at most 5 minutes, if needed at all - you can rather easily setup
virtual servers as high availability cluster and then have
them manage the failover all by themselves.

Setting up infrastructure for fail-overs is hard, and it would be easy for us to set it up for the wrong pieces (the ones that aren't breaking). In some sense this is why I'm not excited about mirroring, because I don't think it's fail-over for the pieces likely to break.

I do like the static file proposal, also. I think just putting more content into static files could potentially fix most of our problems, along with maybe a bit of server tweaking (to make sure even if PyPI goes down, it doesn't take Apache and the static files with it). I think using a CDN would be a nice step for speed, but is less important for reliability; I think generating things with a cron job will reduce reliability because it's exactly the kind of behind-the-scenes machinery that could break without someone noticing, and we don't have a dedicated staff paying attention to things like that. If a new package registration breaks, I'd far rather it be rejected immediately (e.g., from setup.py register) than for a broken cron job to keep it from getting in the simple index.
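
To make that last point concrete, here is a minimal sketch (modern Python 3, not PyPI's actual code) of regenerating a package's simple index page inside the registration request itself, so a failure surfaces to the uploader right away. The names write_simple_index and register_package, the simple/<package>/index.html layout, and the (status, message) return value are all assumptions made up for illustration.

```python
import html
import os
import tempfile


def write_simple_index(root, package, filenames):
    """Write <root>/simple/<package>/index.html atomically; raise on failure."""
    package_dir = os.path.join(root, "simple", package)
    os.makedirs(package_dir, exist_ok=True)
    links = "\n".join(
        '<a href="{0}">{0}</a><br/>'.format(html.escape(name, quote=True))
        for name in sorted(filenames)
    )
    body = "<html><body>\n{}\n</body></html>\n".format(links)
    # Write to a temporary file in the same directory, then rename it over
    # the old page so readers never see a half-written index.
    fd, tmp_path = tempfile.mkstemp(dir=package_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body)
        os.replace(tmp_path, os.path.join(package_dir, "index.html"))
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def register_package(root, package, filenames):
    """Hypothetical registration hook: fail the request if the index fails."""
    try:
        write_simple_index(root, package, filenames)
    except OSError as exc:
        # Reject immediately so the uploader sees the error, instead of
        # leaving it for a background job to trip over later.
        return (500, "registration failed: {}".format(exc))
    return (200, "registered {}".format(package))
```

The static page is still served by the web server alone, but it is written as part of the request that changes it, so there is no separate job that can silently stop running.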

--
Ian Bicking | http://blog.ianbicking.org
_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@xxxxxxxxxx
http://mail.python.org/mailman/listinfo/catalog-sig