[jetty-dev] New Scheduling idea for 9.3


We have been working on http/2 for 9.3 and there is a scheduling itch that I've wanted to scratch since we introduced SPDY.

The issue is that a pure threadpool model is not mechanically sympathetic with multiplexed connections.      It is going to be a bit of work to fix this, so I thought I would outline my ideas here first before committing too much effort.

Specifically, multiplexed connections as used by SPDY and HTTP/2 push us towards a scheduling model where the thread/core that does the handling is not the thread/core that did the parsing. This has very bad cache behaviour, and we have previously documented the issue in our blog: https://webtide.com/avoiding-parallel-slowdown-in-jetty-9/

The way that we currently handle multiplexed connections is as follows (a rough code sketch follows the list):
1. The selector detects IO activity on a connection and dispatches a task to call onFillable, which:
1.1. Reads the data on the connection
1.2. Parse all the data received as frames; if the frames are requests, this also involves decompressing them and inflating them into name:value fields, plus the method and URI.
1.3. For each request, dispatch a task to:
1.3.1 Handle the request
1.3.2 Generate the response
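
To make that concrete, here is a very rough sketch of the current flow. The names (Parser, Request, fill(), handle()) are just illustrative stand-ins, not the actual Jetty classes; the point is only that every parsed request is handed back to the shared thread pool at 1.3.

    import java.nio.ByteBuffer;
    import java.util.concurrent.Executor;

    // Illustrative only: Parser, Request, fill() and handle() are stand-ins.
    class DispatchPerRequestConnection
    {
        private final Executor threadPool;
        private final Parser parser;

        DispatchPerRequestConnection(Executor threadPool, Parser parser)
        {
            this.threadPool = threadPool;
            this.parser = parser;
        }

        // 1. called when the selector sees IO activity on this connection
        void onFillable()
        {
            ByteBuffer buffer = fill();                       // 1.1 read the data
            for (Request request : parser.parseAll(buffer))   // 1.2 parse all frames into requests
                threadPool.execute(() -> handle(request));    // 1.3 dispatch each request (1.3.1 + 1.3.2)
        }

        ByteBuffer fill() { /* read from the channel */ return ByteBuffer.allocate(0); }
        void handle(Request request) { /* 1.3.1 handle the request, 1.3.2 generate the response */ }

        interface Parser { Iterable<Request> parseAll(ByteBuffer buffer); }
        interface Request {}
    }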

So the problem is that at 1.3 the CPU core has a hot cache full of all the request details, but the chance of the same core executing 1.3.1 is roughly 1-in-N for N cores: about 50% on a 2-core machine and only about 12% on an 8-core machine. So 1.3.1 will usually have a cold cache and have to load all the request data. Even if it is the same core, because of the dispatch delay the cache may get flushed by other tasks by the time 1.3.1 is executed. We also have to incur N+1 dispatches for each batch of N requests.

Worse yet, requests tend to come in batches, often many tens of requests. So 1.3 will often dispatch 10 or more requests for handling, which will be sequential in the thread pool queue and thus almost always be taken by different cores - so each core on the machine will end up executing requests from every connection. As requests from the same connection use common data (e.g. the session), this means that all the cores will be contending for the same data and flushing each other's caches.

The pathological example here is if we have 8 cores and 80 connections that each send a batch of 8 requests,  we get a situation where each core will handle 1 request from each of the 80 connections and it will never have a hot cache and will contend with all 7 other cores for the session data.


What we want to achieve is for one CPU core to execute all the requests from the same connection, preferably immediately after parsing each request. This will give us the best cache performance and minimal contention. But it is not as simple as removing the dispatch at 1.3 and just iterating over the request handling, because request handling can block and we have to be able to produce responses out of order! If we have 10 requests and the first blocks on the DB/input/output, we have to give the next 9 requests some CPU to see if they can run to completion (otherwise we are just repeating all the mistakes of HTTP/1.1 pipelining).

So here is the pattern of how I think we can achieve this, with a suitable atomic state machine impl (a rough code sketch follows the list):

1. The selector detects IO activity on a connection and dispatches a task to call onFillable, which:
1.1. If another thread is currently between 1.1 and 1.4, then return.
1.2. Read any data on the connection.
1.3. Parse only up to the next request.
1.4. If there is more data to parse, then dispatch a task to 1.1.
1.5. Handle the request
1.6. Generate the response
1.7. Loop back to 1.1
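
Here is a very rough sketch of that loop, using the same illustrative stand-ins as before and a single atomic boolean for the "a thread is between 1.1 and 1.4" state. A real implementation would need a richer atomic state machine (for example, to cope with new data arriving just as the producer exits), so treat this only as the shape of the idea:

    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative only: not the proposed implementation, just the shape of the loop.
    class ProduceThenHandle implements Runnable
    {
        private final Executor threadPool;
        private final Parser parser;
        private final AtomicBoolean producing = new AtomicBoolean(); // true while a thread is between 1.1 and 1.4

        ProduceThenHandle(Executor threadPool, Parser parser)
        {
            this.threadPool = threadPool;
            this.parser = parser;
        }

        // Dispatched by the selector on IO activity, and again by step 1.4.
        @Override
        public void run()
        {
            while (true)
            {
                if (!producing.compareAndSet(false, true))
                    return;                               // 1.1 another thread is between 1.1 and 1.4

                Request request;
                try
                {
                    fill();                               // 1.2 read any data on the connection
                    request = parser.parseNext();         // 1.3 parse just the next request
                    if (parser.hasMoreData())
                        threadPool.execute(this);         // 1.4 dispatch a task back to 1.1
                }
                finally
                {
                    producing.set(false);                 // leaving the 1.1-1.4 section
                }

                if (request == null)
                    return;                               // nothing parsed, go back to the pool

                handle(request);                          // 1.5 handle the request
                                                          // 1.6 generate the response
            }                                             // 1.7 loop back to 1.1
        }

        void fill() { /* read from the channel into the parser's buffer */ }
        void handle(Request request) { /* handle and generate the response */ }

        interface Parser { Request parseNext(); boolean hasMoreData(); }
        interface Request {}
    }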

So if a single request arrives, a single thread is dispatched and it runs 1.1 to 1.6 and handles it without any additional dispatches. It handles the request with a hot cache and there is no contention by definition (1 request).

If multiple requests arrive, then we set up a deliberate race that can play out in a number of ways. For fast request handling the sequence will look like:
  • Request N is parsed from the data by thread X.
  • There is more data to be parsed, so a task is dispatched (thread Y)
  • Thread X handles request N and generates a response.
  • Thread X gets back to 1.1 before Y
  • Thread Y arrives at 1.1 and sees X at 1.1-1.4, so Y just returns

If that pattern repeats, we still get N+1 dispatches for a batch of N requests, but N of them are for thread Y, which does a no-op and goes back to the thread pool. Thread X does all the parsing, handling and generating. Thus we have good cache behaviour, low contention, and just N thread wakeups that don't flush the caches of other cores, which are free to do other tasks.

Now if one of the requests is slow or blocks, we get a different pattern:

  • Request N is parsed from the data by thread X.
  • There is more data to be parsed so task is dispatched to 1.1 (thread Y)
  • Thread X starts handling request N 
  • Thread Y gets to 1.1 before X, so it tops up the data and parses the request N+1.
  • There is more data to be parsed so task is dispatched to 1.1 (thread Z)
  • Thread Y starts handling request N+1
  • Threads X, Y & Z are all in a race to get to 1.1, and all outcomes are possible, ranging from:
    • one of the threads gets to 1.1 first and the other 2 threads see it there and return, to
    • one of the threads gets to 1.1 but leaves before the next thread arrives, so all 3 threads go on to handle new requests.

So in the worst case, all N requests will be handled slowly, so a new task will be dispatched for each request. We are still at N+1 dispatches, but at least each core will parse and handle the same request, so we have good cache behaviour. We still might get multiple cores processing at once, so contention on sessions could still be an issue. We could probably reduce contention a little bit by being reluctant to dispatch new tasks at 1.4, i.e. if a previous iteration has already dispatched a task that has not yet arrived at 1.1, don't dispatch another. This is also self-correcting, as on a busy server the thread pool queue gets longer, so threads Y and Z will take longer to arrive, giving more time for X and maybe Y to do all the handling.
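
As a sketch of that "reluctant dispatch" refinement (again with illustrative names, not real code), a second flag could record whether a dispatch from 1.4 is still waiting in the thread pool queue, and 1.4 would skip the dispatch while it is:

    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative only: the earlier sketch plus a "pending dispatch" flag.
    class ReluctantProduceThenHandle implements Runnable
    {
        private final Executor threadPool;
        private final Parser parser;
        private final AtomicBoolean producing = new AtomicBoolean();
        private final AtomicBoolean pending = new AtomicBoolean();   // a 1.4 dispatch has not yet reached 1.1

        ReluctantProduceThenHandle(Executor threadPool, Parser parser)
        {
            this.threadPool = threadPool;
            this.parser = parser;
        }

        @Override
        public void run()
        {
            pending.set(false);                           // this dispatched task has now arrived at 1.1
            while (true)
            {
                if (!producing.compareAndSet(false, true))
                    return;                               // 1.1 another thread is producing

                Request request;
                try
                {
                    fill();                               // 1.2
                    request = parser.parseNext();         // 1.3
                    // 1.4 be reluctant: only dispatch if no earlier dispatch is still in flight
                    if (parser.hasMoreData() && pending.compareAndSet(false, true))
                        threadPool.execute(this);
                }
                finally
                {
                    producing.set(false);
                }

                if (request == null)
                    return;

                handle(request);                          // 1.5 + 1.6
            }                                             // 1.7
        }

        void fill() { /* read from the channel */ }
        void handle(Request request) { /* handle and generate the response */ }

        interface Parser { Request parseNext(); boolean hasMoreData(); }
        interface Request {}
    }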

The best case scenario is that thread X manages to parse and handle all the requests in the batch before thread Y reaches 1.1. In this case we will have only 2 dispatches, good caching and no contention.

So the pathological example now becomes: if we have 8 cores and 80 connections that each send a batch of 8 requests, then we get a situation where each core will handle 8 requests from just 10 connections, so it will always have a hot cache and will never contend with the other cores for the session data. I doubt we will ever achieve this perfect result, but even if we only partially achieve it we will get better performance.

Currently I can't see any significant downside to trying this. Worst case is no more dispatches than we have now.

There is a possibility that parsing might be a little slower, as each request may be parsed by a different core, but it is unlikely that much of the next request would have been loaded into the previous core's cache anyway... and you do have to live with some cache loads sometimes.

If we are to implement this, I think I can wrap up the pattern as an abstract task producing/handling class (producing == parsing, handling == handling), so the scheduling model is separate from the connection handling. We should then be able to plug in different models to see if we can improve on this or revert to the previous model.
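
For example (purely hypothetical names, not an existing Jetty API), the split could look something like this, with the current behaviour as one interchangeable strategy over a common producer:

    import java.util.concurrent.Executor;

    // Illustrative only: a hypothetical shape for the abstraction, not an actual Jetty API.
    interface Producer
    {
        // Produce (i.e. parse) the next task to handle, or null if there is nothing to do.
        Runnable produce();
    }

    // One pluggable strategy per scheduling model; the selector just dispatches run().
    // This one is the current model: hand every produced task to the thread pool.
    class DispatchEveryTask implements Runnable
    {
        private final Producer producer;
        private final Executor threadPool;

        DispatchEveryTask(Producer producer, Executor threadPool)
        {
            this.producer = producer;
            this.threadPool = threadPool;
        }

        @Override
        public void run()
        {
            Runnable task;
            while ((task = producer.produce()) != null)
                threadPool.execute(task);
        }
    }

The loop proposed above would then just be another strategy over the same Producer, so switching between the models (or back to the old one) becomes a configuration change rather than a rewrite of the connection code.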

Thoughts? Does this look worthwhile? Am I missing anything?


cheers

--
Greg Wilkins <gregw@xxxxxxxxxxx>
http://eclipse.org/jetty HTTP, SPDY, Websocket server and client that scales
http://www.webtide.com  advice and support for jetty and cometd.
