Bug 392936 - Rare mixup of request content
Summary: Rare mixup of request content
Status: RESOLVED FIXED
Alias: None
Product: Jetty
Classification: RT
Component: server
Version: 8.1.2
Hardware: Other Linux
Importance: P3 normal
Target Milestone: 7.6.x
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords: consistency, core, investigate
Depends on:
Blocks:
 
Reported: 2012-10-26 13:38 EDT by Per Steffensen CLA
Modified: 2012-11-05 03:16 EST
CC List: 2 users

See Also:


Attachments
Selected logging for test run reproducing the bug. Plus conclusions/thoughts (127.63 KB, application/rtf)
2012-10-26 14:03 EDT, Per Steffensen CLA
Chart helping to understand the request flow and components involved. Plus places where request info has been sampled (90.39 KB, image/png)
2012-10-26 14:05 EDT, Per Steffensen CLA
Solr main webapp filter - java (20.47 KB, text/plain)
2012-10-26 14:07 EDT, Per Steffensen CLA
Modified (for logging) Jetty StreamEndPoint - java (6.48 KB, text/plain)
2012-10-26 14:08 EDT, Per Steffensen CLA
Solr webapp config - web.xml (6.03 KB, text/plain)
2012-10-26 14:08 EDT, Per Steffensen CLA

Description Per Steffensen CLA 2012-10-26 13:38:04 EDT
Build Identifier: jetty-8.1.2.v20120308

Hi

I have found a bug in Jetty 8.1.2.v20120308.

When running with many concurrent clients sending requests to Solr (a web-app running on top of Jetty), the content of the HTTP requests is sometimes (very rarely) mixed up by the time Jetty hands them over as HttpServletRequest objects to the doFilter method of a javax.servlet.Filter configured in web.xml. When "hacking" org.eclipse.jetty.io.bio.StreamEndPoint to log something about what is actually received over the wire, the requests do not seem to be mixed up "yet". So something goes wrong in Jetty somewhere between receiving the data over the sockets and handing the (parsed) request over to a filter (or servlet, probably) as an HttpServletRequest.

A little about what I mean by "mixed up":

* Data in request A: abcdefghijklmnopqrstuvwxyz
* Data in request B: 1234567890987654321
The requests are longer in my concrete cases, and that is probably important, but not for the purpose of illustration.

Data read by consecutive "fill"-calls in org.eclipse.jetty.io.bio.StreamEndPoint - e.g.:
Regarding request A:
* abcdefghij
* klmnopqrstu
* vwxyz
Regarding request B:
* 12345678909876
* 54321
Seems ok

Data "behind" HttpServletRequest-object handed over to filter/servlet:
Request A: abcdefghijklmnopqrstuvw345678909876

WUT!!!!

Closer investigation shows that it is "always" (well, I do not have much statistical material) the last part of the data read in the last "fill"-call for the "first request" (request A in this example) that is "magically" replaced by something from a "later request" (request B in this example) by the time the data reaches the filter/servlet. Also, the size of the correct part of the data read in the last "fill"-call for request A is equal to the size of the data in the "violating" "fill"-call (request B) that precedes the data that gets mixed from request B into request A. Never mind if you didn't understand (I almost did not :-) ) - let me illustrate by example. E.g. let's say (as the example above also suggests) that it is the last "fill"-call for request A that gets mixed up with data from the first "fill"-call for request B. The "end result" seen in request A can be:
* abcdefghijklmnopqrstuv2345678909876
* abcdefghijklmnopqrstuvw345678909876
* abcdefghijklmnopqrstuvwx45678909876
* abcdefghijklmnopqrstuvwxy5678909876
But not e.g.
* abcdefghijklmnopqrstuvw5678909876

To me it really seems like org.eclipse.jetty.io.bio.StreamEndPoint reuses (/shares) byte-arrays among requests, and that a byte-array used by request A is "released" for use by request B before request A has finished reading from it - maybe org.eclipse.jetty.io.bio.StreamEndPoint is finished using the byte-array for request A, but layers higher up in Jetty might not have read from the byte-array yet to get the data for request A's HttpServletRequest object. But what do I know?
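
To make the suspicion concrete, here is a minimal, purely hypothetical Java sketch (NOT Jetty code, just an illustration of the suspected buffer reuse): the buffer holding request A's last fill is reused for request B's first fill while A's consumer has only read part of it, which reproduces exactly the pattern shown above.

    // Purely hypothetical illustration of the suspected buffer reuse - NOT actual Jetty code.
    public class SharedBufferMixupSketch {
        public static void main(String[] args) {
            byte[] shared = new byte[16];                 // one buffer reused across requests

            // Last fill for request A writes "vwxyz" into the shared buffer.
            byte[] tailOfA = "vwxyz".getBytes();
            System.arraycopy(tailOfA, 0, shared, 0, tailOfA.length);

            // The consumer of request A has only read "vw" (2 bytes) of this fill so far...
            int consumedByA = 2;

            // ...when the first fill for request B reuses the very same buffer.
            byte[] headOfB = "12345678909876".getBytes();
            System.arraycopy(headOfB, 0, shared, 0, headOfB.length);

            // Request A's consumer now reads "the rest of its data" from the overwritten buffer.
            String seenByA = "abcdefghijklmnopqrstu" + "vw"
                    + new String(shared, consumedByA, headOfB.length - consumedByA);
            System.out.println(seenByA);                  // abcdefghijklmnopqrstuvw345678909876
        }
    }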

I can deepen the explanation and I can provide more "evidence" or details about exactly how I revealed this bug. But first of all I just want to know:
* Is this a known bug?
* Has it been fixed?

Regards, Per Steffensen

Reproducible: Always

Steps to Reproduce:
Too complex to explain
Actual Results:  
Two requests A and B sent to the same jetty at about the same time can in rare cases be mixed up, so that the servlet/filter in the receiving webapp gets an HttpServletRequest that appears to be constructed partly from the content of request A and partly from the content of request B.

Expected Results:  
Two requests A and B sent to the same jetty at about the same time should ALWAYS be handed over with correctly separated content to the servlet/filter in the receiving webapp.

Started discussing the bug on jetty-dev mailing list - see http://dev.eclipse.org/mhonarc/lists/jetty-dev/threads.html#01530
Comment 1 Per Steffensen CLA 2012-10-26 13:46:09 EDT
"Reproducible: Always" is wrong. The problem does not occur for all requests, acutally it occurs very rarely - once among thousands for requests. But I have a very complex test setup where I can reproduce it by trying a few times.
Comment 2 Per Steffensen CLA 2012-10-26 13:46:55 EDT
Well the problem seems to be in Jetty code but very close to the "border" between Jetty and JVM/OS, so I just want to mention that we run on Java HotSpot 1.6.0_26 on Ubuntu 10.04.2 LTS.

Reporting from Linux:

Linux xxxxxxx 2.6.32-29-server #58-Ubuntu SMP Fri Feb 11 21:06:51 UTC 2011 x86_64 GNU/Linux

Ubuntu 10.04.2 LTS

Version reporting from Java:

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

Regards, Per Steffensen
Comment 3 Per Steffensen CLA 2012-10-26 13:48:23 EDT
Comment by Thomas Becker:

Hi Per,

this is not a known bug that has been fixed, but looks like a severe issue to investigate. It doesn't look like a jvm issue as according to your description stuff doesn't seem to get mixed up on the wire.

Is there a specific reason that you use a blocking connector? Could you please try a SelectChannelConnector and see if the problem still occurs? That information would give us some hints about where to look first.

Can you try to create a small webapp (maybe + simple loadtest) that reproduces the issue?

Cheers,
Thomas
Comment 4 Per Steffensen CLA 2012-10-26 13:49:47 EDT
> Hi Per,

Hi

First of all, thanks for your quick response! Nice that someone really seems to care and wants to actively participate in doing something about it. I have to say that this is not the case on all OSS projects.

>
> this is not a known bug that has been fixed, but looks like a severe issue to investigate.

Agree!

> It doesn't look like a jvm issue as according to your description stuff doesn't seem to get mixed up on the wire.

That was also my conclusion, but I'm not a Jetty expert, and haven't really dived very deep into it yet, so I wasn't sure. I just quickly wanted to mention my observations and interpretation in their current state, so that I could get reactions from the Jetty community. I didn't want to go very deep into it if you already knew all about the problem and had a fix.

>
> Is there a specific reason that you use a blocking connector?

Wasn't really aware that I had taken such a decision :-) I'm just using Solr "out of the box". It ships with Jetty, including a default/easy way of starting Solr running on Jetty. I basically just use that, so I haven't done anything to configure Jetty or made any other specific decisions on how it should run.

> Could you please try a SelectChannelConnector and see if the problem still occurs?

Yes, certainly! But in order to do it quickly, I would probably like you to tell me what I should do in order to achieve this. As I mentioned, I haven't really done any configuring or anything to Jetty before. I basically just looked a little at the code and decided to modify StreamEndPoint to do a little logging for me in order to see what it receives from the wire. This way I saw that "received from wire" seems to be fine, but "handed over to Filter/Servlet" seems to be mixed up with content from other requests.

> That information would give us some hints about where to look first.
>
> Can you try to create a small webapp (maybe + simple loadtest) that reproduces the issue?

I can say that I definitely want to participate actively in narrowing in on the "bug" to have it fixed. Creating a small webapp that reproduces the problem "on your side" will certainly be a nice way forward, but I am a little in doubt whether it is the best one. We have several "unit-test"-sized tests that do a fair amount of concurrent requesting, and we never came across the problem in such setups. So maybe this problem requires "bigger" setups or something - I am not really sure about the "properties" that need to hold in order for this to occur. Maybe another way forward is that you provide me with additional logging (or something) that I could insert into the code, run the "performance/endurance" test that (sometimes) produces this problem again, and provide the logging to you to evaluate in cooperation with me. Is that a path that could fit you? If we get closer to the cause of the problem following this path, we might be able to reproduce it in a smaller ad-hoc test for quicker "turnaround" on potential fixes. Right now I am very much in doubt whether I will be able to reproduce this problem in a fairly small "test" to hand over to you.

>
> Cheers,
> Thomas

Regards, Per Steffensen
Comment 5 Per Steffensen CLA 2012-10-26 13:50:36 EDT
Comment by Thomas Becker:

Hi Per,

surely we do care about bugs in jetty. :) My schedule today is pretty tight; I will send you a longer answer. But it would be good if you could try switching the connector to SelectChannelConnector in the meanwhile. I wonder why solr is using the blocking one by default.

There should be a config file for jetty where the connectors are set up (usually jetty.xml in our standard distribution).

In there, there's a section setting up the connectors; simply switch the class in yours to "org.eclipse.jetty.server.nio.SelectChannelConnector". Then it should look similar to this one.

    <Call name="addConnector">
      <Arg>
          <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
            <Set name="host"><Property name="jetty.host" /></Set>
            <Set name="port"><Property name="jetty.port" default="8080"/></Set>
            <Set name="maxIdleTime">300000</Set>
            <Set name="Acceptors">2</Set>
            <Set name="statsOn">false</Set>
            <Set name="confidentialPort">8443</Set>
            <Set name="lowResourcesConnections">20000</Set>
            <Set name="lowResourcesMaxIdleTime">5000</Set>
          </New>
      </Arg>
    </Call>

Cheers,
Thomas
Comment 6 Per Steffensen CLA 2012-10-26 13:51:15 EDT
> Hi Per,
>
> surely we do care about bugs in jetty. :) My schedule today is pretty tight; I will send you a longer answer.

I will be (busy-)waiting

> But it would be good if you could try switching the connector to SelectChannelConnector in the meanwhile.

Will try to do that - as soon as I get a slot in my tight schedule - probably today or early tomorrow.

Regards, Per Steffensen
Comment 7 Per Steffensen CLA 2012-10-26 13:51:43 EDT
> I wonder why solr is using the blocking one by default.

Well I found a comment in the jetty.xml that ships with Solr:
   <!-- This connector is currently being used for Solr because it
         showed better performance than nio.SelectChannelConnector
         for typical Solr requests.  -->
   <Call name="addConnector">
     <Arg>
         <New class="org.eclipse.jetty.server.bio.SocketConnector">
         ... snip ...

I wouldn't trust the claim in there too much, though. But I guess it is not important for us right now, as we are just trying with nio.SelectChannelConnector in order to see if the problem still occurs. Right?

Regards, Per Steffensen
Comment 8 Per Steffensen CLA 2012-10-26 13:52:35 EDT
Comment by Thomas Becker:

Yes, please go ahead and try it for now.
Comment 9 Per Steffensen CLA 2012-10-26 13:53:17 EDT
I have made a new run of my performance/endurance test with org.eclipse.jetty.server.nio.SelectChannelConnector instead of the SocketConnector, and the problem still occurs, so I guess it is not due to the SocketConnector. Just a question: how did you know that I was running with SocketConnector instead of SelectChannelConnector?

Regards, Per Steffensen
Comment 10 Per Steffensen CLA 2012-10-26 13:53:46 EDT
Comment from Thomas Becker:

Hi Per,

that's as expected and good. So we know that it's not the layer communicating to the network causing the issue. I knew that you use a blocking connector as you told me in the first mail, when you said that you added debug information to org.eclipse.jetty.io.bio.StreamEndPoint, which is only used by the blocking connector.

Later today I'll have a closer look at this and maybe try to write up some test cases that reproduce this.

Cheers,
Thomas
Comment 11 Per Steffensen CLA 2012-10-26 13:54:24 EDT
> Hi Per,
>
> that's as expected and good. So we know that it's not the layer communicating to the network causing the issue. I knew that you use a blocking connector as you told me in the first mail, when you said that you added debug information to org.eclipse.jetty.io.bio.StreamEndPoint, which is only used by the blocking connector.
>
> Later today I'll have a closer look at this and maybe try to write up some test cases that reproduce this.

Please also consider coming up with some logging I can add before reproducing the problem. Then I can send you this logging for further analysis.

>
> Cheers,
> Thomas
Comment 12 Per Steffensen CLA 2012-10-26 13:54:51 EDT
Comment by Thomas Becker:

Hi Per,

I'm still surprised to see such a severe issue in jetty and that we've not heard of it before. Could you please describe in detail what your requests look like, including all HTTP headers and response body?
Could you please also describe where exactly you see the mixed up content of the request object, how you read it and where? Does any of your servlets/filters manipulate the request objects? Do the mixed up requests all belong to the same session?

Any way you can zip your application + jetty setup and attach it to a bugzilla? Then describe what kind of requests we should generate to reproduce the issue. If you don't want to share it with the public, please create the bugzilla anyhow and send me the zip personally. We really need to have a test case to investigate if this is a jetty issue or not.

Cheers,
Thomas
Comment 13 Per Steffensen CLA 2012-10-26 13:59:03 EDT
> Hi Per,

Hi

>
> I'm still surprised to see such a severe issue in jetty and that we've not heard of it before.

Yes me too!

> Does any of your servlets/filters manipulate the request objects?

Well, it's 100% the Solr web-app running on top of Jetty, so they are not "my" servlets/filters, they are Solr's. I am using a "somewhere before 4.0" version of Solr (4.0 was just released a few days ago, and I am in the process of merging in Apache Solr changes to my version of Solr, so that I will soon be working on a pure Apache Solr 4.0 plus my changes) plus some changes made by myself that haven't been committed to Apache Solr (yet). But to answer your question: no, not as far as I can see.

My Solr web-app includes the attached web.xml. The URLs in question do not match any of the special servlet patterns in the web.xml, so they go to the filter named SolrRequestFilter, which is implemented in the class org.apache.solr.servlet.SolrDispatchFilter - find attached the version of SolrDispatchFilter I am using. The interesting part of it is the doFilter method. It is as provided by Apache, except that I have added some "debugging" System.out lines.

> Could you please describe in detail what your requests look like, including all HTTP headers and response body?
> Could you please also describe where exactly you see the mixed up content of the request object, how you read it and where?

First of all, I don't believe the "response body" is important, so I will not mention anything about that. Please let me know if you really disagree with that?!?

My modified SolrDispatchFilter collects some information in a StringBuilder "niceInfo" and eventually writes it to System.out. The important stuff in this StringBuilder is added only a few lines into the doFilter method, where I write out all parameters of the HttpServletRequest received from Jetty - besides the parameters there is no body in the request. This addition of all params to "niceInfo" is only done if the param "SubRequestGUID" exists (which it does for the requests I'm interested in) and if the param "wt" does not exist (which is only the case for requests where I experience the problem in question (long story, but that is the way it is)). This logging is used to show that the request handed over from Jetty to the filter is "bad" in the rare cases where I experience the problem. BTW, "SubRequestGUID" is not a request parameter usually sent by Solr - it's a parameter I have added in Solr code to the request in order to be able to more easily recognize "the same request" at the multiple places where request data is collected.
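
For context, the parameter dump amounts to something like the following minimal sketch (the real code is in the attached SolrDispatchFilter.java; the class and method names here are illustrative only):

    // Minimal sketch of the described debug logging - the real code is in the attached
    // SolrDispatchFilter.java; class and method names here are illustrative only.
    import java.util.Arrays;
    import java.util.Map;
    import javax.servlet.http.HttpServletRequest;

    class RequestParamDump {
        static void dumpParams(HttpServletRequest req) {
            // Only log the requests of interest: SubRequestGUID present, wt absent.
            if (req.getParameter("SubRequestGUID") == null || req.getParameter("wt") != null) {
                return;
            }
            StringBuilder niceInfo = new StringBuilder("Params for SubRequestGUID ")
                    .append(req.getParameter("SubRequestGUID")).append(":\n");
            for (Map.Entry<String, String[]> e : req.getParameterMap().entrySet()) {
                niceInfo.append("  ").append(e.getKey()).append(" = ")
                        .append(Arrays.toString(e.getValue())).append('\n');
            }
            System.out.println(niceInfo);
        }
    }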

I have also added some "debugging" System.out lines to StreamEndPoint - find attached the version of StreamEndPoint.java I have used. You will recognize it as a copy of the 8.1.2 StreamEndPoint plus some logging added in the fill method. This StreamEndPoint.java is compiled and then added to jetty-io-8.1.2.v20120308.jar (replacing the StreamEndPoint.class it shipped with).
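
The general idea of that logging is roughly the following hypothetical sketch (not the attached change - the attachment contains the real one): log every chunk read from the underlying stream, tagged with the reading thread and the wrapper's identity, so the raw wire data can later be matched against what the filter/servlet sees.

    // Hypothetical illustration of per-fill logging - not the attached StreamEndPoint change.
    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    class LoggingInputStream extends FilterInputStream {
        LoggingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) {
                System.out.println("[" + Thread.currentThread().getName() + "] "
                        + "stream@" + Integer.toHexString(System.identityHashCode(this))
                        + " filled " + n + " bytes: " + new String(b, off, n, "ISO-8859-1"));
            }
            return n;
        }
    }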

I have added a lot more logging in other places to be pretty sure that the mixup of data is going on in Jetty. Find attached RuinedRequestFlow.png, which I used to explain the problem to my POs and the rest of my team. All the purple dots are where request data has been collected in order to see where it goes wrong, and it seems to be between dots 4 and 1 on the right side of the chart - that is, in the Jetty layer.

I have compiled and attached a file DataAndAnalysisRun30_anonymized.rtf with selected loggings. The first logging is just my confirmation that the request sent from the client side is correct - this logging is captured at the point of dot 2 on the right side of RuinedRequestFlow.png. The rest are loggings from StreamEndPoint and SolrDispatchFilter on the server side of the request. At the bottom of DataAndAnalysisRun30_anonymized.rtf you find a few conclusions/thoughts.

I believe this is pretty heavy evidence, but please tell me if you do not agree and there is a possible explanation under which this is not a bug in Jetty. Please note that all machines in my multi-machine setup are running as Xen VMs on top of XenServer on top of a single physical machine with 40GB RAM and 8 CPU cores, but the problem has also been seen on other hardware.

> Do the mixed up requests all belong to the same session?
Hmmm, I'm not sure what you mean by session. HTTP is stateless and session-less. Which kind of "sessions" do you have in mind?
>
> Any way you can zip your application + jetty setup and attach it to a bugzilla? Then describe what kind of requests we should generate to reproduce the issue. If you don't want to share it with the public, please create the bugzilla anyhow and send me the zip personally. We really need to have a test case to investigate if this is a jetty issue or not.

You should know that I work as a consultant for a company that does very very very secret stuff. I am not allowed to reveal anything about what we do. I cannot send you any of our code without having to anonymize it completely - and that is practically impossible. I can't share more with you than I can share with the public - and that is not a lot. But as you will also see from RuinedRequestFlow.png, the request that goes wrong is actually sent between two instances of Solr running on top of Jetty, and you can download the Solr source code if you want. But it is "our application" that feeds data into the Solr cluster and it is "our application" that does the search requests into the Solr cluster (our application is, so to say, the "laptop" on RuinedRequestFlow.png). Due to the high confidentiality level in the company I had to modify the loggings in DataAndAnalysisRun30_anonymized.rtf a little bit. I had to replace all occurrences of <3-char-string>_<another-3-char-string> with xxx_yyy, because the <3-char-string>_<another-3-char-string> might reveal a little about what my company is doing. Besides that, the loggings are completely untouched.

The "test" which reveals this problem is a big-distributed-system-high-load-test with (a.o.) focus on data integrety. Basically it loads 2 mio documents (that we know the exact content of) into the Solr cluster and afterwards it makes queries into Solr for those documents - traversing all documents and making sure that what is in and can be fetched from Solr is the exact 2 mio documents feed into it. After solving a few bugs in Solr this is actually now the case - this (part of the) test is now green. EXCEPT for once in a while where the test seems not to be green - and that is because Solr is not responding correctly to one (or a few) of the many request made during this traversing - and that is because of the bug in question.

It is as good as impossible for me to ever give you "this test" and all the code, automated distributed setup and configuration of the involved software (Solr, ZooKeeper, "our application" etc.). If the two of us do not get anywhere with other attempts, I will spend time (if my boss allows) trying to create a smaller "application/setup" which recreates the problem and which I can send to you. But it seems like a very big task to me, because I really do not know what I can cut away and what is important for the bug to occur. Therefore I, again, encourage you to consider my suggested approach, where you/we come up with some logging (or other stuff like thread sync'ing etc.) to add to the code (Jetty - or Solr or anything else for that matter), I run another round of my test, and I share logs and observations with you. This way we hopefully can get closer to what happens, and maybe later create a smaller application/setup where the problem can be reproduced for shorter turnaround on suggested solutions.

Well, that's it for now. A large amount of information to look into, but I have tried to cut it down to the important stuff - you don't want the GBs of logging etc. I have been traversing :-)

>
> Cheers,
> Thomas

Cheers, Per Steffensen
Comment 14 Per Steffensen CLA 2012-10-26 14:03:15 EDT
Created attachment 222886 [details]
Selected logging for test run reproducing the bug. Plus conclusions/thoughts
Comment 15 Per Steffensen CLA 2012-10-26 14:05:34 EDT
Created attachment 222887 [details]
Chart helping to understand the request flow and components involved. Plus places where request info has been sampled
Comment 16 Per Steffensen CLA 2012-10-26 14:07:00 EDT
Created attachment 222888 [details]
Solr main webapp filter - java
Comment 17 Per Steffensen CLA 2012-10-26 14:08:01 EDT
Created attachment 222889 [details]
Modified (for logging) Jetty StreamEndPoint - java
Comment 18 Per Steffensen CLA 2012-10-26 14:08:29 EDT
Created attachment 222890 [details]
Solr webapp config - web.xml
Comment 19 Per Steffensen CLA 2012-10-26 14:11:01 EDT
Comment from Thomas Becker:

Per, I've had a look at all the information you've sent me. I've been interested in the request body and not the response for sure, but you've sent that information anyhow.

Your request body is pretty big and you might hit some limit here. I've had a rough look at our code and I think if that were really the case you should see some exception. On Monday I will write a small test case trying to send and parse your request bodies and see what happens on jetty. Maybe I'll also do some concurrent tests with big load sending large request bodies.
Comment 20 Per Steffensen CLA 2012-10-26 14:21:38 EDT
> Per, I've had a look at all the information you've sent me. I've been interested in the request body and not the response for sure, but you've sent that information anyhow.
>
> Your request body is pretty big and you might hit some limit here. I've had a rough look at our code and I think if that were really the case you should see some exception.

I see no exception - at least not around this particular request when the bug occurs.

> On Monday I will write a small test case trying to send and parse your request bodies and see what happens on jetty.

Hmmm, good luck, but I don't think you will have any luck reproducing the bug. I have to mention that we have tried to modify our big data-integrity test so that it just resends the exact same request again if (it seems like) the request failed due to this bug. And we are successful on this second attempt with the exact same request (unless we hit the same bug again, but that is very very very rare). So I believe I can say that it is not the content of the request or anything. It must be something that happens under very unfortunate circumstances - I expect some kind of situation where (jetty) threads/processes overlap in very unfortunate ways.
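
In pseudo-form, the workaround in our test is essentially the following hypothetical sketch (the client interface and all names are invented for illustration):

    // Hypothetical sketch of the retry workaround described above; the client
    // interface and names are invented for illustration.
    class RetryOnSuspectedMixup {
        interface SolrClient {
            String send(String unchangedRequest) throws Exception;
        }

        static String sendWithOneRetry(SolrClient client, String request) throws Exception {
            try {
                return client.send(request);
            } catch (Exception suspectedMixup) {
                // Resend the exact same, unmodified request; in practice this second
                // attempt succeeds, so the request content itself is not the problem.
                return client.send(request);
            }
        }
    }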

> Maybe I'll also do some concurrent tests with big load sending large request bodies.

You might have more luck there, but don't get your hopes up. It is a very, very rare bug, I believe. We have several concurrent unit-test-sized tests on our side that never revealed this problem.
Comment 21 Per Steffensen CLA 2012-11-05 03:15:09 EST
We have now run about 20 runs on Jetty 8.1.7 of the big performance/data-integrity/endurance test that used to reproduce the problem on 8.1.2. We haven't seen the problem occur yet on 8.1.7 and will therefore (for now) conclude that the bug is solved in 8.1.7 (even though the probability of it occurring might just have been lowered significantly).

This also means that my team and I are not given additional time to participate in the work around this bug - reproducing it in a small unit-test-like situation, making sure that it is really solved, etc.

I don't know if the Jetty community will try to create a unit test that reproduces the problem in 8.1.2, to understand the problem, to make sure that it is really solved in 8.1.7 and to have a way of ensuring that it does not show up accidentally again later in Jetty. But I kind of hope so, even though it might be hard.

At least now you know what to tell people who might come around and complain about requests getting mixed up in 8.1.2 - upgrade to 8.1.7.

Thanks a lot for your collaboration!

Regards, Per Steffensen
Comment 22 Per Steffensen CLA 2012-11-05 03:16:25 EST
Solved in 8.1.7!?!