RE: [smila-dev] RE: FYI :: new feature :: Message Resequencer

See below:

Kind regards
Thomas Menzel @ brox IT-Solutions GmbH

> -----Original Message-----
> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-
> bounces@xxxxxxxxxxx] On Behalf Of Igor.Novakovic@xxxxxxxxxxx
> Sent: Dienstag, 6. Oktober 2009 12:02
> To: smila-dev@xxxxxxxxxxx
> Subject: AW: [smila-dev] RE: FYI :: new feature :: Message Resequencer
>
> Hi,
>
> > > This issue _does not_ occur when crawling some data source.
> > there might be rare cases where it could occur there (links to the
> > same resource, e.g. the same document referenced from 2 websites)
> Please explain this.
> My assumption is, that the referenced resource _does not_ change that
> fast, so in that case (same document referenced from 2 websites) this
> issue does not occur.

It doesn't have to change fast, just at the 'right' time. Example: let there be resources A and B that both reference the same resource R. A and B are part of a source in which changes may occur while it is being crawled. The processing is set up such that R is processed as a result of processing A and/or B, but is added as a distinct item with its own ID, which is the same regardless of the referrer. If
- A and B are processed by different processing branches (also pipeline instances) that have different lengths
and
- R is changed between the processing of R as a child of A and the processing of R as a child of B, but before either process commits to the index,
then that case also occurs for crawlers.

But remember: The case is rare. It is a technicality.
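
To make the timing concrete, here is a minimal Python sketch of the A/B/R example. It is only a simulation under assumptions: the `Job` class, the function name, and all times are made up for illustration and are not SMILA code. Each branch reads R at some point, processes for a while, and then commits; the last commit wins.

```python
from dataclasses import dataclass


@dataclass
class Job:
    referrer: str    # resource that led to R (A or B)
    read_at: float   # time at which the branch fetched R
    duration: float  # processing time of this branch


def final_index_content(jobs, versions):
    """Return the version of R left in the index after all jobs commit.

    versions: list of (change_time, content), sorted by time.
    """
    def content_at(t):
        current = None
        for changed_at, content in versions:
            if changed_at <= t:
                current = content
        return current

    # Each job commits what it read, at read_at + duration.
    commits = sorted(
        (job.read_at + job.duration, content_at(job.read_at)) for job in jobs
    )
    return commits[-1][1]  # the last commit overwrites earlier ones


# Hypothetical timeline: R changes at t=1.5, between the two reads.
versions = [(0.0, "R-v1"), (1.5, "R-v2")]
jobs = [
    Job("A", read_at=1.0, duration=5.0),  # long branch, reads v1
    Job("B", read_at=2.0, duration=1.0),  # short branch, reads v2
]
result = final_index_content(jobs, versions)
print(result)
```

With these numbers the short branch commits the new version first and the long branch then overwrites it with the old one, so the index ends up stale. If both branches had the same length, the newer version would be committed last.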

>
> > > (Crawling is the most common use case.)
> > not sure if crawling really is the most common case.
> > In the past I usually integrated our former product more in the agent
> > style
> Ok. I did not know that.
>
>
> > > This issue only occurs if the data source has been monitored by an
> > > agent _and_ the user is doing "ADD" (or "UPDATE") and subsequently
> > > almost instantly an "DELETE" operation on the document (data set
> that is represented later as a record in SMILA).
> > a) doesn't only affect ADD/DEL ops but also ADD/ADD as pointed out
> Yes. But the _delta_ is in ADD/ADD case very small. In worst case you
> process a slightly different version of the document. This is something
> that the user may live with.
> By executing ADD/DEL in reverse order the result is totally
> unacceptable: Instead of removing the document from (index, storages
> etc.) the document will be still there.

Sure, ADD/DEL is worse than ADD/ADD, but as you say: "... that the user may live with".
What if the customer doesn't want to or, even more importantly, what if the use case is such that it is unacceptable?

> > > b) > Instantly
> > a. Highly depends on the setup
> Please explain this.
>
>
> > b. generalized: as long as the change on the same resource occurs
> > within the time period a previous change event is being processed.
> Yes. But how long does the processing take?
> For me is anything that lasts more than 0.5s to long.
> Any event processing that takes less than 0.5s is for me "instantly".
> Or do you assume that the user can make _significant_ changes on the
> document in less than a half of a second?

Maybe not a user. But what if the event is not triggered by a user but by a system? What if the resource is an aggregated result built in real time from different input parameters? If any of these change, then the aggregated result changes, and that might happen in rapid succession!

> > > The chances that this happens are very low.
> > Highly depends on the setup and where u get the data from and the
> frequency that this data changes.
> As said above: Please give me some realistic example.
>
>
> > > agreed that it should be addressed in the connectivity module by a
> component called "Buffer".
> > As I pointed out, this solution is
> > a) not safe
> > b) might not meet the application needs/requirements
> As I stated in my mail, let's first define the needs/requirements and
> then discuss the technical solutions and their pros and cons.
>
>
> > > Since this issue occurs very rare, it can be generally rated as
> "low".
> > Well, with the assumptions you have made, yes. If those assumptions
> fail: it is not low IMO.
> > As I said: It depends on the use case. Here are some (think agent):
> Please do not oversee the fact that we want to buffer user operations
> in order to:
> a) Do not execute superfluous operations and thereby lower the load on
> our application
> b) Make sure that the _order of consuming messages from queue does not
> matter_ and therefore
>       * make high scalability possible
>       * completely avoid the use case that we are discussing now
>
>
> > Use cases:
> > * a wiki that I used by many users concurrently:
> >   o Here it can happen fairly frequently that the same page is saved
> twice in fast succession.
> >   At least it happens to me that after saving I notice a typo or add
> a quick note, resulting on another "save".
> Sure. That is exactly what I meant with a non significant change!
> This is also one of the reasons why we should not do instant processing
> after _every_ user action.
>
>
> > * web application
> >  o in order to ease the DB load,  the search is the primary means to
> access the data,
> >    especially those where a complex SQL query would be crafted.
> >  o To have always an accurate result, a minimum time diff. between
> >   resource change and index update is required.
> Are you sure?
> AFAIK the DB load (produced by a web app) is being reduced with some
> caching technique - not with the search.

a) Off-loading the DB into a search index to reduce DB load is a valid technique, especially where the systems span different funding responsibilities. E.g. a service provider may allow you to access his system to get information, but only to a certain extent, such that the provider doesn't have to increase the server capacity to meet your wants.
b) A cache cannot always meet the search needs.
For example, when you use a fault-tolerant search or pattern matching, the DB might not be capable of doing that at all, or only at extremely high processing cost.

Before I answer all your questions (and I feel rather silly about that): I think the main difference between our two approaches is that you are looking for a solution that works 90% of the time, while I have been thinking about a solution that works 100% of the time (at least I hope so), hence the elaborate wiki page. I drafted the idea without a concrete application or scenario in mind. But I have been thinking about, and have already mentioned, the use cases that have to be covered if 100% is what you want to achieve. If you don't need that, you can at least take the cases and see where your solution falls short or makes compromises, and whether that works in your scenario.

IMO it makes little sense to say that the buffer is the one thing that will do in most cases and is therefore the only thing we are going to do, because in my experience a project will sooner or later have a requirement to cover an aspect that a compromise solution like the buffer does not.

Of course, the buffer is simpler and will cover most cases, and I think it's worth having (e.g. even in addition to my resequencer, to remove PRs in close succession). But there are also significant drawbacks, which I have mentioned in the wiki about the buffer and to which you haven't given any answer.

Since the buffer idea is not mine, I also haven't tried to match it against all the use cases/requirements that I have listed for a search index. But some of its shortcomings are:

a) the buffer cannot guarantee correct order because
-- you have to guess the execution time and assume that all PRs have the same one -- or rather, you must take the longest processing time for all of them, unless you provide means to configure different ones based on some condition.
-- processing times may change due to server load, whatever its cause
b) it at least doubles the processing time for it to work. To increase certainty you will need to multiply this by a safety factor.
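
Points a) and b) can be illustrated with a small Python sketch. It rests on my own simplified model of the buffer, which is an assumption and not the actual design: every operation on a record is held back for a fixed `hold` time, a newer operation on the same record that arrives during the hold replaces the held one, and released operations are then processed in parallel. The function name and all numbers are hypothetical.

```python
def commit_order(events, hold, processing_time):
    """Return the operations on one record in the order they commit.

    events: list of (arrival_time, op) in arrival order.
    hold: guessed buffer delay applied to every operation.
    processing_time: dict mapping op to its actual processing time.
    """
    released = []  # (release_time, op)
    for arrival, op in events:
        if released and arrival < released[-1][0]:
            # Previous op is still in the buffer: the newer one replaces it.
            released[-1] = (arrival + hold, op)
        else:
            released.append((arrival + hold, op))
    # Released ops run in parallel and commit when processing ends.
    commits = sorted(
        (release + processing_time[op], op) for release, op in released
    )
    return [op for _, op in commits]


# Hypothetical numbers: the ADD takes much longer than the DELETE.
events = [(0.0, "ADD"), (3.0, "DELETE")]
actual = {"ADD": 5.0, "DELETE": 0.5}

too_short = commit_order(events, hold=2.0, processing_time=actual)
long_enough = commit_order(events, hold=6.0, processing_time=actual)
print(too_short)
print(long_enough)
```

In this model, a hold time shorter than the real processing time of the ADD lets the DELETE commit first, so the document stays in the index. Only a hold time at least as long as the longest processing time merges the two operations, and then every operation waits for that full hold time before its own processing even starts.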

Further problems:

# Split records
We have A and B items in an index and they are related N:M.
The application's needs require that you fetch the related item during processing, whether you have an A or a B item at hand, and process it as well.

Since the split records are created after connectivity, this case is not covered.

# processing done event (new requirement)
Currently we have no means to tell whether processing has finished for a given ID/resource.

Therefore, an application cannot know whether all items have been processed and committed to the index. The SRS can offer such an event and could possibly be extended so that it can be queried for what is still being processed.
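
As a rough idea of what such a "processing done" event could look like, here is a minimal Python sketch. The class `InFlightTracker` and its methods are purely hypothetical and not an existing SMILA or SRS API. It counts the operations in flight per ID, fires a callback when the count for an ID drops to zero, and can be queried for what is still being processed.

```python
from collections import Counter


class InFlightTracker:
    """Hypothetical helper that tracks unfinished operations per record ID."""

    def __init__(self, on_done):
        self._pending = Counter()
        self._on_done = on_done  # called with the ID once nothing is pending

    def started(self, record_id):
        self._pending[record_id] += 1

    def finished(self, record_id):
        if self._pending[record_id] == 0:
            raise ValueError(f"no pending operation for {record_id!r}")
        self._pending[record_id] -= 1
        if self._pending[record_id] == 0:
            del self._pending[record_id]
            self._on_done(record_id)  # the "processing done" event

    def still_processing(self):
        return sorted(self._pending)


done = []
tracker = InFlightTracker(on_done=done.append)
tracker.started("doc-1")
tracker.started("doc-1")  # second change to the same resource
tracker.started("doc-2")
tracker.finished("doc-1")
pending_midway = tracker.still_processing()
tracker.finished("doc-1")
print(done, tracker.still_processing())
```

The event for "doc-1" only fires after both of its operations have finished, which is the property an application needs in order to know that the index is up to date for that ID.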

>
> Regards
> Igor
>