RE: [smila-dev] RE: FYI :: new feature :: Message Resequencer

> BTW: I would be very happy if other team members would join this IMO
> important discussion. Guys, please participate!

Sorry, this discussion started during my vacation, and I'm currently buried in other work,
so I had some problems catching up. I still do not think I completely understand
the solution proposed in the wiki page. But while reading it, a different (but probably similar)
solution came to my mind that could probably work without extending APIs or
setting up additional queues:

- An agent/crawler could set two attributes (or annotations?) with the same value that somehow 
identifies the event, e.g. the last-modified-timestamp for documents from a file system,
or the document version for documents coming from some real CMS. Or even just a string composed
from an agent/crawler-UUID plus some simple counter value. If the data source delivers document
metadata that can be used for this, it's just configuration. For other data sources, an agent/crawler
would have to generate something.
- One of these attributes is written by the router to the record in the queue, the other one
must only be stored in record storage. Again, this is just configuration.
- Then a simple pipelet at the start of a pipeline can filter out those records for which these
attribute values are not equal (invalidate record on blackboard and do not return its ID in the
pipelet result): If the values are not equal, it must be because another event has been generated 
for this document which has changed the "version attribute" in the record storage, but not in the 
currently processed event. So the current event is obsolete and can be discarded.
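
To make the filtering step above concrete, here is a minimal sketch in plain Java. It does
not use the real pipelet or blackboard APIs: QueuedEvent, ObsoleteEventFilter and the map
standing in for record storage are hypothetical names chosen only for illustration. Treating
a record with no stored version as obsolete is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the SMILA pipelet API. */
public class ObsoleteEventFilter {

    /** Hypothetical stand-in for a queue message: record ID plus the version value copied by the router. */
    public static final class QueuedEvent {
        public final String recordId;
        public final String queuedVersion;

        public QueuedEvent(String recordId, String queuedVersion) {
            this.recordId = recordId;
            this.queuedVersion = queuedVersion;
        }
    }

    /**
     * Returns the IDs of the events that are still current.
     * storedVersions stands in for record storage: record ID -> latest "version attribute".
     */
    public static List<String> filter(List<QueuedEvent> events, Map<String, String> storedVersions) {
        List<String> currentIds = new ArrayList<>();
        for (QueuedEvent event : events) {
            String stored = storedVersions.get(event.recordId);
            // Equal values: no newer event has touched the record since this one was queued.
            // Assumption: a missing stored version is treated like a mismatch.
            if (stored != null && stored.equals(event.queuedVersion)) {
                currentIds.add(event.recordId);
            }
            // Otherwise the event is obsolete and its ID is simply not returned.
        }
        return currentIds;
    }
}
```

In a real pipelet the "not returned" branch would also invalidate the record on the blackboard,
as described above; that call is left out here because it depends on the actual API.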

Yes, this solution only works when a record storage is active, but all other solutions
need some additional storage, such as additional queues, too. If one is concerned about the
resource requirements, it would even be possible to create a simple record storage
implementation that stores only the document ID and "version attribute" in a small database
table, and to send all other document metadata in the queue message.
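
As a rough sketch of how small such a storage could be, the following plain Java class keeps
only the ID and version value per document. The class name VersionOnlyStore and its methods
are hypothetical, and an in-memory map stands in for the small database table; this is not an
implementation of the actual record storage interface.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only: keeps nothing but document ID -> latest "version attribute". */
public class VersionOnlyStore {

    // Assumption: an in-memory map replaces the two-column database table.
    private final Map<String, String> versions = new ConcurrentHashMap<>();

    /** Called when an agent/crawler generates an event: remember the newest version value. */
    public void recordEvent(String documentId, String version) {
        Objects.requireNonNull(documentId, "documentId");
        Objects.requireNonNull(version, "version");
        versions.put(documentId, version);
    }

    /** True if the version carried in a queue message is still the latest one stored. */
    public boolean isCurrent(String documentId, String queuedVersion) {
        Objects.requireNonNull(documentId, "documentId");
        return queuedVersion != null && queuedVersion.equals(versions.get(documentId));
    }
}
```

All remaining document metadata would then travel in the queue message itself, so the store
never has to hold more than two short values per document.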

What do you think about this?
I may not be able to always answer immediately in this discussion, but eventually, I will (-;