RE: [smila-dev] RE: FYI :: new feature :: Message Resequencer

Hi Juergen,

as you stated, this is pretty much the smart resequencer solution as far as I can see.

As you propose, the agent/crawler needs to add an attribute to the record that can be used to determine the sequence. I currently propose a simple counter (sequence number, SN) that is added as an annotation.
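
A minimal sketch of such a counter, in Python purely for illustration (SMILA itself is Java). The names `SequenceNumberSource`, `annotate`, and the `sequenceNumber` annotation key are made up for this example and are not SMILA API; a record is modelled as a plain dictionary:

```python
import itertools
import threading


class SequenceNumberSource:
    """Hands out strictly increasing sequence numbers (SN), one per event."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()  # several crawler threads may share one source

    def next_sn(self):
        with self._lock:
            return next(self._counter)


def annotate(record, sn_source):
    """Return a copy of the record with the next SN stored as an annotation."""
    annotated = dict(record)
    annotations = dict(annotated.get("annotations", {}))
    annotations["sequenceNumber"] = sn_source.next_sn()  # hypothetical key name
    annotated["annotations"] = annotations
    return annotated


source = SequenceNumberSource()
first = annotate({"id": "doc-1"}, source)
second = annotate({"id": "doc-1"}, source)
```

Two events for the same document then carry different SNs, and the later event always has the larger one.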

What I didn't quite get was:
> - One of these attributes is written by the router to the record in the queue; the other one
> must only be stored in record storage. It's just configuration.

And

> - Then a simple pipelet at the start of a pipeline can filter out those records for which these
> attribute values are not equal (invalidate record on blackboard and do not return its ID in the
> pipelet result):

Please explain this further: what exactly do you mean, and how is it going to work?

Thanks for your input.

Kind regards
Thomas Menzel @ brox IT-Solutions GmbH


> -----Original Message-----
> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Juergen.Schumacher@xxxxxxxxxxx
> Sent: Wednesday, 7 October 2009 12:24
> To: smila-dev@xxxxxxxxxxx
> Subject: RE: [smila-dev] RE: FYI :: new feature :: Message Resequencer
> 
> Hi,
> 
> > BTW: I would be very happy if other team members would join this IMO
> > important discussion. Guys, please participate!
> 
> Sorry, this discussion started during my vacation, and I'm currently buried in other work,
> so I had some problems catching up. I still do not think that I completely understand
> the solution proposed in the wiki page. But while reading it, a different (but probably similar)
> solution came to mind that could probably work without extending APIs or
> setting up additional queues:
> 
> - An agent/crawler could set two attributes (or annotations?) with the same value that somehow
> identifies the event, e.g. the last-modified timestamp for documents from a file system,
> or the document version for documents coming from some real CMS. Or even just a string composed
> of an agent/crawler UUID plus some simple counter value. If the data source delivers document
> metadata that can be used for this, it's just configuration. For other data sources, an
> agent/crawler would have to generate something.
> - One of these attributes is written by the router to the record in the queue; the other one
> must only be stored in record storage. It's just configuration.
> - Then a simple pipelet at the start of a pipeline can filter out those records for which these
> attribute values are not equal (invalidate the record on the blackboard and do not return its ID
> in the pipelet result). If the values are not equal, it must be because another event has been
> generated for this document, which has changed the "version attribute" in the record storage but
> not in the currently processed event. So the current event is obsolete and can be discarded.
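
The filtering step in the last bullet could be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not SMILA code: record storage is modelled as a dictionary from record ID to "version attribute", a queue message as a small dictionary, and `submit` and `filter_obsolete` are hypothetical names.

```python
storage = {}  # stands in for record storage: record ID -> version attribute


def submit(record_id, version):
    """Simulate one event: overwrite the stored version, return the queue message."""
    storage[record_id] = version
    return {"id": record_id, "version": version}


def filter_obsolete(messages, storage):
    """Return the IDs of messages whose version still matches record storage.

    A mismatch means a later event has already overwritten the stored value,
    so the message being processed is obsolete and its ID is not returned.
    A message whose ID is missing from storage is dropped as well.
    """
    current_ids = []
    for message in messages:
        if storage.get(message["id"]) == message["version"]:
            current_ids.append(message["id"])
    return current_ids


# Two events for doc-1 arrive in quick succession; only the second is current.
old_event = submit("doc-1", "2009-10-07T12:00")
new_event = submit("doc-1", "2009-10-07T12:05")
other_event = submit("doc-2", "v7")
```

Here `filter_obsolete([old_event], storage)` returns an empty list, because the stored version of `doc-1` has moved on, while `new_event` and `other_event` pass the filter.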
> 
> Yes, this solution only works when a record storage is active, but all other solutions
> need some additional storage, like additional queues, too. It would even be possible to create
> a simple record storage implementation that only stores document ID and "version attribute" in
> a small database table and sends all other document metadata in the queue message, if one is
> concerned about the resource requirements.
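
Such a reduced storage could look something like the sketch below, which uses Python's built-in `sqlite3` module only to keep the example self-contained. The class name `VersionStore`, the table name `record_version`, and the column names are assumptions made for illustration; nothing here is an existing SMILA interface.

```python
import sqlite3


class VersionStore:
    """Keeps only (record ID, version attribute) pairs in one small table."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS record_version ("
            "record_id TEXT PRIMARY KEY, version TEXT NOT NULL)"
        )

    def put(self, record_id, version):
        """Insert the version for a record, or overwrite the previous one."""
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO record_version (record_id, version) "
                "VALUES (?, ?)",
                (record_id, version),
            )

    def get(self, record_id):
        """Return the stored version, or None if the record is unknown."""
        row = self._db.execute(
            "SELECT version FROM record_version WHERE record_id = ?",
            (record_id,),
        ).fetchone()
        return row[0] if row else None
```

All other document metadata would travel in the queue message; the table is only consulted to decide whether an event is still current.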
> 
> What do you think about this?
> I may not always be able to answer immediately in this discussion, but eventually I will (-;
> 
> Cheers,
> Juergen.

