Hi,
thanks for your comments. I will try to answer your concerns. I will also
update the wiki according to this discussion to reflect the status quo, because
I learned quite a bit after I posted the initial draft and this hasn’t been
worked in yet.
> use case
The problem with the current design is as follows:
a) We put all messages from an agent into the same queue, regardless of whether
they are ADDs or DELETEs. A queue, as the name suggests, delivers messages to a
consumer in sequence. However, since we have more than one consumer (our
listeners), we cannot tell in what order the messages are consumed, especially
since we use selectors as well. (The JMS API spec makes no stipulation in that
regard, because a queue is normally used for point-to-point connections and not
1:N. I can't find the link for it at the moment, but if I do, I will post it.)
b) The listeners feed the messages into their respective pipelines. These
pipelines have different processing times, e.g. processing an ADD record usually
takes longer than processing a DELETE.
c) Imagine we have an agent that produces
   1. ADD
   2. DELETE
   for the same resource R.
The likely result: since the ADD takes longer than the DELETE, the DELETE is
likely to be executed first, and the ADD then adds the resource to the index
when in fact it has been deleted in the source.
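To make the race in c) concrete, here is a minimal Java sketch. It is only a model under stated assumptions: the names (RaceDemo, Msg, applyInCompletionOrder) and the timings are made up for illustration and are not SMILA code. It assumes each message is picked up immediately by its own pipeline and reaches the index when that pipeline finishes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the race described above; not SMILA code. */
final class RaceDemo {

    /** One message as sent by the agent, with an assumed pipeline duration. */
    static final class Msg {
        final String op;          // "ADD" or "DELETE"
        final String resource;    // id of the resource, e.g. "R"
        final long sentAt;        // time the agent sent it (ms)
        final long processingMs;  // assumed time spent in the pipeline (ms)

        Msg(String op, String resource, long sentAt, long processingMs) {
            this.op = op;
            this.resource = resource;
            this.sentAt = sentAt;
            this.processingMs = processingMs;
        }

        long completedAt() {
            return sentAt + processingMs;
        }
    }

    /** Applies the messages to the index in the order their pipelines finish. */
    static Set<String> applyInCompletionOrder(List<Msg> sent) {
        List<Msg> byCompletion = new ArrayList<>(sent);
        byCompletion.sort(Comparator.comparingLong(Msg::completedAt));
        Set<String> index = new HashSet<>();
        for (Msg m : byCompletion) {
            if (m.op.equals("ADD")) {
                index.add(m.resource);
            } else {
                index.remove(m.resource);
            }
        }
        return index;
    }
}
```

With an ADD for R sent at t=0 that takes 50 ms and a DELETE for R sent at t=1 that takes 5 ms, the DELETE reaches the index first and the returned index still contains R, although the source no longer has it.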
> problem with doing that in the buffer:
I really cannot see how we can solve it there in a non-messy way. Maybe I'm just
too thick and you have to enlighten me on the subject, or you just have a really
cool idea. If so, please post it and I will stop what I'm doing.
The only way I can see this working is if we remember the IDs/hashes of all the
messages we sent out and then intercept the processing of those messages that
get updated before they are committed to the index. Such a scheme implies the
following mechanics:
a) Connectivity needs to be informed of the committal of R to the index, to know
that processing of that message has completed.
b) Connectivity needs to know where and how to intercept processing of the
message.
BTW: in a clustered scenario this will be quite a challenge, as you will have to
duplicate some of the processing flow in the processing logic as well. The
timeliness of the interception and the added overhead will be quite something to
sort out!
I just thought of this additional scenario: what if we have more than one
processing target, e.g. different indexes?
· where we have one index to hold the result of a certain processing while
  another holds a different processing result, e.g. more advanced processing
· or we just use 2 different index technologies, wanting to compare them or to
  have a fallback strategy
In these cases we would need to track several processing branches and states in
connectivity.
> non cluster due to memory map
Yes, that is right, but I never said anything about this map being solely in
RAM. I just used the word Map without tying it to a kind of implementation. As
we have it in DI, we can use a persistent or a RAM-only map, whatever our setup
warrants.
But then, I don’t think this will be much of a problem in a cluster scenario
anyhow: the core idea behind my suggestion is that the resequencer (rseq)
reorders the messages just before they are given to the indexing service (or any
other processing target), which can then process the messages in proper order.
In most cases there will be just one node hosting the index service, because it
seldom makes sense to have different nodes feeding the same index:
· often this is not supported by the technology
· it incurs overhead as the requests need to be synchronized
But even if it were so: with the current scheme we could still put the
resequencer in front of the services, putting the records in proper order into a
queue from which the services then take them in random order.
To elaborate more on the mechanics:
a) The agent adds a seq# to all records, which defines the total order of the
   records produced by this agent for a given data source. (This is the change I
   was referring to.)
b) The router sends all messages
   a. to queue Q0, and
   b. with a 2nd send task to another queue, Qrseq_notify.
c) The processing pipelines listen on Q0 as usual. They
   a. do their work but without adding the records to the index, and
   b. send the result to Qrseq_in.
d) The rseq is the sole listener on
   a. Qrseq_notify. All records arriving here are put into the map (actually
      only the hashes), to know what was sent out by the router and is in
      processing. The message records themselves are discarded.
   b. Qrseq_in. All messages in this queue have completed processing and can be
      put into the index, regardless of whether this is a DEL or an ADD
      operation. (Note: DELs can usually be sent here directly.)
      i. On arrival of a record the rseq checks in its map whether that hash is
         present with a higher sequence # than the current record.
         1. yes: discard the operation that was to be performed on the current
            record. (Additionally we could implement some logic here that
            actively removes entries with that very hash from the message
            queues. However, that is just a means to save processing time and
            is not a must.)
         2. no: execute the operation
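The check in step d) can be sketched in Java roughly as follows. This is an illustration under assumptions, not the actual implementation: the class and method names (Resequencer, Op, onNotify, onProcessed) are hypothetical, the map is a plain in-memory ConcurrentHashMap, and the processing target is passed in as a callback so that it is not tied to one index.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch of the rseq check; all names are illustrative. */
final class Resequencer {

    /** A processed operation arriving on Qrseq_in. */
    static final class Op {
        final String recordHash;  // id/hash of the record
        final long seq;           // seq# assigned by the agent
        final String type;        // "ADD" or "DELETE"

        Op(String recordHash, long seq, String type) {
            this.recordHash = recordHash;
            this.seq = seq;
            this.type = type;
        }
    }

    // Highest seq# announced so far per record hash (fed from Qrseq_notify).
    private final Map<String, Long> latestSeq = new ConcurrentHashMap<>();
    // The processing target, e.g. something that writes to an index.
    private final Consumer<Op> target;

    Resequencer(Consumer<Op> target) {
        this.target = target;
    }

    /** "notify": remember that the router sent out (hash, seq#). */
    void onNotify(String recordHash, long seq) {
        latestSeq.merge(recordHash, seq, Long::max);
    }

    /**
     * "process": execute the operation unless a newer message for the same
     * record is already known. Returns true if the operation was executed.
     */
    boolean onProcessed(Op op) {
        Long latest = latestSeq.get(op.recordHash);
        if (latest != null && latest > op.seq) {
            return false;  // superseded by a later message: discard
        }
        target.accept(op);
        return true;
    }
}
```

For the ADD (seq# 1) and DELETE (seq# 2) on R from the use case: after both notifications the map holds R with seq# 2, so the quick DELETE is executed when it arrives, and the slow ADD that arrives afterwards is discarded. Entries are never removed in this sketch; cleaning up stale hashes is a separate concern.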
Some more notes:
· When the agent is modified to produce the seq#, we have no problem and can
  always reproduce the order in which the records were sent. The change should
  be very simple to make. If we don’t want to change the agent
  interface/implementation, we could accomplish the same thing by putting a
  sequencer into Q0 that adds the seq#, but then we cannot use buffering
  anymore.
· The map of the rseq needs to be cleared of stale hashes, e.g. for those
  messages that end up in the dead letter queue, by configuring a timeout that
  is never exceeded by the processing time.
· The rseq is implemented as a service and is called with 2 different
  invocations, one with "notify" and the other with "process". Hence it is
  called from pipelines, to fit into the SMILA landscape of things.
· The approach is cluster safe and independent of the processing target.
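The timeout-based cleanup of stale hashes mentioned in the notes could look roughly like this. Again a sketch under assumptions: ExpiringSeqMap and its methods are hypothetical names, the map lives in memory only, and the clock is injected so that the timeout can be tested without waiting. In a real setup the timeout would have to be larger than the longest expected processing time.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical hash -> seq# map that forgets entries after a timeout. */
final class ExpiringSeqMap {

    /** A seq# together with the time it was stored. */
    private static final class Stamped {
        final long seq;
        final long storedAt;

        Stamped(long seq, long storedAt) {
            this.seq = seq;
            this.storedAt = storedAt;
        }
    }

    private final Map<String, Stamped> entries = new ConcurrentHashMap<>();
    private final long timeoutMillis;
    private final LongSupplier clock;  // e.g. System::currentTimeMillis

    ExpiringSeqMap(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
    }

    /** Stores the seq# for a hash, keeping the higher one if already present. */
    void put(String hash, long seq) {
        Stamped fresh = new Stamped(seq, clock.getAsLong());
        entries.merge(hash, fresh, (old, candidate) -> candidate.seq >= old.seq ? candidate : old);
    }

    /** Returns the stored seq# for a hash, or null if none is known. */
    Long latestSeq(String hash) {
        Stamped s = entries.get(hash);
        return s == null ? null : s.seq;
    }

    /** Removes all entries older than the timeout; returns how many were removed. */
    int evictStale() {
        long now = clock.getAsLong();
        int removed = 0;
        for (Iterator<Map.Entry<String, Stamped>> it = entries.entrySet().iterator(); it.hasNext();) {
            if (now - it.next().getValue().storedAt > timeoutMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

evictStale() would be called periodically, e.g. from a timer. A message that ended up in the dead letter queue never arrives for "process", so without this cleanup its hash would stay in the map forever.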
> not so easy to config
Yes, but I don’t have a better solution for this yet. We just need to educate
people on this, or make more specific solutions that are built into the current
components, which is not so good IMO.
May I quote:
There is always
an easy solution to every human problem—neat, plausible, and wrong.
http://www.bartleby.com/73/1736.html
Kind regards
Thomas Menzel @ brox IT-Solutions GmbH
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Igor.Novakovic@xxxxxxxxxxx
Sent: Thursday, 24 September 2009 19:14
To: smila-dev@xxxxxxxxxxx
Subject: AW: [smila-dev] Message Resequencer :: change to Agent Interface
Hi Tom,
I share Daniel’s opinion on both issues.
Before you start programming (I see that you’ve already opened a dedicated
branch in the repository for the resequencer), let’s please discuss the problem
and do some conceptual work.
BTW: What is the use case that you’re trying to cover with your resequencer?
Regards
Igor
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Daniel.Stucky@xxxxxxxxxxx
Sent: Thursday, 24 September 2009 17:31
To: smila-dev@xxxxxxxxxxx
Subject: AW: [smila-dev] Message Resequencer :: change to Agent Interface
Hi Tom,
I see two drawbacks of your proposed solution(s):
1) It will only work on one machine. In a distributed environment it is not
guaranteed that a Resequencer will get all the relevant messages concerning one
Record. And as each Resequencer has its own map of Ids and sequence numbers, two
competing operations would not be recognized as such. The map has to be shared
across all Resequencer instances (e.g. by using another Queue, or a database).
The initial idea of the Buffer component in Connectivity was to filter out and
resolve competing operations before they enter the “system”, that is, before
they are processed. Of course this Buffer would also have to share its internal
state across all instances. (At the moment an Agent/Crawler is bound to one
instance of Connectivity, so this distribution is not relevant yet.) In either
case, the processing is left totally untouched by introducing the Buffer
component, which leads me to my second issue:
2) The workflow has to be adapted to architecture changes. In order to benefit
from the Resequencer business logic, workflows have to be designed in a special
way (first do some processing, then store the processed data). I think this is
hard for users to grasp. BTW: would the actual storing be configurable? I mean,
will the Resequencer execute a BPEL pipeline, or is the LuceneIndexing
hardcoded? The latter is of course not a valid scenario; we have to be flexible
in this regard, as users may want to store their data in arbitrary
stores/indexes/whatsoever.
Perhaps you could elaborate on your concerns with our initial Buffer idea?
Bye,
Daniel
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Tuesday, 22 September 2009 07:28
To: Smila project developer mailing list
Subject: RE: [smila-dev] Message Resequencer :: change to Agent Interface
oops,
http://wiki.eclipse.org/SMILA/Specifications/ProcessingMessageResequencer
PS: All the while I was writing this draft I had this mail open, but I still
managed to forget to add the link.
Kind regards
Thomas Menzel @ brox IT-Solutions GmbH
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Igor.Novakovic@xxxxxxxxxxx
Sent: Monday, 21 September 2009 19:04
To: smila-dev@xxxxxxxxxxx
Subject: AW: [smila-dev] Message Resequencer :: change to Agent Interface
Hi Thomas,
Could you please provide us with the link to your specification draft?
Cheers
Igor
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Monday, 21 September 2009 18:28
To: Smila project developer mailing list
Subject: [smila-dev] Message Resequencer :: change to Agent Interface
Hi,
I wrote a specification draft for this change. Please feel free to comment.
In order to implement this I will need to change the interface of the agent.
Kind regards
Thomas Menzel @ brox IT-Solutions GmbH
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Monday, 21 September 2009 14:00
To: Smila project developer mailing list
Subject: [smila-dev] FYI :: new feature :: Message Resequencer
Hi folks,
I just wanted to let you know that I will be working on making sure that
messages don’t get out of sync when there are changes in close succession.
This change will be tracked through the bug https://bugs.eclipse.org/bugs/show_bug.cgi?id=289995
Kind regards
Thomas Menzel @ brox IT-Solutions GmbH