
Re: AW: [smila-dev] Controlling Tasks Order Concept

Hi guys,

Just had a long discussion with Marius by Skype and want to summarize.

There may be two types of solution, based on one key statement.
This statement can be briefly summarized in one question:

When a Record object is passed into the "Processor", does it contain the complete Record data, or may the data be partial?

Partial data is best explained with the following example.

Two agents collect data from database tables for one Record:

table [person] (id, name) - trigger on update linked with Agent A
table [person_address] (id, person_id, address) - trigger on update linked with Agent B

Agents A and B collect table changes and send them to processing; both of them collect data for the same object, "Person". In this case each Record contains only partial data for the Person.
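To make this concrete, here is a minimal sketch (plain Java; the Record class below is a simplified stand-in, not the real SMILA Record API):

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a SMILA Record: just an ID plus attributes.
final class Record {
    final String id;
    final Map<String, Object> attributes = new HashMap<>();

    Record(String id) { this.id = id; }
}

public class PartialRecordExample {
    public static void main(String[] args) {
        // Agent A fires on an update of [person] and only knows the name.
        Record fromAgentA = new Record("person:42");
        fromAgentA.attributes.put("name", "John Doe");

        // Agent B fires on an update of [person_address] and only knows the address.
        Record fromAgentB = new Record("person:42");
        fromAgentB.attributes.put("address", "1 Main Street");

        // Both records share the same ID, but each carries only a fragment of
        // the complete "Person" object; neither can simply replace the other.
        System.out.println(fromAgentA.attributes); // {name=John Doe}
        System.out.println(fromAgentB.attributes); // {address=1 Main Street}
    }
}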

I'm not sure that support for partial records is required.

If it's not required, and a Record always contains complete data, then it is possible to use a timestamp for rejecting old records.
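A minimal sketch of such a timestamp check (the filter class and method names are made up, not existing SMILA code):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Rejects Records that are older than the newest one already seen for the
// same ID. This only works if every Record carries the *complete* data.
public class TimestampFilter {
    private final ConcurrentMap<String, Long> lastSeen = new ConcurrentHashMap<>();

    /** Returns true if the record should be processed, false if it is stale. */
    public boolean accept(String recordId, long timestamp) {
        while (true) {
            Long previous = lastSeen.get(recordId);
            if (previous == null) {
                // First record for this ID wins the race to register itself.
                if (lastSeen.putIfAbsent(recordId, timestamp) == null) return true;
            } else if (timestamp <= previous) {
                return false; // an equal or newer version was already accepted
            } else if (lastSeen.replace(recordId, previous, timestamp)) {
                return true;
            }
            // CAS failed because of a concurrent update; retry.
        }
    }
}

A Listener would call accept(...) with the Record's crawl timestamp before processing and simply drop the Record if it returns false.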

Otherwise, records for one ID have to be processed synchronously, one by one. Organizing locks for synchronous one-by-one processing will be a performance blocker and may cause deadlocks on Records. And, imho, almost all benefits of asynchronous MQ processing will be lost.
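For comparison, forced one-by-one processing per ID would look roughly like the following sketch (plain Java, not real Listener code; class and method names are made up):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Forces one-by-one processing per record ID. Every Listener thread that
// picks a Record with the same ID blocks here until the previous one is
// done, which serializes exactly the work the Queue was meant to parallelize.
public class PerIdSerializer {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public void process(String recordId, Runnable step) {
        ReentrantLock lock = locks.computeIfAbsent(recordId, id -> new ReentrantLock());
        lock.lock();
        try {
            step.run(); // one processing step; the Record may re-enter the Queue afterwards
        } finally {
            lock.unlock();
        }
    }
}

Note that the locks map also grows without bound, and if a processing step ever needs a second Record's lock, two threads can acquire the locks in opposite order and deadlock, which is exactly the risk mentioned above.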

Any ideas, opinions?

--
Regards, Ivan

Ivan Churkin wrote:
Hi Folks,

Many thanks Daniel, Allan and Marius for feedbacks.

I will try to explain the problem in detail.

DeltaIndexingManager blocks concurrent data-source usage, but that does not solve the problem.

Basically, the problem relates to the cooperation of two main modules.

The first of them is
"Record Producer" = "Crawler" + "Crawler Controller" + "Delta Indexing"

The second is
"Record Processor"  = "Router" + "Listener" + "BPEL engine"

"Producer" blocks concurrent usage of data-source by delta-index, so it is synchronous relating data-source IMO, this blocking works only for Crawlers, but it should be changed when Agents will be added.
A good sample of Agent is database trigger. It's not good to blocks it.

"Processor" is absolutely asynchronous. Basically, it works with some big Record dump. It process records by configured Rules. Processing time may be quite long and it may consist of many steps, when Record put again and again in Queue after each operation.

Even in Crawler-only mode, a situation can easily occur where the "Producer" synchronously crawls the data source twice before the "Processor" has started processing the Records. After that, different Listener threads may catch Records with the same ID from the queue (from different crawls), and they will try to process them asynchronously.

BTW: after the second crawl, the Record in the Blackboard cache will be replaced by the latest one, but in the queue two processes will have been started for it. And I cannot imagine what may happen in the end :(.
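To illustrate, here is a toy simulation of that interleaving (plain Java, not SMILA code): two crawls enqueue a Record with the same ID, and two Listener threads take and process them concurrently.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simulates the described race: two crawls enqueue a Record with the same
// ID, and two Listener threads pick them up and process them concurrently.
public class DoubleCrawlRace {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.put("person:42 (crawl 1)");
        queue.put("person:42 (crawl 2)"); // second crawl finished before processing began

        Runnable listener = () -> {
            try {
                String record = queue.take();
                // Both threads now work on the same logical Record at the same
                // time; whichever finishes last wins, and intermediate states mix.
                System.out.println(Thread.currentThread().getName() + " processing " + record);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread t1 = new Thread(listener, "listener-1");
        Thread t2 = new Thread(listener, "listener-2");
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}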


Regarding the Buffer and adding support and checks of a processing status for each ID: that forces synchronization by ID.
I'm afraid it may hurt performance and create deadlock problems.


--
Regards, Ivan


Daniel.Stucky@xxxxxxxxxxx wrote:
Hi Ivan,

in the existing concepts the so-called Buffer of the ConnectivityManager (see http://wiki.eclipse.org/SMILA/Project_Concepts/Connectivity#Buffer_.28P2.29) was meant to deal with these problems.

Some more thoughts:
- do we really want to allow concurrent usage of agents and crawlers on the same data source? If so, we also have to adapt the current usage of DeltaIndexingManager, as it blocks concurrent usage.

- I agree that there are scenarios where race conditions occur, but I also claim that these are special cases that do not happen all the time. So in my eyes the standard use case has to be optimized for performance, while these special cases have to be optimized for robustness. The handling of these special cases should have no (or as little as possible) impact on the standard cases.

- asynchronous processing of different records is OK, asynchronous processing of the same record is NOT OK (it may lead to corrupt data)

- this is a highly complex functionality, and I think we have to discuss it in greater detail. We should list the use cases and how we expect SMILA to handle them. Then we can discuss a technical solution.

- I also think that we need some mechanism to identify that the processing of a record has finished, either successfully or not (it may then be moved to a dead-letter queue). E.g. it may be needed if events should be triggered after processing (see the sketch below).
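A minimal sketch of what such a completion check could look like (class and callback names are just placeholders, not an existing SMILA API):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;

// Tracks how many processing steps are still pending per record and fires a
// callback once the count drops to zero (success) or a step reports failure.
// Calls to stepStarted/stepSucceeded must be balanced per step.
public class CompletionTracker {
    private final ConcurrentMap<String, Integer> pending = new ConcurrentHashMap<>();
    private final Consumer<String> onFinished; // e.g. trigger follow-up events
    private final Consumer<String> onFailed;   // e.g. move to a dead-letter queue

    public CompletionTracker(Consumer<String> onFinished, Consumer<String> onFailed) {
        this.onFinished = onFinished;
        this.onFailed = onFailed;
    }

    public void stepStarted(String recordId) {
        pending.merge(recordId, 1, Integer::sum);
    }

    public void stepSucceeded(String recordId) {
        // merge() removes the entry when the remapping function returns null
        Integer left = pending.merge(recordId, -1, (a, b) -> a + b == 0 ? null : a + b);
        if (left == null) onFinished.accept(recordId);
    }

    public void stepFailed(String recordId) {
        pending.remove(recordId);
        onFailed.accept(recordId);
    }
}

The Router could call stepStarted() each time it re-queues a Record and the Listener stepSucceeded() after each finished step; when the pending count reaches zero, the record is done and follow-up events can be triggered.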

Bye,
Daniel


-----Original Message-----
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-
bounces@xxxxxxxxxxx] On Behalf Of Ivan Churkin
Sent: Thursday, October 9, 2008 15:15
To: Smila project developer mailing list
Subject: Re: [smila-dev] Controlling Tasks Order Concept

Hey guys,

Give me some feedback, please ;)
This is a very significant architecture problem...
Right now the problem is not visible because we only start one Crawler manually.
It will become very pressing when Agents are added.

The page contains only my ideas for a solution. Unfortunately,
documenting every case will cost time.
If my explanations were not good and complete documentation is
required, please let me know.
