Re: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09

Hi folks,

thank you, Daniel and Ivan, for an interesting exchange of your experience with the Crawler API. We have now identified the important points we have to keep in mind while we discuss this crawler API. I fully agree that features like SCA should stay supported, although not every user will need them. Could you imagine a solution that makes the crawler API easier and that lets a user employ the SCA technology, or skip it if he doesn't need it?

Next, in my opinion it is our common goal to maintain or improve the performance of the crawler API. I think the performance of the current crawler API could be one of the innovations of SMILA, so nobody wants to change that for the worse. My first thought when I read the previous mail was about converting a simple crawler structure into the current structure. I mean: a crawler developer could implement a new crawler in an easy way, and the crawler controller would still receive the information in a way that supports the same features and performance as today.

I think Ivan's suggestion is a possible way in this direction. If this is possible, we could have an easy structure for crawler developers and could still use the "Daniel_Crawler_interface" with the same or better performance and all features.

What do you all think about this approach? Is it possible, or are there problems we haven't discussed or focused on yet? Feel free to share your opinion with us.

Allan

-----Original Message-----
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On behalf of Ivan Churkin
Sent: Wednesday, 24 September 2008 11:40
To: Smila project developer mailing list
Subject: Re: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09

Hi folks,

Daniel.Stucky@xxxxxxxxxxx wrote:
> Hi all,
>
> I think the Crawler API cannot be viewed on its own. Dependencies to other components and workflows have to be taken into account, too. So let's take a look at the goals Ivan mentioned, as I think they are valid:
>
> - Simplicity: I agree that the current API cannot be called simple. Any improvements that further this are desirable. But I think we should always sacrifice Simplicity in favor of Effectiveness.
>   
Fortunately, effectiveness will even be higher, because the HASH will be
calculated automatically on the crawler side, in the communication RI layer,
and there is no necessity to send the whole data through SCA to the controller.
From the SCA point of view it will still be the old Crawler API.
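
To illustrate the idea (a minimal sketch only; the class and method names are my assumptions, not existing SMILA API), the RI layer could derive the HASH from the crawled attributes like this:

import java.security.MessageDigest;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical helper inside the communication RI layer. */
final class HashSupport {

  /**
   * Computes a hex SHA-1 digest over the crawled attributes, so a
   * crawler implementation never has to deal with HASH itself.
   * A SortedMap is used so the digest does not depend on insertion order.
   */
  static String hashOf(SortedMap<String, Object> attributes) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    for (Map.Entry<String, Object> entry : attributes.entrySet()) {
      digest.update(entry.getKey().getBytes("UTF-8"));
      digest.update(String.valueOf(entry.getValue()).getBytes("UTF-8"));
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b)); // unsigned two-digit hex per byte
    }
    return hex.toString();
  }
}
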
> - Independence: The Crawler concept was designed with SCA in mind, because it offers useful functionality. First, SCA offers us the possibility of wiring CrawlerController and Crawlers using different technologies (e.g. RMI, Corba, etc.) and the potential use of technologies other than Java to implement Crawlers. In addition there is the concept of "Conversations" (like sessions) that allows us to host multiple conversations from CrawlerController to Crawler (crawling multiple DataSources in parallel) without having to implement that ourselves or worry (too much) about multithreaded access (of course some things need to be taken care of).
> I don't know if it's possible to provide an SCA-Crawler wrapper for concrete Java Crawler implementations. The AbstractCrawler class was a first step in that direction, but SCA annotations are still needed in the implementation classes. If time allows I will check whether there are additional possibilities to allow a complete disjunction of SCA and Crawler logic.
>   
The SCA dependency will live in the communication RI layer; not a single
feature of SCA will be lost.
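
As a sketch of what that layering could look like with the SCA 1.0 Java annotations (org.osoa.sca.annotations; the interface and method names below are made up for illustration): only the write-once RI endpoint carries SCA metadata, while concrete crawler implementations stay plain Java objects.

import org.osoa.sca.annotations.Conversational;
import org.osoa.sca.annotations.EndsConversation;
import org.osoa.sca.annotations.Remotable;

/**
 * SCA-facing service interface, implemented once by the RI layer.
 * Conversations still give us parallel crawls of multiple data
 * sources, exactly as with the current API.
 */
@Remotable
@Conversational
interface CrawlerEndpoint {
  void startCrawl(String dataSourceId);         // opens a conversation
  String[][] nextIdAndHashBlock(int maxCount);  // block operation: {id, hash} pairs
  @EndsConversation
  void endCrawl();                              // closes the conversation
}
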
> - Effectiveness: I agree that the handling and creation of ID and HASH is ineffective. My first proposal was to generate these objects inside each Crawler, but I was outvoted :-)
> But the current API is effective concerning performance, especially in conjunction with remote Crawlers and DeltaIndexing. Remember that the goal of SMILA is to process millions of documents. DeltaIndexing works best if as little data as possible is transferred between Crawler and CrawlerController. The idea was to send the ID and the HASH of an object (where ID and HASH are created inside the Crawler). The complete object is only transferred if DeltaIndexing allows it (it is a new or a changed object). If a Crawler runs remotely, then the number of method calls becomes important, too. So there was the idea of block operations (getting the ID and HASH for multiple objects with one method call, using an array). I did some performance tests; please see http://wiki.eclipse.org/SMILA/Project_Concepts/IRM#Performance_Evaluation for details.
>
> I don't see how to make efficient use of DeltaIndexing with Ivan's API proposal. The method next() always returns a complete DataSourceReference object, containing all the data. If DataSourceReference is intended to be only a reference to the real data (like a proxy), then the problem of way too many method calls gets even worse, as a separate method call is required for each Attribute/Attachment.
>
>   
"DataSourceReference" is only a reference, it contains no data it's 
something like URL.
"DataSourceReference" processed completely on crawler side before 
communicating by SCA.
The suggested new Crawler interfaces may also be written as one interface:

interface Crawler {
  void start(IndexOrderConfiguration config); // begin crawling the configured data source
  boolean next();                             // advance to the next object; false when finished
  Object getAttribute(String name);           // attribute of the current object
  byte[] getAttachment(String name);          // raw content of the current object
  void finish();                              // release resources
}

Here next() returns only a boolean, and the "DataSourceReference" will be hidden inside the Crawler implementation.
It's absolutely the same idea (but I prefer the split into a pair of interfaces).
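
To make the control flow concrete, here is a minimal consumer loop as it could run inside the RI layer on top of this merged interface (the attribute names "Path" and "Content" are only examples, not fixed names):

/** Sketch: driving the merged interface from the RI layer. */
void crawl(Crawler crawler, IndexOrderConfiguration config) throws Exception {
  crawler.start(config);
  try {
    while (crawler.next()) {
      Object path = crawler.getAttribute("Path"); // cheap metadata first, e.g. for ID creation
      // the RI layer computes ID/HASH here and asks DeltaIndexing;
      // only for new or changed objects is the large payload pulled:
      byte[] content = crawler.getAttachment("Content");
      // ... build the Record and hand it over to the CrawlerController ...
    }
  } finally {
    crawler.finish(); // always release the crawler's resources
  }
}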


> As Allan said, "Looking forward to an interesting discussion" :-)
>
>   
Maybe the problem is just a misunderstanding about interface names?

Let's prefix them with person names here in the discussion :)

Daniel_Crawler_interface - requires a complete implementation for each
crawler

Ivan_Communication_RI_interface == Daniel_Crawler_interface

Ivan_Communication_RI_interface - requires only a reference
implementation (written once)
Ivan_Crawler_interface - requires a complete implementation for each
crawler, but it's significantly simpler.

I only suggested extracting a class, written once, for
instantiating/working with DIData, Record, Hash, etc. and for the communication.
This work/coding is absolutely identical for each crawler implementation.
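
To show what "written once" means, here is a rough sketch of that RI class adapting the simple crawler interface to the block-oriented behavior of the current API (all names besides Crawler are made up for illustration; the real code would use the actual SMILA types):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Write-once adapter: simple Crawler in, block operations out. */
class CommunicationRI {

  private final Crawler crawler;         // simple per-crawler implementation
  private final String[] attributeNames; // taken from the index order configuration

  CommunicationRI(Crawler crawler, String[] attributeNames) {
    this.crawler = crawler;
    this.attributeNames = attributeNames;
  }

  /**
   * Returns up to maxCount {id, hash} pairs with a single call,
   * preserving the block-operation performance of the current API.
   * An empty list signals that the crawl is finished.
   */
  List<String[]> nextBlock(int maxCount) throws Exception {
    List<String[]> block = new ArrayList<String[]>();
    while (block.size() < maxCount && crawler.next()) {
      TreeMap<String, Object> attributes = new TreeMap<String, Object>();
      for (String name : attributeNames) {
        attributes.put(name, crawler.getAttribute(name));
      }
      String id = String.valueOf(attributes.get("Path")); // example ID attribute
      String hash = HashSupport.hashOf(attributes);       // computed crawler-side (see sketch above)
      block.add(new String[] { id, hash });
    }
    return block;
  }
}

A new crawler then only implements start/next/getAttribute/getAttachment/finish, and DeltaIndexing and the block transfer stay untouched.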

> Bye,
> Daniel
>
>
>   
>> -----Original Message-----
>> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-
>> bounces@xxxxxxxxxxx] On behalf of Allan Kaufmann
>> Sent: Tuesday, 23 September 2008 10:06
>> To: Smila project developer mailing list
>> Subject: Re: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09
>>
>> Hi Ivan,
>>
>> thanks for your contribution. I agree with you completely - it would be
>> a very nice solution if a crawler developer didn't need any knowledge
>> of MObject, Record, etc.
>>
>> Your interface suggestion looks very easy to understand - I think it
>> would be great if this structure were possible. What's the opinion
>> of the other SMILA developers? Are there any disadvantages to this
>> idea?
>>
>> Looking forward to an interesting discussion.
>>
>> Allan
>>
>> -----Original Message-----
>> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-
>> bounces@xxxxxxxxxxx] On behalf of Ivan Churkin
>> Sent: Monday, 22 September 2008 11:09
>> To: Smila project developer mailing list
>> Subject: Re: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09
>>
>> Hi Allan,
>>
>> Thank you for your response on the crawler API discussion
>> (http://wiki.eclipse.org/SMILA/Specifications/CrawlerAPIDiscussion09).
>> This very important question had been in a frozen state.
>>
>> In my opinion, a crawler developer should know nothing about SMILA's
>> inner objects and transports (MObject, Record, Delta Indexing, SCA,
>> etc.). He should implement only a simple and understandable
>> data-source iterator.
>>
>> An approximate interface:
>>
>> interface Crawler {
>>  void start(IndexOrderConfiguration config); // begin crawling
>>  DataSourceReference next();                 // next object (presumably null when finished)
>>  void finish();                              // release resources
>> }
>> interface DataSourceReference {
>>  Object getAttribute(String name);  // attribute of the referenced object
>>  byte[] getAttachment(String name); // raw content of the referenced object
>> }
>>
>>
>> I will be glad to hear and to discuss other ideas and opinions.
>>
>> --
>>
>> Ivan
>>
>>
>>
>> Allan Kaufmann wrote:
>>> Hi people,
>>>
>>> I have read the interesting discussion about the crawler API
>>> (http://wiki.eclipse.org/SMILA/Specifications/CrawlerAPIDiscussion09).
>>> In my opinion it's currently not easy to understand the crawler API,
>>> but I believe ease of use should be a target if you want users and
>>> developers who like this project. I looked at the filesystem-crawler
>>> sample in the current SMILA trunk and needed a lot of time to
>>> understand it. So what about keeping the crawler API simple, as
>>> discussed on that page?
>>> I think a nice way to make it easier is to reduce the MObject and
>>> Record creation, maybe delivering all information together to the
>>> CrawlerController in an ArrayList. OK, I know you probably need
>>> communication between CrawlerController and crawler to make delta
>>> indexing possible. So what about the second alternative, in which
>>> getNextDeltaIndexing returns a Record? In that case the
>>> CrawlerController receives the information for ID and hash. Then, if
>>> the information has changed, the getRecord method delivers the other
>>> attributes, also as a Record, and the CrawlerController could merge
>>> them. I think that would be easier to understand, but the other
>>> alternatives discussed on that page are also worth discussing or
>>> deciding about.
>>>
>>> Greetings
>>>
>>> Allan
>>>
>>> Allan Kaufmann
>>>
>>> brox IT-Solutions GmbH
>>> An der Breiten Wiese 9
>>> 30625 HANNOVER (Germany)
>>> Tel: +49 (5 11) 33 65 28 - 67
>>> eFax: +49 (5 11) 33 65 28 - 98 78
>>> Fax: +49 (5 11) 33 65 28 - 29
>>> Mail: akaufmann@xxxxxxx
>>> Web: www.brox.de
>>>

--
Regards, Ivan
_______________________________________________
smila-dev mailing list
smila-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/smila-dev

