RE: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09

Hi all,

I think the Crawler API cannot be viewed on its own. Dependencies on other components and workflows have to be taken into account, too. So let's take a look at the goals Ivan mentioned, as I think they are valid:

- Simplicity: I agree that the current API cannot be called simple. Any improvements that further this are desirable. But where the two conflict, I think we should always sacrifice Simplicity in favor of Effectiveness.

- Independence: The Crawler concept was designed with SCA in mind, because it offers useful functionality. First, SCA gives us the possibility of wiring CrawlerController and Crawlers using different technologies (e.g. RMI, CORBA, etc.) and the potential to implement Crawlers in technologies other than Java. In addition, there is the concept of "Conversations" (like sessions) that allows us to host multiple conversations from CrawlerController to Crawler (crawling multiple DataSources in parallel) without having to implement this ourselves or worry (too much) about multithreaded access (of course, some things still need to be taken care of).
I don't know if it's possible to provide an SCA-Crawler wrapper for concrete Java Crawler implementations. The AbstractCrawler class was a first step in that direction, but SCA annotations are still needed in the implementation classes. If time allows, I will check whether there are additional possibilities to allow a complete separation of SCA and Crawler logic.
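To illustrate the SCA side, a conversational crawler service could be declared roughly like this (just a sketch: the annotations are from the OSOA SCA Java API, while the interface and method names are placeholders, not our actual classes):

import org.osoa.sca.annotations.Conversational;
import org.osoa.sca.annotations.EndsConversation;
import org.osoa.sca.annotations.Remotable;

// Each conversation corresponds to one crawl of one DataSource, so the
// CrawlerController can hold conversations to several Crawlers in parallel.
@Remotable
@Conversational
interface CrawlerService {
  void startCrawl(String dataSourceId); // opens the conversation
  @EndsConversation
  void finishCrawl();                   // closes the conversation
}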

- Effectiveness: I agree that the handling and creation of ID and HASH is ineffective. My first proposal was to generate these objects inside each Crawler, but I was outvoted :-)
But the current API is effective concerning performance, especially in conjunction with remote Crawlers and DeltaIndexing. Remember that the goal of SMILA is to process millions of documents. DeltaIndexing works best if as little data as possible is transferred between Crawler and CrawlerController. The idea was to send only the ID and the HASH of an object (where ID and HASH are created inside the Crawler). The complete object is transferred only if DeltaIndexing allows it (i.e. it is a new or a changed object). If a Crawler runs remotely, then the number of method calls becomes important, too. So there was the idea of block operations (getting the ID and HASH for multiple objects with one method call, using an array). I did some performance tests; please see http://wiki.eclipse.org/SMILA/Project_Concepts/IRM#Performance_Evaluation for details.
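For illustration, such block operations could look roughly like this (the names are invented for this sketch, not taken from the actual SMILA interfaces):

// one remote call fetches ID and HASH for a whole block of objects
DeltaIndexingData[] getNextDeltaIndexingData(int maxCount);
// full Records are then requested only for the new or changed IDs
Record[] getRecords(Id[] changedIds);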

I don't see how to make efficient use of DeltaIndexing with Ivan's API proposal. The method next() always returns a complete DataSourceReference object, containing all the data. If DataSourceReference is instead intended to be only a reference to the real data (like a proxy), then the problem of too many method calls gets even worse, as a separate method call is required for each Attribute/Attachment.
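To make the difference concrete, the block-oriented flow keeps the data transfer small, roughly like this (pseudo-Java continuing the sketch above; deltaIndexing stands for the CrawlerController's delta indexing component):

// 1 remote call per block of N objects, transferring only IDs and HASHes
DeltaIndexingData[] block = crawler.getNextDeltaIndexingData(N);
// the ID/HASH check happens locally in the CrawlerController
Id[] changed = deltaIndexing.filterChanged(block);
// 1 more remote call, transferring full data only for new/changed objects
Record[] records = crawler.getRecords(changed);

With next(), by contrast, the complete data of every object crosses the wire on every call, whether DeltaIndexing needs it or not.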


As Allan said, "Looking forward to an interesting discussion" :-)

Bye,
Daniel


> -----Original Message-----
> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx]
> On Behalf Of Allan Kaufmann
> Sent: Tuesday, September 23, 2008 10:06
> To: Smila project developer mailing list
> Subject: RE: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09
> 
> Hi Ivan,
> 
> thanks for your contribution. I agree with you completely - it would be
> a very nice solution if a crawler developer didn't have to know
> about MObject, Record, etc.
> 
> Your interface suggestion looks very easy to understand - I think it
> would be great if this structure were possible. What's the opinion
> of the other SMILA developers? Are there any disadvantages to this
> idea?
> 
> Looking forward to an interesting discussion.
> 
> Allan
> 
> -----Original Message-----
> From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx]
> On Behalf Of Ivan Churkin
> Sent: Monday, September 22, 2008 11:09
> To: Smila project developer mailing list
> Subject: Re: [smila-dev] SMILA/Specifications/CrawlerAPIDiscussion09
> 
> Hi Allan,
> 
> Thank you for the response on the crawler API discussion
> (http://wiki.eclipse.org/SMILA/Specifications/CrawlerAPIDiscussion09).
> This very important question has been in a frozen state.
> 
> In my opinion, a crawler developer should know nothing about SMILA's inner
> objects and transports (MObject, Record, Delta Indexing, SCA, etc.).
> He should only implement a simple and understandable data-source
> iterator.
> 
> Approx. interface:
> 
> interface Crawler {
>   void start(IndexOrderConfiguration config);
>   DataSourceReference next();
>   void finish();
> }
>
> interface DataSourceReference {
>   Object getAttribute(String name);
>   byte[] getAttachment(String name);
> }
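> A crawl loop against this interface could then be as simple as the
> following (assuming next() returns null when the data source is
> exhausted; the attribute and attachment names are only examples):
>
> crawler.start(config);
> DataSourceReference ref;
> while ((ref = crawler.next()) != null) {
>   Object path = ref.getAttribute("Path");
>   byte[] content = ref.getAttachment("Content");
>   // SMILA builds Records, IDs, HASHes, etc. from this internally
> }
> crawler.finish();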
> 
> 
> I will be glad to hear and to discuss other ideas and opinions.
> 
> --
> 
> Ivan
> 
> 
> 
> Allan Kaufmann wrote:
> >
> > Hi all,
> >
> > I have read this interesting discussion about the crawler API
> > (http://wiki.eclipse.org/SMILA/Specifications/CrawlerAPIDiscussion09).
> >
> > In my opinion it's currently not easy to understand the crawler API,
> > but I believe it should be a goal if you want users and developers
> > who like this project. I looked at the filesystem-crawler sample in
> > the current SMILA trunk and needed much time to understand it.
> >
> > So what about keeping the crawler API simple, as discussed on that page?
> >
> > I think a nice way to make it easier is to reduce the MObject and
> > Record creation, maybe delivering all information together to the
> > CrawlerController in an ArrayList. OK, I know you probably need
> > communication between CrawlerController and Crawler to make delta
> > indexing possible. So what about the second alternative, in which
> > getNextDeltaIndexing returns a Record? In that case the
> > CrawlerController receives the ID and HASH information. Then, if the
> > information has changed, the getRecord method delivers the other
> > attributes, also as a Record, and the CrawlerController could merge
> > them. I think that would be easier to understand, but the other
> > alternatives discussed on that page are also worth discussing or
> > deciding about.
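> > As a rough sketch of that alternative (only getNextDeltaIndexing and
> > getRecord appear in the wiki discussion; the comments are my reading
> > of it):
> >
> > interface Crawler {
> >   // returns a Record containing only ID and HASH for delta indexing
> >   Record getNextDeltaIndexing();
> >   // called only for new/changed objects; returns a Record with the
> >   // remaining attributes, which the CrawlerController merges
> >   Record getRecord();
> > }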
> >
> > Greetings
> >
> > Allan
> >
> > Allan Kaufmann
> >
> > brox IT-Solutions GmbH
> > An der Breiten Wiese 9
> > 30625 HANNOVER (Germany)
> > Tel: +49 (5 11) 33 65 28 - 67
> > eFax: +49 (5 11) 33 65 28 - 98 78
> > Fax: +49 (5 11) 33 65 28 - 29
> > Mail: akaufmann@xxxxxxx
> > Web: www.brox.de
> >
> > ==================================
> > According to Section 80 of the German Corporation Act brox
> > IT-Solutions GmbH must indicate the following information.
> > Address: An der Breiten Wiese 9, 30625 Hannover Germany
> > General Manager: Hans-Chr. Brockmann
> > Registered Office: Hannover, Commercial Register Hannover HRB 59240
> > ========== Legal Disclaimer ==========
> >
> >
> > _______________________________________________
> > smila-dev mailing list
> > smila-dev@xxxxxxxxxxx
> > https://dev.eclipse.org/mailman/listinfo/smila-dev
> >
> 
> _______________________________________________
> smila-dev mailing list
> smila-dev@xxxxxxxxxxx
> https://dev.eclipse.org/mailman/listinfo/smila-dev

