Re: [smila-user] performance degradation with the new processing

As I said, just disable the BinStorage and you should be fine. Then attachments stay in memory on the blackboard
and will be discarded after the task is finished.
Yes, the current plan is that attachments are just stored within the record bulks in the ObjectStore and so they
will be removed together with the bulk when it is processed. How they are handled by the blackboard then is 
not fully decided: Very large attachments will probably have to be written to a local temp directory during 
processing. Or something like that.

Cheers,
Juergen.

-----Original Message-----
From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Wednesday, September 28, 2011 3:35 PM
To: Smila project user mailing list
Subject: Re: [smila-user] performance degradation with the new processing

Hi,

Scratch that thought about an in-memory binary store. I just noticed that this is a service and hence will retain everything -> OOM death certain.
This could only work with a pipelet that erases everything from the blackboard at the end of the pipeline.
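
For illustration, such a cleanup pipelet could look roughly like this. This is just an untested sketch: the Pipelet/Blackboard method names (especially removeRecord) are assumptions from memory, so check the actual interfaces before using it.

  // Hypothetical pipelet that drops all records (and thus their attachments) from
  // the blackboard at the end of a pipeline. Untested sketch; removeRecord() is an
  // assumption about the Blackboard API, verify it against your SMILA version.
  package com.example.pipelets; // hypothetical package

  import org.eclipse.smila.blackboard.Blackboard;
  import org.eclipse.smila.datamodel.AnyMap;
  import org.eclipse.smila.processing.Pipelet;
  import org.eclipse.smila.processing.ProcessingException;

  public class CleanupBlackboardPipelet implements Pipelet {

    @Override
    public void configure(final AnyMap configuration) throws ProcessingException {
      // nothing to configure
    }

    @Override
    public String[] process(final Blackboard blackboard, final String[] recordIds)
        throws ProcessingException {
      for (final String id : recordIds) {
        try {
          blackboard.removeRecord(id); // assumed to discard the record incl. attachments
        } catch (final Exception e) {
          throw new ProcessingException("Could not remove record " + id, e);
        }
      }
      return new String[0]; // nothing is passed on after cleanup
    }
  }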

I'm wondering whether it's worth pursuing this solution at all. If I remember correctly, the idea was to replace the binary store with the object store. Is that still correct?

Thomas Menzel @ brox IT-Solutions GmbH


-----Original Message-----
From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Wednesday, September 28, 2011 1:21 PM
To: Smila project user mailing list
Subject: Re: [smila-user] performance degradation with the new processing

Hi,

We don't use attachments in our pipeline, i.e. we don't read the binary content with the FS crawler; it is read directly from the filesystem as needed (known through the file's path).

But following your hint about the ObjectStore, which isn't the culprit since it stores nothing, I noticed that the binary store now contains the binary content of the files, because we add it as attachments to the records in the pipeline.
Before, we had used transient blackboards (sync = false) that wouldn't write to the binary store; now it does. Looking at the code, I see that org.eclipse.smila.processing.worker.ProcessingWorker.getBlackboard() controls which blackboard type is used, and there is even a member flag _tryToUseBinStorage, but it seems it cannot be set via configuration.

I guess for now I could write a simple in-memory BinStore that keeps the content for the duration of the pipeline -- and I think that is fairly safe with regard to OOMs.
But I think we should be able to configure on a pipeline/workflow basis which service instances are actually used -- or is that already possible somehow?
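
Just to sketch the idea (this is not the real BinaryStorageService interface; the method names are only illustrative), such a store would basically be a map from attachment id to bytes:

  // Illustrative in-memory binary store: keeps attachment bytes in a map for the
  // duration of processing. Method names are made up for the sketch and do not
  // claim to match org.eclipse.smila.binarystorage.
  package com.example.binstorage; // hypothetical package

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class InMemoryBinaryStore {

    private final Map<String, byte[]> _blobs = new ConcurrentHashMap<String, byte[]>();

    public void store(final String id, final byte[] blob) {
      _blobs.put(id, blob);
    }

    public byte[] fetch(final String id) {
      return _blobs.get(id);
    }

    public void remove(final String id) {
      // must be called when the task is finished, otherwise this is the OOM trap
      // mentioned above
      _blobs.remove(id);
    }
  }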

I guess I could also refactor the pipeline into workers using buckets/storages, but I think others will have this issue too when migrating to 0.9, and if I understood it right, object stores are behind the buckets and they are not an optimal implementation either, right?

Any other ideas on the subject?

Thomas Menzel @ brox IT-Solutions GmbH

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Jürgen Schumacher
Sent: Wednesday, September 28, 2011 11:46 AM
To: Smila project user mailing list
Subject: Re: [smila-user] performance degradation with the new processing

Hi,

well, for one thing, we have not done any performance tests or even optimizations yet with pure SMILA setups.
In our own applications we have a different implementation, especially of the ObjectStore service, and the implementation in SMILA is currently quite simple, "to make it work". I suppose that could be improved.
Then, using attachments with the BinaryStorage quite probably slows everything down. We are planning to change this so that attachments are included in the record bulks, but I don't know when we will get to implement this. If you don't have binary documents to process, you should try to put the crawled pages into attributes instead of attachments.
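
For example, in a pipelet that would mean putting the page content into the record metadata instead of attaching it, roughly like this (the datamodel calls are written from memory, so please verify them against your SMILA version):

  // Sketch: store crawled page content as a metadata attribute so it travels inside
  // the record bulk, instead of as an attachment that goes to BinaryStorage.
  // The attribute name "Content" and the exact AnyMap/Record calls are assumptions.
  import org.eclipse.smila.datamodel.AnyMap;
  import org.eclipse.smila.datamodel.Record;

  public final class ContentAsAttribute {

    public static void setContent(final Record record, final String pageContent) {
      final AnyMap metadata = record.getMetadata();
      metadata.put("Content", pageContent); // attribute: serialized with the record bulk
      // instead of:
      // record.setAttachment("Content", pageContent.getBytes()); // would end up in BinaryStorage
    }
  }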

On configuration: You can have a look at http://<smila-host>:8080/smila/debug which should show the scale-up limits for each worker. The relevant part looks like this on my machine:

  "workerManager" : {
    "workers" : [
      …, {
      "name" : "pipeletProcessor",
      "runAlways" : false,
      "scaleUpLimit" : 4,
      "scaleUpCurrent" : 0
    }, {
      "name" : "pipelineProcessor",
      "runAlways" : false,
      "scaleUpLimit" : 4,
      "scaleUpCurrent" : 0
    } ],

If the scaleUpLimits are 1, you have to change something in configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json,
or the workers will not process multiple tasks in parallel. 
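
I don't have the file in front of me, but the relevant part should look roughly like this (apart from maxScaleUp, the key names and nesting are from memory, so please check the clusterconfig documentation for your version):

  {
    "taskmanager" : {
      "maxScaleUp" : 4
    },
    "workers" : [ {
      "name" : "pipelineProcessor",
      "maxScaleUp" : 4
    }, {
      "name" : "pipeletProcessor",
      "maxScaleUp" : 4
    } ]
  }
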
Finally, if the scaleUpCurrent values do not get bigger than 1 (or are even 0 for longer periods) although the limits allow it, it is possible that the crawling is too slow and bulk processing is faster than bulk creation (which may of course also be caused by the objectstore.filesystem implementation). You can also check this on http://<smila-host>:8080/smila/tasks: if there are no tasks "todo" while crawling, the workers will just sit waiting and scaling up will not help.

Hope this helps a bit.
Note that I will not be in the office much for the next two days, so I will probably not be able to help more before Friday.

Cheers,
Juergen



From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Wednesday, September 28, 2011 11:02 AM
To: SMILA USERS
Subject: [smila-user] performance degradation with the new processing

Hi folks,

I have done a little performance test with the new processing and Solr, and it seems to be slower than before (from 31 minutes to 50 minutes for a subset of the German Wikipedia).

It seems that scaling isn't doing the trick as it is supposed to (on one machine only) -- or, very likely, I don't know how to configure it.

The setup is as follows:
Just one box with a quad core and 4 GB RAM.
I used our (brox) standard AddPipeline, which does some Aperture-like conversion of documents (not really needed in this case, but always good to include in a test) and then puts them into the Solr index, which is configured the same as in SMILA (except that I had to switch the date field to string instead of date due to our open bug). So I pretty much used SMILA's default setup as described in "5 minutes to success".

The rest of the configuration is the same as it was before the processing change, except that the queue worker etc. is not present anymore, plus the associated mandatory changes.

Now, maxScaleUp was 4 for the 1st run and 6 for the 2nd. The 2nd run was even 2 minutes slower, although that can be neglected and could be due to having started it together with the crawl worker!?

At the same time, CPU utilization was rather low, only 30-40%.

Any hints? Or does the new processing just incur more overhead, but pays off when you also scale out?

Thomas Menzel @ brox IT-Solutions GmbH

