Hi,
well … for one, we haven't done any performance tests or even optimizations yet with pure SMILA setups.
In our own applications we use a different implementation, especially of the ObjectStore service, and the
implementation in SMILA is currently quite simple, just enough “to make it work”. I suppose that could be improved.
Then, using attachments with BinaryStorage quite probably slows everything down. We are planning
to change this so that attachments are included in the record bulks; I don't know when we will get to implement
this, however. If you don't have binary documents to process, you should try to put the crawled pages into
attributes instead of attachments.
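To illustrate the idea, here is a rough sketch of a record carrying the page body as a plain attribute (the attribute names and the record layout are just an example, not the exact SMILA record schema; check how your pipeline expects the content to arrive):

```json
{
  "_recordid" : "web:http://example.com/page1",
  "Url" : "http://example.com/page1",
  "Content" : "<html>…page body kept in a plain attribute…</html>"
}
```

The point is that attribute values travel inside the record bulks, while attachments currently take an extra round trip through BinaryStorage.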
On configuration: you can have a look at http://<smila-host>:8080/smila/debug, which shows the scale-up
limits for each worker. The relevant part looks like this on my machine:
"workerManager" : {
  "workers" : [ …, {
    "name" : "pipeletProcessor",
    "runAlways" : false,
    "scaleUpLimit" : 4,
    "scaleUpCurrent" : 0
  }, {
    "name" : "pipelineProcessor",
    "runAlways" : false,
    "scaleUpLimit" : 4,
    "scaleUpCurrent" : 0
  } ],
  …
If the scaleUpLimits are 1, you have to change something in configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json,
or the workers will not process multiple tasks in parallel.
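As a rough sketch, the kind of setting to look for in that file might be (the exact schema may differ between SMILA versions, so compare with the file shipped in your distribution; `maxScaleUp` is the parameter name Thomas mentions below, and the per-worker override shown here is an assumption):

```json
{
  "taskmanager" : {
    "maxScaleUp" : 4
  },
  "workers" : [ {
    "name" : "pipelineProcessor",
    "maxScaleUp" : 4
  } ]
}
```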
Finally, if the scaleUpCurrent values never get bigger than 1 (or even stay at 0 for longer periods) although the limits allow it, it is possible
that the crawling is too slow and bulk processing is faster than bulk creation (which may, of course, also be caused by the
objectstore.filesystem implementation). You can also check this at http://<smila-host>:8080/smila/tasks: if there are no
tasks “todo” while crawling, the workers will just sit waiting and scaling will not help.
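If you want to watch this programmatically, here is a small polling sketch. The endpoint URL comes from this thread, but the exact JSON shape of the response is an assumption, so the extraction simply sums every integer found under a "todo" key at any nesting depth rather than hard-coding a structure:

```python
import json
import urllib.request

def extract_todo_total(monitoring) -> int:
    """Sum all integer values stored under keys named 'todo', at any
    nesting depth, so we do not depend on the exact (assumed) shape
    of the /smila/tasks response."""
    total = 0
    if isinstance(monitoring, dict):
        for key, value in monitoring.items():
            if key == "todo" and isinstance(value, int):
                total += value
            else:
                total += extract_todo_total(value)
    elif isinstance(monitoring, list):
        for item in monitoring:
            total += extract_todo_total(item)
    return total

def fetch_todo_total(host: str = "localhost", port: int = 8080) -> int:
    """Fetch the task monitoring page of a running SMILA instance and
    return the summed 'todo' count."""
    with urllib.request.urlopen(f"http://{host}:{port}/smila/tasks") as resp:
        return extract_todo_total(json.load(resp))

# Usage against a live instance (hypothetical host name):
#   print(fetch_todo_total("my-smila-host"))
# A result that stays at 0 while crawling would indicate that bulk
# creation (crawling) is the bottleneck, not the processing workers.
```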
Hope this helps a bit.
Note that I will not be in the office much for the next two days, so I will probably not be able to help more before Friday.
Cheers,
Juergen
From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Wednesday, September 28, 2011 11:02 AM
To: SMILA USERS
Subject: [smila-user] performance degradation with the new processing
Hi folks,
I have done a little performance test with the new processing and Solr, and it seems to be slower than before (from 31 minutes to 50 minutes for a subset of the German Wikipedia).
It seems that scaling isn't doing the trick as it is supposed to (on one machine only) – or, very likely, I don't know how to configure it.
The setup is as follows:
Just one box with quad core and 4GB ram.
I used our (brox) standard AddPipeline, which does some Aperture-like conversion of documents (not really needed in this case, but always good to include in a test) and then puts them into the Solr index, which is configured the same as in SMILA (except that I had to switch the date field to string instead of date due to our open bug). So I pretty much used SMILA's default setup as described in “5 Minutes to Success”.
The rest of the configuration is the same as it was before the processing change, except for the mandatory changes resulting from the Q worker etc. no longer being present.
Now, maxScaleUp was 4 for the first run and 6 for the second. The second run was even 2 minutes slower, although that can be neglected and could be due to having started it with crawlW!?
At the same time, CPU utilization was rather low, e.g. only 30-40%.
Any hints? Or does the new processing just incur more overhead, paying off only when you also scale wide?
Thomas Menzel @ brox IT-Solutions GmbH