Hi,
well … for one, we haven't done any performance tests or even optimizations yet with pure SMILA setups.
In our own applications we use a different implementation, especially of the ObjectStore service, and the
implementation in SMILA is currently quite simple, just enough “to make it work”. I suppose that could be improved.
Then, using attachments with BinaryStorage quite probably slows everything down. We are planning
to change this so that attachments are included in the record bulks; I don't know when we will get to implement
this, however. If you don't have binary documents to process, you should try to put the crawled pages into
attributes instead of attachments.
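To illustrate the idea, here is a rough sketch of a record carrying the page body as a plain attribute (the attribute names and the record layout are just an example, not the exact SMILA record schema; check how your pipeline expects the content to arrive):

```json
{
  "_recordid" : "web:http://example.com/page1",
  "Url" : "http://example.com/page1",
  "Content" : "<html>…page body kept in a plain attribute…</html>"
}
```

The point is that attribute values travel inside the record bulks, while attachments currently take an extra round trip through BinaryStorage.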
On configuration: you can have a look at http://<smila-host>:8080/smila/debug, which shows the scale-up
limits for each worker. The relevant part looks like this on my machine:
"workerManager" : {
  "workers" : [ …, {
    "name" : "pipeletProcessor",
    "runAlways" : false,
    "scaleUpLimit" : 4,
    "scaleUpCurrent" : 0
  }, {
    "name" : "pipelineProcessor",
    "runAlways" : false,
    "scaleUpLimit" : 4,
    "scaleUpCurrent" : 0
  } ],
  …
If the scaleUpLimits are 1, you have to change something in configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json,
or the workers will not process multiple tasks in parallel.
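As a rough sketch, the kind of setting to look for in that file might be (the exact schema may differ between SMILA versions, so compare with the file shipped in your distribution; `maxScaleUp` is the parameter name Thomas mentions below, and the per-worker override shown here is an assumption):

```json
{
  "taskmanager" : {
    "maxScaleUp" : 4
  },
  "workers" : [ {
    "name" : "pipelineProcessor",
    "maxScaleUp" : 4
  } ]
}
```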
Finally, if the scaleUpCurrent values never get bigger than 1 (or even stay at 0 for longer periods) although the limits allow it, it is possible
that the crawling is too slow and bulk processing is faster than bulk creation (which may, of course, also be caused by the
objectstore.filesystem implementation). You can also check this at http://<smila-host>:8080/smila/tasks: if there are no
tasks “todo” while crawling, the workers will just sit waiting and scaling will not help.
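If you want to watch this programmatically, here is a small polling sketch. The endpoint URL comes from this thread, but the exact JSON shape of the response is an assumption, so the extraction simply sums every integer found under a "todo" key at any nesting depth rather than hard-coding a structure:

```python
import json
import urllib.request

def extract_todo_total(monitoring) -> int:
    """Sum all integer values stored under keys named 'todo', at any
    nesting depth, so we do not depend on the exact (assumed) shape
    of the /smila/tasks response."""
    total = 0
    if isinstance(monitoring, dict):
        for key, value in monitoring.items():
            if key == "todo" and isinstance(value, int):
                total += value
            else:
                total += extract_todo_total(value)
    elif isinstance(monitoring, list):
        for item in monitoring:
            total += extract_todo_total(item)
    return total

def fetch_todo_total(host: str = "localhost", port: int = 8080) -> int:
    """Fetch the task monitoring page of a running SMILA instance and
    return the summed 'todo' count."""
    with urllib.request.urlopen(f"http://{host}:{port}/smila/tasks") as resp:
        return extract_todo_total(json.load(resp))

# Usage against a live instance (hypothetical host name):
#   print(fetch_todo_total("my-smila-host"))
# A result that stays at 0 while crawling would indicate that bulk
# creation (crawling) is the bottleneck, not the processing workers.
```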
Hope this helps a bit.
Note that I will not be in the office much for the next two days, so I will probably not be able to help more before Friday.
Cheers,
Juergen
From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Thomas Menzel
Sent: Wednesday, September 28, 2011 11:02 AM
To: SMILA USERS
Subject: [smila-user] performance degradation with the new processing
Hi folks,
I have done a little performance test with the new processing and Solr, and it seems to be slower than before (from 31 minutes to 50 minutes for a subset of the German Wikipedia).
It seems that scaling isn't doing the trick as it is supposed to (on one machine only) – or, very likely, I don't know how to configure it.
The setup is as follows:
Just one box with quad core and 4GB ram.
I used our (brox) standard AddPipeline, which does some Aperture-like conversion of documents (not really needed in this case, but always good to include in a test) and then puts them into the Solr index, which is configured the same as in SMILA (except that I had to switch the date field to string instead of date due to our open bug). So I pretty much used SMILA's default setup as described in “5 Minutes to Success”.
The rest of the configuration is the same as it was before the processing change, except for the mandatory changes resulting from the Q worker etc. no longer being present.
Now, maxScaleUp was 4 for the first run and 6 for the second. The second run was even 2 minutes slower, although that can be neglected and could be due to having started it with crawlW!?
At the same time, CPU utilization was rather low, e.g. only 30-40%.
Any hints? Or does the new processing just incur more overhead, paying off only when you also scale wide?
Thomas Menzel @ brox IT-Solutions GmbH