Re: [recommenders-dev] [Jayes] Jayes and Apache Spark Integration

Hi Ekin,
There is no integration with Spark that I know of. The underlying algorithm is essentially message passing, so it could be distributed, and also parallelized to a degree, but you would have to rewrite a lot, if not most, of the code.
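
For reference, single-machine inference with Jayes looks roughly like this today (untested sketch along the lines of the Jayes tutorial; the node names and probabilities are made up, and the exact package paths and method names should be checked against your Jayes version):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.eclipse.recommenders.jayes.BayesNet;
import org.eclipse.recommenders.jayes.BayesNode;
import org.eclipse.recommenders.jayes.inference.IBayesInferer;
import org.eclipse.recommenders.jayes.inference.junctionTree.JunctionTreeAlgorithm;

public class JayesExample {
    public static void main(String[] args) {
        BayesNet net = new BayesNet();

        // A root node with two outcomes and a prior distribution.
        BayesNode a = net.createNode("a");
        a.addOutcomes("true", "false");
        a.setProbabilities(0.2, 0.8);

        // A child node; its table is given row by row, one row per parent configuration.
        BayesNode b = net.createNode("b");
        b.addOutcomes("one", "two", "three");
        b.setParents(Arrays.asList(a));
        b.setProbabilities(
                0.1, 0.4, 0.5,  // a == "true"
                0.3, 0.4, 0.3); // a == "false"

        // Exact inference via the junction tree algorithm.
        IBayesInferer inferer = new JunctionTreeAlgorithm();
        inferer.setNetwork(net);

        Map<BayesNode, String> evidence = new HashMap<>();
        evidence.put(a, "false");
        inferer.setEvidence(evidence);

        System.out.println(Arrays.toString(inferer.getBeliefs(b)));
    }
}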

But there are a few caveats:
- nodes have very different sizes: a node's table grows with its number of parents, so central nodes with many parents take up most of the space (e.g. a binary node with 10 binary parents already needs a table of 2^11 = 2048 entries). Depending on your model, distributing the nodes may then not achieve anything, neither for space nor for time; the code recommender models at the time I wrote Jayes had exactly that structure. Of course, in other cases this may work.
- double precision values are used in the computations, and a lot of multiplication happens, so intermediate results can underflow, which results in a NumericalInstabilityException. Larger models are more at risk because more small values get multiplied. I would expect a Spark implementation to deal with this better (use a different number type or something) to support the large models it is meant for. Not all big models will necessarily have this problem, but it is something to be aware of; the small sketch below illustrates the effect.
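
To illustrate the underflow (plain Java, not Jayes code; the class name and numbers are made up for the demonstration): multiplying many small probabilities with doubles hits 0.0 quickly, while summing their logarithms stays representable.

public class UnderflowDemo {
    public static void main(String[] args) {
        double product = 1.0; // product of 400 small "probabilities"
        double logSum = 0.0;  // the same computation in log space
        for (int i = 0; i < 400; i++) {
            product *= 1e-3;
            logSum += Math.log(1e-3);
        }
        System.out.println(product); // 0.0 -- underflowed, the smallest double is ~4.9e-324
        System.out.println(logSum);  // about -2763.1, still perfectly usable
    }
}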

The design decisions for Jayes:
- Trade space for performance (things are cached etc.)
- Exact inference
- One inference at a time (JunctionTreeAlgorithm is not thread-safe)
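
The last point just means you need one JunctionTreeAlgorithm per thread if you want to run queries in parallel on a single machine. Untested sketch (the class, method and variable names are made up; each per-thread instance builds its own junction tree, so this trades memory for throughput, and the package paths should be checked against your Jayes version):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.eclipse.recommenders.jayes.BayesNet;
import org.eclipse.recommenders.jayes.BayesNode;
import org.eclipse.recommenders.jayes.inference.IBayesInferer;
import org.eclipse.recommenders.jayes.inference.junctionTree.JunctionTreeAlgorithm;

public class ParallelQueries {

    // One JunctionTreeAlgorithm per thread, all reading the same, fully built
    // BayesNet (the net itself is not modified during inference).
    static void queryInParallel(BayesNet net, BayesNode queryNode,
            List<Map<BayesNode, String>> evidenceSets) {

        ThreadLocal<IBayesInferer> perThread = ThreadLocal.withInitial(() -> {
            JunctionTreeAlgorithm jta = new JunctionTreeAlgorithm();
            jta.setNetwork(net);
            return jta;
        });

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (Map<BayesNode, String> evidence : evidenceSets) {
            pool.submit(() -> {
                IBayesInferer inferer = perThread.get();
                inferer.setEvidence(evidence);
                double[] beliefs = inferer.getBeliefs(queryNode);
                System.out.println(java.util.Arrays.toString(beliefs));
            });
        }
        pool.shutdown();
    }
}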

Also, regarding the memory leak: I would be surprised if there was one. There are not many places where something could leak, and memory consumption was tested. On the other hand, feel free to prove me wrong ;-)
The error looks like the model is just too big.

The first place you should look is your model: can it be simplified? Are there independence relations that you can somehow exploit?

Hope this helps a little. If you decide to try to port Jayes to Spark, I would be interested in hearing about the results and your experience.

Regards, Michael


On 20.10.2017 13:01, "Ekincan Ufuktepe" <ekincanufuktepe@xxxxxxxxx> wrote:
Hi everyone,

I have been working with Jayes for about 3 years and I love it. I work with thousands of nodes, and sometimes I have nodes with many parent nodes.

I have increased my heap size and have reached my limits. I also created a heap dump and received the "problem suspect" report given below, which might indicate a memory leak, but I am not sure:

The thread java.lang.Thread @ 0x6000762b8 main keeps local variables with total size 5,578,598,448 (99.42%) bytes.

The memory is accumulated in one instance of "org.eclipse.recommenders.jayes.factor.AbstractFactor[]" loaded by "sun.misc.Launcher$AppClassLoader @ 0x600013810".

The stacktrace of this Thread is available. See stacktrace.


Keywords
org.eclipse.recommenders.jayes.factor.AbstractFactor[]
sun.misc.Launcher$AppClassLoader @ 0x600013810

From the total size I guess you can see how large the Bayesian nets I am dealing with are, and this is only one of the smallest ones :)

As a solution, I wondered whether using Apache Spark would solve this problem. I have read that Spark supports many machine learning tools and that they all have integrations. So I was wondering whether such an integration exists for Jayes, or whether it is possible to use Bayesian nets on a distributed system at all. If so, how can the network be distributed to the slaves in a distributed system?

Best Regards,

Ekin


_______________________________________________
recommenders-dev mailing list
recommenders-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/recommenders-dev

