Re: [mat-dev] Is it necessary to synchronize IntIndexCollector.get/set during Pass2Parser?

Hi Yi --

I've just merged a few of these patches in. You could try a build from the latest `master` branch, or wait for the nightly. Let me know how it goes.

thanks
Jason

On Sun, Dec 31, 2023 at 2:17 PM Jason Koch <jkoch@xxxxxxxxxxx> wrote:
Also -- I left 570670 patch #5 out (the FJ-based GarbageCleaner.clean()) since I couldn't recall the specific situation in which it helped.


Thanks
Jason

On Sun, Dec 31, 2023 at 2:12 PM Jason Koch <jkoch@xxxxxxxxxxx> wrote:
Hi Yi Yang

I've split 570670 into separate patches -- 582814, 582813, 582818.

I also added some extras that we run internally -- 582820, 528821, 582822 -- especially for slow-I/O situations such as S3/EBS/NFS-stored indexes and HPROF files.

Regarding the synchronized block in set(): I think we can make sets to different indices concurrent by CASing each element, but I don't think it can be made concurrent-safe for the same index. There is no double-word CAS available in Java, so there's not much we can do to guarantee an atomic change to two neighbouring long array elements. Do you have any ideas/suggestions?
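For illustration, a minimal sketch (not MAT code) of the per-element CAS idea using AtomicLongArray: writers to different elements never block each other, and a bit-field update within one element retries on conflict, but a value straddling two neighbouring elements still cannot be updated atomically.

```
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative only; not MAT's index code.
class CasBackedLongStore {
    private final AtomicLongArray data;

    CasBackedLongStore(int size) {
        data = new AtomicLongArray(size);
    }

    // Atomically replace bits [shift, shift + width) of element i with value.
    void setBits(int i, int shift, int width, long value) {
        long mask = (width == 64 ? -1L : (1L << width) - 1) << shift;
        long prev, next;
        do {
            prev = data.get(i);
            next = (prev & ~mask) | ((value << shift) & mask);
        } while (!data.compareAndSet(i, prev, next)); // retry on concurrent update
    }
}
```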

Andrew / Krum - I've tried to make all patches standalone. Please bear with me on Gerrit as this is my first project using it.

Thanks
Jason

On Thu, Jun 8, 2023 at 1:59 PM Jason Koch <jkoch@xxxxxxxxxxx> wrote:


On Wed, Jun 7, 2023 at 7:21 PM Yi Yang <qingfeng.yy@xxxxxxxxxxxxxxx> wrote:

I have briefly read through your optimization patches and found that there are indeed many areas of overlap; for example, patches 2+3 of 570670 are exactly what I wanted to do. I am happy to see these patches integrated into the mainline, but it seems they have been on hold for a long time.


Ah, of course -- this explains why I did not see it as a problem. We have run a variant of #570670 on our internal MAT build for over two years.

The patches work, but I have had no time to rebase + clean + complete the reviews.
 
> In my analysis, the majority of time was no longer in the earlier parsing stage, which was running close to the speed of I/O, but had moved to later stages (such as GC)

I don't quite understand this -- have I missed something? Based on my multiple CPU/wall-time analyses (on heaps from 8g to 40g), the long processing time during parsing always occurs stably in addObject, and I have never seen long processing time in GC. Are there any more clues in this regard? I used G1 to parse a 40g heap with a maximum JVM heap of 56g; total GC time was about 15s (full GC 3s + young GC 11s + mixed GC 1s).


You did not miss anything =) -- the problem was my poor choice of the term "GC". The issue is not the JVM garbage collector but MAT's GarbageCleaner.clean(), which becomes one of the longer stages. With #570670 applied, addObject performance improves, and GarbageCleaner.clean() performance also improves by running GarbageCleaner in a fork/join pool.
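As an illustrative sketch only (not the actual #570670 patch), the fork/join idea amounts to splitting the per-object work of the cleaning pass across a pool instead of scanning on one thread, assuming each object's work is independent:

```
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Illustrative only; scan() stands in for the real per-object work.
class ParallelCleanSketch {
    static void markReachable(int numObjects, boolean[] reachable) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            // A parallel stream submitted to the pool runs on that pool's threads.
            pool.submit(() -> IntStream.range(0, numObjects)
                    .parallel()
                    .forEach(i -> reachable[i] = scan(i))).join();
        } finally {
            pool.shutdown();
        }
    }

    static boolean scan(int objectId) {
        return true; // placeholder for the real per-object reachability check
    }
}
```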
 
If I understand correctly, the more objects in the heap, the more times addObject is called, so even simple optimizations to addObject lead to significant performance improvements.

Yes, I agree: for pass 1 and the first part of pass 2, addObject performance is the key scaling bottleneck.
 

SnapshotFactory.openSnapshot
└── SnapshotFactoryImpl.parse
    ├── HprofIndexBuilder.fill
    │   ├── HprofParserHandlerImpl.beforePass1
    │   ├── Pass1Parser.read
    │   │   ├── read header&version
    │   │   ├── read identifierSize
    │   │   ├── ....
    │   │   └── read anything else about objects
    │   ├── HprofParserHandlerImpl.beforePass2
    │   ├── Pass2Parser.read
    │   │   ├── read heap segment in parallel
    │   │   └── HprofParserHandlerImpl.addObject in parallel
    │   ├── HprofParserHandlerImpl.fillIn
    │   ├── IntArray1NWriter.writeTo
    │   ├── IntIndexCollector.writeTo
    │   └── LongIndexCollector.writeTo
    ├── GarbageCleaner.clean
    ├── SnapshotImpl.calculateDominatorTree
    └── SnapshotImpl.calculateMinRetainedHeapSizeForClasses
HprofParserHandlerImpl.addObject
├── HprofParserHandlerImpl.prepareHeapObject
├── HprofParserHandlerImpl.mapAddressToId
├── IntArray1NWriter.log
├── IntIndexCollector.set
└── LongIndexCollector.set

--original
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,898.853 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,868.729 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,845.501 s
--without_IndexCollector_lock_and_nocompressed
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,334.532 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,300.817 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,303.18 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,330.146 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,335.981 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,335.981 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,325.994 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,333.654 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,362.704 s
testParseDump(org.eclipse.mat.tests.parser.TestParse)  Time elapsed: 1,326.175 s

Thanks,
Yi Yang

------------------------------------------------------------------
From: Jason Koch <jkoch@xxxxxxxxxxx>
Send Time: 2023 Jun. 8 (Thu.) 01:42
To: "YANG, Yi" <qingfeng.yy@xxxxxxxxxxxxxxx>; Memory Analyzer Dev list <mat-dev@xxxxxxxxxxx>
Cc: Andrew Johnson <andrew_johnson@xxxxxxxxxx>
Subject: Re: [mat-dev] Is it necessary to synchronize IntIndexCollector.get/set during Pass2Parser?

This would be great to see. In my analysis, the majority of time was no longer in the earlier parsing stage, which was running close to the speed of I/O, but had moved to later stages (such as GC). So if you see contention in earlier stages, then yes, we could look to fix it.

In terms of optimisations, I think it would be great to get your opinion on these two as well, as I think they all hit the same indexing layer.

I think there's quite some overlap here, and it would be good to reach a decision on the real-world efficacy of the compressed arrays with current Java applications. The apps we use are typically large (minimum heap size is ~5GB; many apps are 100+GB), and a longer-term plan to address this, along with the 2bn object count, would be beneficial.




On Tue, Jun 6, 2023 at 8:09 PM Yi Yang via mat-dev <mat-dev@xxxxxxxxxxx> wrote:
> Is LongIndexCollector object2position also contended? LongIndexCollectorUncompressed would also be an option.

Yes, it is also highly contended. Everything I mentioned implicitly includes the Long counterpart. According to my profiling results, all the code in HprofParserHandlerImpl.addObject() is highly contended; this is the code I mainly plan to optimize in the future, including:
```
mapAddressToId              // resolveClassHierarchy
ClassImpl.addInstance()     // two atomic CASes
IntArray1NWriter.log()      // lock in publishTasks
IntIndexCollector.set()     // lock
LongIndexCollector.set()    // lock
```

> That’s interesting and using IntIndexCollectorUncompressed is a simple change.
> However, these choices will use more memory. With 16,000,000 objects, the object2classId index could increase from 48MB to 64MB and object2position from perhaps 60MB to 128MB.

Yes, but in our production environment our machines have a large amount of physical memory, and our users are more concerned with analysis speed. So I think an optional solution is to let users choose whether to use the compressed version via a parameter (such as -DcompressInternalArray). Additionally, IndexWriter and IndexReader seem to be paired classes, so simply replacing IntIndexCollector with IntIndexCollectorUncompressed may not be enough; perhaps a new IntIndexReaderUncompressed needs to be added? Alternatively, it is possible to directly modify the underlying array of ArrayIntCompressed.
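As a minimal sketch of that opt-in switch (the property name -DcompressInternalArray is my suggestion above, not an existing MAT option, and the factory parameters are illustrative only):

```
import java.util.function.IntFunction;

// Illustrative only; the real MAT collector constructors are not shown.
class CollectorChoice {
    static <C> C create(int size,
                        IntFunction<C> compressedFactory,
                        IntFunction<C> uncompressedFactory) {
        // Default to compressed to preserve current memory behaviour.
        boolean compress = Boolean.parseBoolean(
                System.getProperty("compressInternalArray", "true"));
        return (compress ? compressedFactory : uncompressedFactory).apply(size);
    }
}
```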

------------------------------------------------------------------
From: Andrew Johnson <andrew_johnson@xxxxxxxxxx>
Send Time: 2023 Jun. 6 (Tue.) 20:37
To: Memory Analyzer Dev list <mat-dev@xxxxxxxxxxx>
Subject: Re: [mat-dev] Is it necessary to synchronize IntIndexCollector.get/set during Pass2Parser?

That’s interesting and using IntIndexCollectorUncompressed is a simple change.

Is LongIndexCollector object2position also contended? LongIndexCollectorUncompressed would also be an option.

 

However, these choices will use more memory. With 16,000,000 objects, the object2classId index could increase from 48MB to 64MB and object2position from perhaps 60MB to 128MB.

 

We could select depending on how much heap is available; we do something similar to disable parallel parsing when memory is short. However, I have found that MAT can get stuck when the heap size is on the boundary of being too small.

org.eclipse.mat.tests.regression.Application -performance
can run with the heap between two -Xmx values, using a binary search to find the smallest heap which works.
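A sketch of that binary search, assuming success is monotonic in heap size (which the boundary problem above shows is not always true); the parse predicate here is hypothetical:

```
import java.util.function.IntPredicate;

// Illustrative only; not the actual regression Application code.
class SmallestHeapSearch {
    // Find the smallest -Xmx (in MB) at which a parse attempt succeeds.
    static int smallestWorkingHeapMb(int lowMb, int highMb,
                                     IntPredicate parseSucceedsAt) {
        while (lowMb < highMb) {
            int mid = lowMb + (highMb - lowMb) / 2;
            if (parseSucceedsAt.test(mid)) {
                highMb = mid;      // works: try a smaller heap
            } else {
                lowMb = mid + 1;   // fails: need more memory
            }
        }
        return lowMb;              // smallest heap that worked
    }
}
```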

 

I’d like to fix ‘Bug 581932 - ArrayIndexOutOfBoundsException in ArrayIntCompressed on beforePass2 parsing’ before making further changes to the HPROF parser. Do you have an idea for solving this bug?

 

Andrew Johnson

 

 

From: Yi Yang
Sent: Thursday, June 1, 2023 4:28 AM
To: Andrew Johnson; Memory Analyzer Dev list <mat-dev@xxxxxxxxxxx>
Cc: jkoch
Subject: Re: [mat-dev] Is it necessary to synchronize IntIndexCollector.get/set during Pass2Parser?


Hi Jason and Andrew,

 

> To confirm, are you referring to this line?

Yes, this is exactly what I was referring to.

 

According to profiling data for wall time and CPU time, there is high contention in IntIndexCollector.set and get, which takes up a significant amount of time.

 

Experiment #1

We can break the problem down into two parts:

- `synchronized(){}` between IntIndexCollector.set and IntIndexCollector.get: there is no time overlap between set and get, so we don't need to synchronize get.

- synchronization within IntIndexCollector.set: set is called concurrently, and as mentioned above, even if two indices are different, the underlying byte[] ranges may overlap, making it difficult to remove the lock.

Given that IntIndexCollector.set is highly contended, the lock undoubtedly has a significant performance impact, so we need to find another way to handle this.

 

Experiment #2

Then I replaced ArrayIntCompressed with an uncompressed int[] version and removed all locks within the IntIndexCollector.set and get methods. When parsing a 40GB heap, I consistently achieved a 32.5% performance improvement. It seems that the IntIndexCollectorUncompressed mentioned by Andrew is what I want.
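For reference, a minimal sketch of the uncompressed variant (illustrative, not MAT's IntIndexCollectorUncompressed): plain int writes in Java do not tear, so concurrent set() calls to distinct indices need no lock, and the single-threaded get() phase starts only after the writer threads have been joined, which establishes the necessary happens-before ordering.

```
// Illustrative only; not MAT's IntIndexCollectorUncompressed.
class UncompressedIntCollectorSketch {
    private final int[] values;

    UncompressedIntCollectorSketch(int size) {
        values = new int[size];
    }

    // Called concurrently during pass 2; each thread writes unique indices.
    void set(int index, int value) {
        values[index] = value;
    }

    // Called single-threaded during fillIn, after all writers have joined.
    int get(int index) {
        return values[index];
    }
}
```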

 

--original

Time elapsed: 1,898.853 s

Time elapsed: 1,868.729 s

Time elapsed: 1,845.501 s

--remove_locks_and_use_int[]_in_ArrayIntCompressed

Time elapsed: 1,280.707 s

Time elapsed: 1,282.467 s

--use IntIndexCollectorUncompressed

Time elapsed: 1,318.243 s

 

Thanks,

Yi Yang

 

------------------------------------------------------------------

From: Andrew Johnson
Send Time: 2023 Jun. 1 (Thu.) 04:47
To: "YANG, Yi"; Memory Analyzer Dev list <mat-dev@xxxxxxxxxxx>
Subject: RE: [mat-dev] Is it necessary to synchronize IntIndexCollector.get/set during Pass2Parser?

 

Thank you for your interest in improving the performance of MAT.

 

The parsing code was enhanced to have some parallelism, and some of the index code was enhanced to be thread safe.

The index code in org.eclipse.mat.parser is an MAT API, so we can’t really make incompatible changes, as an adopter of MAT might already be using it and presuming some thread safety. We should improve the Javadoc to explain what is safe.

 

I think your example is for code in HprofParserHandlerImpl.java such as:

        // log address
        object2classId.set(index, classIndex);
        object2position.set(index, object.filePosition);

Yes, the index is separate for separate threads, but IntIndexCollector then chooses a page based on the index, giving an ArrayIntCompressed object in which to set the value. This internally has a byte array, but the value can be spread across multiple bytes: the class index is known to have leading zeroes, so only certain bits need to be stored. This means two threads could access the same byte in the byte array, so synchronization is needed. It might be possible to use AtomicIntegerArray instead of a byte[], but that could be slower.

Is there contention on the lock? The page size is 1000000, so if the number of objects in the heap is smaller than this, they will all end up on the same ArrayIntCompressed. If the number is much larger, there might still be a lot of contention: the objects in the HPROF are often in address order, so the threads might be processing objects with similar object indexes, and even reducing the page size might not help.

Is using IntIndexCollectorUncompressed any faster (at the cost of more memory)? There may be similar problems with the object2position index.
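To make the overlap concrete, here is a hypothetical bit-packed writer (not MAT's ArrayIntCompressed): with a 12-bit value width, index 0 occupies bits 0-11 (bytes 0-1) and index 1 occupies bits 12-23 (bytes 1-2), so two threads writing different indices both read-modify-write byte 1.

```
// Hypothetical packed layout, illustrative only.
class PackedWriteSketch {
    // Write value into bits [index*bitWidth, (index+1)*bitWidth), MSB first.
    static void writePacked(byte[] store, int index, int value, int bitWidth) {
        int bitPos = index * bitWidth;
        for (int b = 0; b < bitWidth; b++) {
            int bit = (value >>> (bitWidth - 1 - b)) & 1;
            int pos = bitPos + b;
            int mask = 1 << (7 - (pos & 7));
            if (bit != 0) {
                store[pos >>> 3] |= mask;  // non-atomic read-modify-write
            } else {
                store[pos >>> 3] &= ~mask; // on a byte other indices may share
            }
        }
    }
}
```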

 

For more performance work we would like some reproducible tests -- there are some performance tests in org.eclipse.mat.tests, but we don't run them.

 

Regards,

 

Andrew Johnson

 

 

 

From: mat-dev <mat-dev-bounces@xxxxxxxxxxx> On Behalf Of Yi Yang via mat-dev
Sent: Monday, May 29, 2023 3:24 AM

Hi all, when building the HPROF index, MAT first creates a Pass2Parser that concurrently parses objects and adds them to IntIndexCollector. Then, in the subsequent fillIn routine, it reads objects from IntIndexCollector in order and writes them to the index file. The entire process is shown below:

HprofIndexBuilder
  - Pass2Parser.read
    - Pass2Parser.readSegment
      - addObject to IntIndexCollector in parallel by calling IntIndexCollector.set(key, value)
  - fillIn
    - write IntIndexCollector to file by calling IntIndexCollector.get(key)

 

It seems that there is no time overlap between IntIndexCollector.get and IntIndexCollector.set. Additionally, when addObject calls IntIndexCollector.set(key, value), the key is the object's ID and is unique. So is it really necessary to protect get/set with synchronized? I observed that addObject has considerable performance overhead, and removing this lock results in a 19.3% performance improvement.

 

 

Best regards

Yi Yang

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU

_______________________________________________
mat-dev mailing list
mat-dev@xxxxxxxxxxx
To unsubscribe from this list, visit https://www.eclipse.org/mailman/listinfo/mat-dev
