
Re: [mat-dev] A suggestion: would you benefit from the 'jzran' library for random-access gzip archives?

mat-dev-bounces@xxxxxxxxxxx wrote on 05/04/2011 15:16:07:
> Tsvetkov, Krum 
> I can give some technical details about the HPROF parser. MAT is 
> also working with IBM system dumps (which are zipped); probably 
> Andrew (who did the IBM dumps parser) could give some info if such 
> an approach could work there.
> 
> On Behalf Of Eugene Kirpichov:
> Hello,
> 
> A while ago I wrote a library for random access to gzip archives -
> jzran http://code.google.com/p/jzran . I originally wrote it for the
> logophagus project http://code.google.com/p/logophagus , but hoped to
> find other uses for it.
> 
> The library is BSD-licensed, so basically free for any kind of usage.
> 
> I wonder if the Eclipse Memory Analyzer would benefit from it? I think
> it could be cool to open/analyze gzipped .hprof files without
> decompressing them (it's quite a frequent situation e.g. in Yandex
> among my ex-colleagues - you gzip a profile on a remote server, copy
> it to your machine, decompress and study it with yjp). Perhaps in some
> cases it could even be faster than opening uncompressed ones. Or maybe
> you could store some of your indices in compressed form and use jzran
> to read them.
> 
The mat.dtfj parser uses IBM DTFJ to read the dumps, and the
unzipping of compressed dumps is handled by DTFJ.

The current dump types handled by IBM DTFJ code are:

.phd portable heap dumps
The phd file format is a compressed binary format without random access.
It can be read by the IBM DTFJ/PHD code, or by svcdump.jar sometimes
available from IBM. The DTFJ/PHD code ends up reading a PHD file
multiple times because of the sequential nature of the file format.

.phd.gz gzipped portable heap dumps
The PHD reader code uses GZIPInputStream to uncompress the dump on the
fly before the PHD code interprets it.
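
As a rough illustration of the on-the-fly approach (not the actual DTFJ/PHD
code), the sketch below wraps a file in GZIPInputStream when its name ends
in ".gz". The class and method names are hypothetical. Because the stream is
sequential, every pass over the dump has to reopen it and decompress from
byte 0 again, which is why multiple passes are costly.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of DTFJ or MAT.
class GzipStreamingSketch {

    /** Opens a dump, gunzipping transparently when the name ends in ".gz". */
    static InputStream open(Path dump) throws IOException {
        InputStream raw = new BufferedInputStream(Files.newInputStream(dump));
        return dump.getFileName().toString().endsWith(".gz")
                ? new GZIPInputStream(raw)
                : raw;
    }

    /** One sequential pass: counts the uncompressed bytes in the dump. */
    static long countBytes(Path dump) throws IOException {
        try (InputStream in = open(dump)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }
}
```

A reader that needs two passes would simply call open() twice, paying the
decompression cost each time.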

.dmp and .dmp.xml Core dump with XML file generated by the IBM Jextract
tool
These aren't compressed.

.dmp.zip Compressed core dump and associated XML file generated by the
IBM Jextract tool together with some shared libraries from the JVM.

The DTFJ reader unzips the core dump (and libraries) to a temporary
directory. The XML is unzipped in memory as required.
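
A minimal sketch of that pattern using only java.util.zip, assuming nothing
about how DTFJ implements it internally: large entries are extracted to a
temporary directory so they can be opened as ordinary files, while a small
entry such as the XML can be read straight into memory. The class and method
names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not the DTFJ implementation.
class UnzipToTempSketch {

    /** Extracts every entry of the zip into a fresh temporary directory. */
    static Path extractAll(Path zip) throws IOException {
        Path dir = Files.createTempDirectory("dump");
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                Path target = dir.resolve(e.getName()).normalize();
                // Refuse entries such as "../x" that would land outside dir.
                if (!target.startsWith(dir)) {
                    throw new IOException("Entry escapes target dir: " + e.getName());
                }
                if (e.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                try (InputStream in = zf.getInputStream(e)) {
                    Files.copy(in, target);
                }
            }
        }
        return dir;
    }

    /** Reads a single (small) entry fully into memory. */
    static byte[] readInMemory(Path zip, String entryName) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry e = zf.getEntry(entryName);
            if (e == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zf.getInputStream(e)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        }
    }
}
```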

javacore*.txt These are text format dumps.


So jzran can't directly help the MAT DTFJ parser as the unzipping is
done by the IBM DTFJ code. The DTFJ API only accepts a File, so a
random access stream provided by jzran is not suitable.

In the DTFJ code I did try wrapping a FileCacheImageInputStream around
an InputStream from a ZipEntry. That meant that the DTFJ code could
start to read the core dump as it was being unzipped, and the
FileCacheImageInputStream allowed random access. It didn't help much,
because data from the end of the core file was often needed early on,
which forced most of the file to be unzipped anyway.
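
For reference, a minimal sketch of that wrapping (not the experimental DTFJ
code itself; the class and method names are hypothetical).
javax.imageio.stream.FileCacheImageInputStream copies whatever it has read
from the source stream into a temporary cache file, so seeking backwards is
cheap, but seeking forwards still has to inflate everything up to the
requested offset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.imageio.stream.FileCacheImageInputStream;

// Hypothetical helper illustrating the idea only.
class CachedZipEntrySketch {

    /** Reads length bytes at the given offset of a zip entry's uncompressed data. */
    static byte[] readAt(Path zip, String entryName, long offset, int length)
            throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            // A null cacheDir means the default temporary-file directory is used.
            try (InputStream in = zf.getInputStream(entry);
                 FileCacheImageInputStream cache =
                         new FileCacheImageInputStream(in, null)) {
                // Forward seek: the data before offset is inflated and cached
                // when the read below happens.
                cache.seek(offset);
                byte[] out = new byte[length];
                cache.readFully(out);
                return out;
            }
        }
    }
}
```

A request near the end of the entry therefore costs about as much as
unzipping the whole thing, which matches the behaviour described above.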

Random access to the PHD file would be useful for the DTFJ/PHD code.
jzran would not help with the PHD format itself, but it could
conceivably help with random access when the file is gzipped (the
alternative being a FileCacheImageInputStream around the ungzipped
stream). That's a question for the IBM people writing the DTFJ code
though.

Andrew Johnson






Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU







