DevilKing's blog



JVM Profiler for tracing distributed JVM applications

Original article link

While Spark makes data technology more accessible, right-sizing the resources allocated to Spark applications and optimizing the operational efficiency of our data infrastructure require more fine-grained insights about these systems, namely their resource usage patterns.

Our existing tools could only monitor server-level metrics and did not gauge metrics for individual applications. We needed a solution that could collect metrics for each process and correlate them across processes for each application. Additionally, we do not know when these processes will launch and how long they will take. To be able to collect metrics in this environment,
the profiler needs to be launched automatically with each process.
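Launching a profiler automatically with each process is what the JVM's standard `-javaagent` mechanism provides: the agent jar is loaded and started by the JVM itself before the application's `main` runs. A sketch of wiring this up for Spark executors (jar paths and the application jar name are illustrative, not from the article):

```shell
# Illustrative only: distribute the profiler jar to every executor and attach
# it via -javaagent, so the profiler starts and stops with each executor JVM.
spark-submit \
  --jars hdfs:///shared/jvm-profiler.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:jvm-profiler.jar" \
  my-app.jar
```

Because the agent rides along on the JVM flag, no coordination is needed around when executors launch or how long they live.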

What does the JVM Profiler do?

The JVM Profiler is composed of three key features that make it easier to collect performance and resource usage metrics, and then serve these metrics to other systems (e.g. Apache Kafka) for further analysis:

  • A java agent: By incorporating a Java agent into our profiler, users can collect various metrics (e.g. CPU/memory usage) and stack traces for JVM processes in a distributed way.
  • Advanced profiling capabilities: Our JVM Profiler allows us to trace arbitrary Java methods and arguments in the user code without making any actual code changes. This feature can be used to trace HDFS NameNode RPC call latency for Spark applications and identify slow method calls. It can also trace the HDFS file paths each Spark application reads or writes to identify hot files for further optimization.
  • Data analytics reporting: At Uber, we use the profiler to report metrics to Kafka topics and Apache Hive tables, making data analytics faster and easier.

Typical use cases

  • Right-size executor: We use memory metrics from the JVM Profiler to track actual memory usage for each executor so we can set the proper value for the Spark “executor-memory” argument.
  • Monitor HDFS NameNode RPC latency: We profile methods on the class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB in a Spark application and identify long latencies on NameNode calls. We monitor more than 50 thousand Spark applications each day with several billions of such RPC calls.
  • Monitor driver dropped events: We profile methods like org.apache.spark.scheduler.LiveListenerBus.onDropEvent to trace situations during which the Spark driver event queue becomes too long and drops events.
  • Trace data lineage: We profile file path arguments on the method org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations and org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock to trace what files are read and written by the Spark application.
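The method- and argument-tracing use cases above are configured through agent options rather than code changes. A hypothetical invocation in the spirit of the project's README (option names such as `durationProfiling` and `argumentProfiling`, and the `.1` argument-index suffix, may differ across versions):

```shell
# Illustrative only: profile call duration of addBlock and capture its first
# argument (the HDFS file path), reporting metrics to the console.
java \
  -javaagent:jvm-profiler.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,\
durationProfiling=org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock,\
argumentProfiling=org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock.1 \
  -jar my-app.jar
```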

Repo link

The entry point is Agent.java:

import java.lang.instrument.Instrumentation;

public final class Agent {

    private static AgentImpl agentImpl = new AgentImpl();

    private Agent() {
    }

    // Called when the agent is attached to an already-running JVM;
    // simply delegates to premain.
    public static void agentmain(final String args, final Instrumentation instrumentation) {
        premain(args, instrumentation);
    }

    // Standard java.lang.instrument entry point, invoked before the
    // application's main() when the JVM starts with -javaagent.
    public static void premain(final String args, final Instrumentation instrumentation) {
        System.out.println("Java Agent " + AgentImpl.VERSION + " premain args: " + args);

        Arguments arguments = Arguments.parseArgs(args);
        arguments.runConfigProvider();
        agentImpl.run(arguments, instrumentation, null);
    }
}
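Both entry points receive the agent options as a single string (everything after the `=` in `-javaagent:jvm-profiler.jar=...`), which `Arguments.parseArgs` turns into structured options before `agentImpl.run` starts the profilers. A simplified, hypothetical sketch of comma-separated `key=value` parsing in that spirit (the real `Arguments` class does more, e.g. running config providers):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of agent-argument parsing, in the spirit of the
// profiler's Arguments.parseArgs; not the actual class from the repo.
public final class AgentArgs {
    // Parses "k1=v1,k2=v2,..." into a map; malformed parts are skipped.
    public static Map<String, String> parse(String args) {
        Map<String, String> result = new HashMap<>();
        if (args == null || args.trim().isEmpty()) {
            return result;
        }
        for (String part : args.split(",")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                result.put(part.substring(0, eq).trim(),
                           part.substring(eq + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] unused) {
        Map<String, String> opts =
            parse("reporter=com.uber.profiling.reporters.ConsoleOutputReporter,tag=mytag,metricInterval=5000");
        System.out.println(opts.get("tag"));            // prints "mytag"
        System.out.println(opts.get("metricInterval")); // prints "5000"
    }
}
```

Keeping the options as flat strings is what lets the same argument list flow unchanged through both `premain` (startup attach) and `agentmain` (runtime attach).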