Kweo topology

Application Performance Management for distributed real-time data processing

In our first engineering blog post we discussed the Kweo core requirements and our high-level design decisions we made.

As one of the most important requirements is real-time user interaction, Application Performance Management at scale is encoded into our engineering DNA. It starts with architecture, and continues in solution design to the development ,test and production environment.

The Kweo distributed real-time data processing stack consists of the following major components:

 

The stack

  • Netty as event-driven network application framework
  • Apache Kafka as a fault-tolerant, high throughput distributed messaging system
  • Storm / Trident for distributed and fault-tolerant real-time computation
  • Apache Cassandra as a fault-tolerant, distributed column oriented database

 

The challenge

We needed an APM solution which works in our development, test and production environment.
The solution needed to be able to provide:

  • Infrastructure system monitoring and alerting
  • Application monitoring and alerting
  • Application Performance Management
  • Shows response time, CPU cost, API breakdown, suspensions (garbage collection)and IO time for each trace
  • Real-User monitoring
  • Historical metric storage up to 365 days
  • Runs in our environment and not as an external SaaS model
  • Can be used for deep distributed transaction tracing and as a distributed profiler
  • Always on, every transaction can be traced
  • Can trace through our stack above end-to-end
  • Works well in Amazon EC2
  • Supports tracing through Apache, PHP, Java, zeromq and custom protocols

 

The APM candidates

 

Selection Criteria

  CA Wily Introscope AppDynamicsNewRelicCompuware dynaTrace
JavaYESYESYESYES
Apache WebserverYESYESYESYES
Operating System AgentYESYESYESYES
PHPNOYESYESYES
Development kits for custom tagging and instrumentation through zeromq, custom protocolsNO, classes can be instrumented but cant trace through modern protocols, zeromq, Apache Kafka or StormYES but limited ( cant trace through zeromq and storm)NO ( Android and IOS coming soon)YES via ADK (JS,JAVA,.NET, Android and IOS and native C)
Distributed Tracing always onNONO, Sampling onlyNOYES
Distributed Profiler always onNONO, Sampling onlyNOYES
Real-User monitoringNO - Requires another product Customer Experience Manager (CEM)YES but limited for Web 2.0/HTML5 user experience, don't track entire visitsYES but limited for Web 2.0/HTML5 user experience, don't track entire visitsYES - supports Web 2.0/HTML5, tracks entire visits and has UEM complaint resolution capability
On-Premis SolutionYESYESNOYES
RT, CPU, API break down, suspensions(GC), IO time for each transactionNONONOYES

Compuware dynatrace is the APM vendor with best fit for the Kweo’s distributed data processing platform

We are able to trace through Netty -> Apache Kafka -> Storm/Trident -> zeromq -> Storm / Trident -> Cassandra 

 

Topology Views

dynaTrace auto-maps the topology for each transaction (PurePath) or for entire timeframes, whilst providing instant easy to read status overviews for each node and application (cluster).

 

Kweo topology

Compuware dynaTrace auto-mapped topology tracing through Netty, Aapche Kafka, Storm/Trident, zeromq and Apache Cassandra

 

 

 

System monitoring out of the box

These monitoring views come out of the box with zero configuration. The monitoring dashboard is built-in dynaTrace and requires no further configuration.

 

dynatrace host monitoring

dynatrace host monitoring out of the box

Application Performance Management and alerting with zero configuration

As before the application monitoring views come out of the box with zero configuration. The monitoring dashboard is built-in dynaTrace and requires no further configuration.

dynatrace application monitoring

dynatrace application monitoring out of the box

PurePaths – the distributed stack trace with detailed performance metrics

At Kweo we are operating a highly distributed system which operates asynchronously with many different protocols. It is important to understand what a single end-user click or service call does ( or doesn’t do)  in our system. Being able to trace back any distributed transaction is crucial to identifying issues quickly.

Purepath

dynaTrace Purepath

API distribution per transaction

Each PurePath can be looked at from the API distribution perspective. This is very valuable to see with one click if a specific API (including 3rd party APIs)  is responsible for excessive CPU usage or slow response times.

API-contribution

3 -clicks to root cause – problem analysis made easy

Another effective way to quickly identify issues is to switch to the contributors tab within a PurePath. This view flattens the distributed stack trace with it’s performance metrics and sorts it by highest contributors classes and methods. It also color-codes using deviations in execution time by default.

 

solr_slow_sync

One click one a Purepath and one click on Contributors tab identify the culprits (auto color-coded)

 

Some things are meant to be slow, but be ready to handle the consequences

Some transaction are not really problems , because they work as designed ( password encryption is deliberately high in CPU and time ), but when they pop up as in this image below, it is a reminder to implement effective measures that these expensive service calls don’t result in a massive system failure because of overuse.

dynaTrace API contribution

dynaTrace API contribution

 

Auto generated UML sequence diagrams for each transaction (Purepath)

This helps us to validate our solution design versus real-life implementation. Things like 4000 Cassandra calls (even if it takes only a few milliseconds) for a single transaction will be right in your face telling you clearly there is a flaw (in the design or implementation).

dynaTrace  Sequence Diagrams

dynaTrace auto generated UML Sequence Diagrams

 

In the following post we are explaining in details how we are tracing through Storm.

 

 

5 replies
  1. Luke
    Luke says:

    why is Riverbed’s OPNET AppInternals product not included in this? Certainly you’d think that someone ahead of New Relic and AppDynamics by Gartner’s standards in their APM MQ, would at least be evaluated?

  2. Sudhaker Raj
    Sudhaker Raj says:

    Thanks for wonderful comparison and insights – Great job. We are using dynaTrace in production for over 2 years and extremely happy with it.

    How did you instrument Netty? There is no out-of-box sensor pack for it.

Comments are closed.