Kweo topology

Application Performance Management for distributed real-time data processing

In our first engineering blog post we discussed Kweo’s core requirements and the high-level design decisions we made.

Because one of our most important requirements is real-time user interaction, Application Performance Management at scale is encoded into our engineering DNA. It starts with architecture and continues through solution design into the development, test and production environments.

The Kweo distributed real-time data processing stack consists of the following major components:

 

The stack

  • Netty as event-driven network application framework
  • Apache Kafka as a fault-tolerant, high-throughput distributed messaging system
  • Storm / Trident for distributed and fault-tolerant real-time computation
  • Apache Cassandra as a fault-tolerant, distributed column-oriented database

 

The challenge

We needed an APM solution that works in our development, test and production environments.
The solution needed to meet the following requirements:

  • Infrastructure system monitoring and alerting
  • Application monitoring and alerting
  • Application Performance Management
  • Shows response time, CPU cost, API breakdown, suspensions (garbage collection) and IO time for each trace
  • Real-User monitoring
  • Historical metric storage up to 365 days
  • Runs in our environment and not as an external SaaS model
  • Can be used for deep distributed transaction tracing and as a distributed profiler
  • Always on, every transaction can be traced
  • Can trace through our stack above end-to-end
  • Works well in Amazon EC2
  • Supports tracing through Apache, PHP, Java, zeromq and custom protocols

 

The APM candidates

 

Selection Criteria

| Criterion | CA Wily Introscope | AppDynamics | NewRelic | Compuware dynaTrace |
|---|---|---|---|---|
| Java | YES | YES | YES | YES |
| Apache Webserver | YES | YES | YES | YES |
| Operating System Agent | YES | YES | YES | YES |
| PHP | NO | YES | YES | YES |
| Development kits for custom tagging and instrumentation through zeromq, custom protocols | NO, classes can be instrumented but can't trace through modern protocols, zeromq, Apache Kafka or Storm | YES but limited (can't trace through zeromq and Storm) | NO (Android and iOS coming soon) | YES via ADK (JS, Java, .NET, Android, iOS and native C) |
| Distributed tracing always on | NO | NO, sampling only | NO | YES |
| Distributed profiler always on | NO | NO, sampling only | NO | YES |
| Real-User monitoring | NO, requires another product, Customer Experience Manager (CEM) | YES but limited for Web 2.0/HTML5 user experience, doesn't track entire visits | YES but limited for Web 2.0/HTML5 user experience, doesn't track entire visits | YES, supports Web 2.0/HTML5, tracks entire visits and has UEM complaint resolution capability |
| On-premise solution | YES | YES | NO | YES |
| Response time (RT), CPU, API breakdown, suspensions (GC), IO time for each transaction | NO | NO | NO | YES |

Compuware dynaTrace is the APM solution that best fits Kweo’s distributed data processing platform.

We are able to trace through Netty -> Apache Kafka -> Storm/Trident -> zeromq -> Storm/Trident -> Cassandra.
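
Tracing across asynchronous hops such as Kafka or zeromq generally depends on carrying a correlation tag with every message. The following is a minimal sketch of that idea in plain Java. It does not use the dynaTrace ADK, and the class name `TraceEnvelope` as well as the `traceId|hop|payload` wire format are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical envelope that carries a trace tag alongside the payload.
// Not the dynaTrace ADK; the wire format "traceId|hop|payload" is an assumption.
final class TraceEnvelope {
    final String traceId; // stays constant for the whole transaction
    final int hop;        // incremented each time the message crosses a boundary
    final String payload;

    TraceEnvelope(String traceId, int hop, String payload) {
        this.traceId = traceId;
        this.hop = hop;
        this.payload = payload;
    }

    // Start a new trace at the entry point (for example a network handler).
    static TraceEnvelope start(String payload) {
        return new TraceEnvelope(UUID.randomUUID().toString(), 0, payload);
    }

    // Serialize before handing the message to a queue or socket.
    byte[] toBytes() {
        return (traceId + "|" + hop + "|" + payload).getBytes(StandardCharsets.UTF_8);
    }

    // Deserialize on the receiving side and count the hop.
    static TraceEnvelope fromBytes(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        String[] parts = s.split("\\|", 3); // limit 3 keeps '|' inside the payload intact
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed envelope");
        }
        return new TraceEnvelope(parts[0], Integer.parseInt(parts[1]) + 1, parts[2]);
    }
}
```

Each component would call `fromBytes` on receipt and `toBytes` before forwarding, so the same `traceId` can be used to join log lines or timing records from every hop.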

 

Topology Views

dynaTrace auto-maps the topology for each transaction (PurePath) or for entire timeframes, while providing instant, easy-to-read status overviews for each node and application (cluster).

 

Kweo topology

Compuware dynaTrace auto-mapped topology tracing through Netty, Apache Kafka, Storm/Trident, zeromq and Apache Cassandra

 

 

 

System monitoring out of the box

These monitoring views come out of the box with zero configuration. The monitoring dashboard is built into dynaTrace and requires no further configuration.

 

dynatrace host monitoring

dynatrace host monitoring out of the box

Application Performance Management and alerting with zero configuration

As before, the application monitoring views come out of the box with zero configuration. The monitoring dashboard is built into dynaTrace and requires no further configuration.

dynatrace application monitoring

dynatrace application monitoring out of the box

PurePaths – the distributed stack trace with detailed performance metrics

At Kweo we operate a highly distributed system that works asynchronously across many different protocols. It is important to understand what a single end-user click or service call does (or doesn’t do) in our system. Being able to trace back any distributed transaction is crucial to identifying issues quickly.

Purepath

dynaTrace PurePath

API distribution per transaction

Each PurePath can be viewed from the API distribution perspective. This makes it possible to see with one click whether a specific API (including 3rd party APIs) is responsible for excessive CPU usage or slow response times.

API-contribution
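
An API breakdown of this kind can be approximated by attributing the time of each method invocation to an API based on its package prefix. The sketch below shows that general technique only. It is not how dynaTrace computes its view, and the class `ApiBreakdown`, its methods and the prefix rules are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical classifier that attributes time to an API by package prefix.
final class ApiBreakdown {
    private final Map<String, String> prefixToApi = new LinkedHashMap<>();
    private final Map<String, Double> millisByApi = new LinkedHashMap<>();

    // Register a rule; rules are checked in insertion order,
    // so the most specific prefixes should be added first.
    ApiBreakdown rule(String packagePrefix, String api) {
        prefixToApi.put(packagePrefix, api);
        return this;
    }

    // Attribute the time of one method invocation to the first matching API.
    void record(String className, double millis) {
        String api = "Other";
        for (Map.Entry<String, String> e : prefixToApi.entrySet()) {
            if (className.startsWith(e.getKey())) {
                api = e.getValue();
                break;
            }
        }
        millisByApi.merge(api, millis, Double::sum);
    }

    // Share of the total recorded time per API, as a fraction between 0 and 1.
    Map<String, Double> shares() {
        double total = 0;
        for (double v : millisByApi.values()) {
            total += v;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : millisByApi.entrySet()) {
            out.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / total);
        }
        return out;
    }
}
```

Anything that matches no rule falls into the "Other" bucket, which keeps the shares summing to 1 and makes unclassified code visible.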

3 clicks to root cause – problem analysis made easy

Another effective way to quickly identify issues is to switch to the Contributors tab within a PurePath. This view flattens the distributed stack trace with its performance metrics and sorts it by the highest-contributing classes and methods. By default it also color-codes entries according to deviations in execution time.

 

solr_slow_sync

One click on a PurePath and one click on the Contributors tab identify the culprits (auto color-coded)
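
The idea behind such a contributors view, flattening a call tree and ranking methods by their total self time, can be sketched in a few lines. This illustrates the general technique and is not dynaTrace's implementation. The `CallNode` and `Contributors` types are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical call-tree node; selfMillis excludes time spent in children.
final class CallNode {
    final String method;
    final double selfMillis;
    final List<CallNode> children = new ArrayList<>();

    CallNode(String method, double selfMillis) {
        this.method = method;
        this.selfMillis = selfMillis;
    }

    // Attach a child and return this node so calls can be chained.
    CallNode add(CallNode child) {
        children.add(child);
        return this;
    }
}

final class Contributors {
    // Sum self time per method across the whole tree.
    static Map<String, Double> flatten(CallNode root) {
        Map<String, Double> totals = new HashMap<>();
        collect(root, totals);
        return totals;
    }

    private static void collect(CallNode node, Map<String, Double> totals) {
        totals.merge(node.method, node.selfMillis, Double::sum);
        for (CallNode child : node.children) {
            collect(child, totals);
        }
    }

    // Method names ordered from highest to lowest total self time.
    static List<String> ranked(CallNode root) {
        Map<String, Double> totals = flatten(root);
        List<String> names = new ArrayList<>(totals.keySet());
        names.sort((a, b) -> Double.compare(totals.get(b), totals.get(a)));
        return names;
    }
}
```

A method that is cheap per call but invoked from many places in the tree rises to the top of the ranking, which is exactly the kind of culprit that is hard to spot in the nested view.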

 

Some things are meant to be slow, but be ready to handle the consequences

Some transactions are not really problems because they work as designed (password encryption is deliberately expensive in CPU and time). But when they show up as in the image below, they are a reminder to put effective measures in place so that these expensive service calls don’t cause a massive system failure through overuse.

dynaTrace API contribution
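
One common measure against overuse of a deliberately expensive call is to cap how many of them may run at the same time. Here is a minimal sketch using `java.util.concurrent.Semaphore`. The class `ExpensiveCallGuard` and the chosen limit are assumptions for illustration, not a description of what Kweo runs in production.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical guard that caps how many expensive calls run at once.
final class ExpensiveCallGuard {
    private final Semaphore permits;

    ExpensiveCallGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a permit is free; otherwise returns empty right away
    // so the caller can reject or queue the request instead of piling up load.
    // Note: a null result from the work is also reported as empty.
    <T> Optional<T> tryRun(Supplier<T> work) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(work.get());
        } finally {
            permits.release(); // always hand the permit back, even on exceptions
        }
    }
}
```

Rejecting early rather than blocking keeps request threads free, so a burst of password checks slows down logins without starving the rest of the system.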

 

Auto-generated UML sequence diagrams for each transaction (PurePath)

This helps us validate our solution design against the real-life implementation. Something like 4000 Cassandra calls for a single transaction (even if each call takes only a few milliseconds) is immediately obvious and clearly points to a flaw in the design or implementation.

dynaTrace  Sequence Diagrams

dynaTrace auto-generated UML Sequence Diagrams
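
The same kind of check can also be made in code, for example in a test environment, by counting backend calls per transaction and flagging any transaction that exceeds a budget. The following sketch assumes a hypothetical `CallBudget` class and a caller-supplied transaction ID. It is not tied to any Cassandra driver or to dynaTrace.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-transaction counter for backend calls.
final class CallBudget {
    private final int maxCallsPerTransaction;
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    CallBudget(int maxCallsPerTransaction) {
        this.maxCallsPerTransaction = maxCallsPerTransaction;
    }

    // Record one backend call; returns true while the transaction is within budget.
    boolean record(String transactionId) {
        int n = counts.computeIfAbsent(transactionId, id -> new AtomicInteger())
                      .incrementAndGet();
        return n <= maxCallsPerTransaction;
    }

    // Forget a finished transaction and return how many calls it made.
    int finish(String transactionId) {
        AtomicInteger n = counts.remove(transactionId);
        return n == null ? 0 : n.get();
    }
}
```

Calling `record` from a thin wrapper around the database client and logging whenever it returns false surfaces chatty transactions without waiting for someone to open a sequence diagram.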

 

In the next post we explain in detail how we trace through Storm.

 

 

Engineering at Kweo.com

M-Square is the developer and operator of Kweo.com, a highly scalable User Engagement Platform that provides embeddable widgets to website operators and online media companies. Kweo is designed to handle millions of concurrent users and offers its subscribers features such as real-time comments with reddit-like ranking and polls/voting on a massive scale.

As real-time is crucial, the architecture of Kweo is fundamental to our promise that our subscribers and their end users have the best user experience possible.

This is the first of six articles in which we discuss our architecture and the design decisions we made.

 

Core Requirements

 

  • All Kweo widgets must be able to run on Kweo subscribers’ websites without blocking site content or impacting performance
  • Users don’t experience noticeable lag when they perform actions
  • All actions should feel instantly executed
  • Works on major platforms including mobile platforms
  • Kweo must be able to store and process millions of events per second (that’s important for polling/voting during major global events)
  • Kweo must be cost effective to operate / maintain
  • Must be fault-tolerant and highly available (works across datacenters and geographic regions)
  • Ability to instantly detect abnormal deviations in the Kweo stack, with auto-recovery and auto-scaling of infrastructure and applications

 

Kweo topology

 

Design Decisions

 

The core stack

 

 

DevOps

 

Future blog posts will describe in detail the libraries we use, the problems we have solved and the challenges of our daily engineering work.

 

Read the following blog posts:

Application Performance Management for distributed real-time data processing