In our first engineering blog post we discussed Kweo's core requirements and the high-level design decisions we made.
Because real-time user interaction is one of our most important requirements, Application Performance Management (APM) at scale is encoded into our engineering DNA. It starts with architecture and continues through solution design into the development, test and production environments.
The Kweo distributed real-time data processing stack consists of the following major components:
- Netty as an event-driven network application framework
- Apache Kafka as a fault-tolerant, high-throughput distributed messaging system
- Storm / Trident for distributed and fault-tolerant real-time computation
- Apache Cassandra as a fault-tolerant, distributed column-oriented database
We needed an APM solution that works in our development, test and production environments.
The solution needed to be able to provide:
- Infrastructure system monitoring and alerting
- Application monitoring and alerting
- Application Performance Management
- Shows response time, CPU cost, API breakdown, suspensions (garbage collection) and IO time for each trace
- Real-User monitoring
- Historical metric storage up to 365 days
- Runs in our environment and not as an external SaaS model
- Can be used for deep distributed transaction tracing and as a distributed profiler
- Always on, every transaction can be traced
- Can trace through our stack above end-to-end
- Works well in Amazon EC2
- Supports tracing through Apache, PHP, Java, zeromq and custom protocols
The APM candidates
|Capability||CA Wily Introscope||AppDynamics||NewRelic||Compuware dynaTrace|
|Operating System Agent||YES||YES||YES||YES|
|Development kits for custom tagging and instrumentation through zeromq and custom protocols||NO, classes can be instrumented but it can't trace through modern protocols, zeromq, Apache Kafka or Storm||YES but limited (can't trace through zeromq and Storm)||NO (Android and iOS coming soon)||YES via ADK (JS, Java, .NET, Android, iOS and native C)|
|Distributed Tracing always on||NO||NO, Sampling only||NO||YES|
|Distributed Profiler always on||NO||NO, Sampling only||NO||YES|
|Real-User monitoring||NO - Requires another product, Customer Experience Manager (CEM)||YES but limited for Web 2.0/HTML5 user experience, doesn't track entire visits||YES but limited for Web 2.0/HTML5 user experience, doesn't track entire visits||YES - supports Web 2.0/HTML5, tracks entire visits and has UEM complaint resolution capability|
|Response time, CPU, API breakdown, suspensions (GC), IO time for each transaction||NO||NO||NO||YES|
Compuware dynaTrace is the APM vendor with the best fit for Kweo’s distributed data processing platform
We are able to trace through Netty -> Apache Kafka -> Storm/Trident -> zeromq -> Storm / Trident -> Cassandra
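To follow a single transaction across asynchronous hops such as a Kafka topic or a zeromq socket, some correlation tag has to travel with each message. The block below is a minimal sketch of that general idea in plain Java. It is not the dynaTrace ADK API: `TraceTag`, `TracedMessage` and the `<traceId>|<hop>|<payload>` wire format are hypothetical names and choices made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical correlation tag: one trace id plus a hop counter. */
final class TraceTag {
    final String traceId;
    final int hop;

    TraceTag(String traceId, int hop) {
        this.traceId = traceId;
        this.hop = hop;
    }

    /** Starts a new trace at the entry point (for example the first network handler). */
    static TraceTag start() {
        return new TraceTag(UUID.randomUUID().toString(), 0);
    }

    /** Same trace, one hop further down the pipeline. */
    TraceTag nextHop() {
        return new TraceTag(traceId, hop + 1);
    }
}

/** Hypothetical envelope that carries the tag in front of the payload. */
final class TracedMessage {
    // Assumed wire format: "<traceId>|<hop>|<payload>" encoded as UTF-8.
    static byte[] encode(TraceTag tag, String payload) {
        return (tag.traceId + "|" + tag.hop + "|" + payload).getBytes(StandardCharsets.UTF_8);
    }

    static TraceTag decodeTag(byte[] wire) {
        String[] parts = split(wire);
        return new TraceTag(parts[0], Integer.parseInt(parts[1]));
    }

    static String decodePayload(byte[] wire) {
        return split(wire)[2];
    }

    // Limit of 3 keeps any '|' characters inside the payload intact.
    private static String[] split(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8).split("\\|", 3);
    }
}
```

Each component would decode the tag on receipt, attach it to whatever it measures, and call `nextHop()` before forwarding, so that all measurements sharing a `traceId` can later be stitched into one end-to-end path.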
dynaTrace auto-maps the topology for each transaction (PurePath) or for entire timeframes, whilst providing instant, easy-to-read status overviews for each node and application (cluster).
System monitoring out of the box
These monitoring views come out of the box: the monitoring dashboard is built into dynaTrace and requires no configuration.
Application Performance Management and alerting with zero configuration
As before, the application monitoring views come out of the box: the monitoring dashboard is built into dynaTrace and requires no configuration.
PurePaths – the distributed stack trace with detailed performance metrics
At Kweo we operate a highly distributed system that works asynchronously over many different protocols. It is important to understand what a single end-user click or service call does (or doesn’t do) in our system. Being able to trace back any distributed transaction is crucial to identifying issues quickly.
API distribution per transaction
Each PurePath can be looked at from the API distribution perspective. This makes it possible to see with one click whether a specific API (including third-party APIs) is responsible for excessive CPU usage or slow response times.
3 clicks to root cause – problem analysis made easy
Another effective way to quickly identify issues is to switch to the contributors tab within a PurePath. This view flattens the distributed stack trace with its performance metrics and sorts it by the highest-contributing classes and methods. By default it also color-codes entries by deviation in execution time.
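The flattening described above can be illustrated with a small sketch. It assumes a hypothetical `CallNode` tree in which every node records its total time including children; the ranking sums the self time (total minus children) per method and sorts it in descending order. This is our own illustration of the idea, not how dynaTrace computes the view.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical trace node: a method call, its total time, and its child calls. */
final class CallNode {
    final String method;
    final long totalMicros; // time including all children
    final List<CallNode> children = new ArrayList<>();

    CallNode(String method, long totalMicros) {
        this.method = method;
        this.totalMicros = totalMicros;
    }

    CallNode add(CallNode child) {
        children.add(child);
        return this;
    }

    /** Time spent in this method itself, excluding its children. */
    long selfMicros() {
        long childTotal = 0;
        for (CallNode c : children) {
            childTotal += c.totalMicros;
        }
        return totalMicros - childTotal;
    }
}

final class Contributors {
    /** Sums self time per method over the whole tree, highest contributor first. */
    static List<Map.Entry<String, Long>> rank(CallNode root) {
        Map<String, Long> selfByMethod = new HashMap<>();
        collect(root, selfByMethod);
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(selfByMethod.entrySet());
        ranked.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return ranked;
    }

    private static void collect(CallNode node, Map<String, Long> acc) {
        acc.merge(node.method, node.selfMicros(), Long::sum);
        for (CallNode c : node.children) {
            collect(c, acc);
        }
    }
}
```

Sorting by self time rather than total time keeps a cheap wrapper method from outranking the expensive call it merely delegates to.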
Some things are meant to be slow, but be ready to handle the consequences
Some transactions are not really problems because they work as designed (password encryption is deliberately expensive in CPU and time). But when they show up as in the image below, it is a reminder to implement effective measures so that these expensive service calls don’t cause a massive system failure through overuse.
Auto-generated UML sequence diagrams for each transaction (PurePath)
This helps us to validate our solution design against the real-life implementation. Things like 4000 Cassandra calls for a single transaction (even if each takes only a few milliseconds) will be right in your face, telling you clearly that there is a flaw in the design or implementation.
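One possible protective measure is to cap how many deliberately expensive calls may run at the same time and to reject the rest immediately instead of queueing them. The following is a minimal sketch using the JDK's `java.util.concurrent.Semaphore`; the class name `ExpensiveCallGuard` and the fail-fast policy are assumptions for illustration, not a description of what runs in our platform.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical guard that bounds concurrent executions of an expensive call. */
final class ExpensiveCallGuard {
    private final Semaphore permits;

    ExpensiveCallGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise rejects at once instead of waiting. */
    <T> T call(Supplier<T> expensive) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("too many expensive calls in flight");
        }
        try {
            return expensive.get();
        } finally {
            // Always hand the permit back, even if the call throws.
            permits.release();
        }
    }
}
```

A caller would wrap the expensive operation, for example `guard.call(() -> hash(password))` with a hypothetical `hash` method, and translate the rejection into a retry-later response. Failing fast keeps a burst of such requests from starving the CPU for all other transactions.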
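The same kind of check can also be automated, for instance in a test environment, by counting backend calls per transaction and flagging any transaction that exceeds a budget. The sketch below assumes a hypothetical `CallBudget` helper that is invoked once per backend call with the transaction's id; the names and the idea of a fixed budget are ours for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-transaction call counter with a fixed budget (not thread-safe). */
final class CallBudget {
    private final int maxCallsPerTransaction;
    private final Map<String, Integer> callsByTransaction = new HashMap<>();

    CallBudget(int maxCallsPerTransaction) {
        this.maxCallsPerTransaction = maxCallsPerTransaction;
    }

    /** Records one backend call; returns true while the transaction is within budget. */
    boolean record(String transactionId) {
        int count = callsByTransaction.merge(transactionId, 1, Integer::sum);
        return count <= maxCallsPerTransaction;
    }

    /** Number of calls recorded so far for the given transaction. */
    int count(String transactionId) {
        return callsByTransaction.getOrDefault(transactionId, 0);
    }
}
```

A test could then fail as soon as `record` returns `false`, which surfaces an accidental per-row query loop long before it shows up as thousands of calls in a sequence diagram.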
In the next post we explain in detail how we trace through Storm.