What I learned from a one-day hack building real-time streaming with AWS Lambda and Kinesis

It was a rainy Saturday, a perfect day for hacking together a real-time streaming prototype on AWS.  I first thought of Apache Kafka and Storm but then decided to give AWS Kinesis a try, as I didn’t want to spend my time standing up infrastructure: the goals were minimal code, zero setup and zero configuration pain.  This blog describes how easy it was to build a real-time streaming platform on AWS for real-time applications like aggregation, counts, top 10, etc.  Several key questions are also discussed:

  • Kafka vs. Kinesis – How do you decide?
  • KCL SDK vs. Lambda – Which is better for writing streaming apps?
  • Streams, shards, partition keys and record structure – Design guidelines
  • Debugging
  • Automation of Kinesis in production

Sketch your conceptual streaming architecture

The first step is to think about the type and volume of data you plan to stream in real time, and the number of real-time applications that will consume it.

The next question is how you will ingest the data in real time from its producers.  Two common approaches are shown below: API Gateway + Lambda, and a native SDK streaming library (aws-sdk), the latter being the much faster way to push data.
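To make the first approach concrete, here is a minimal sketch of an API Gateway-backed Lambda function that forwards each incoming request into a stream.  The stream name, partition key choice and payload shape are assumptions for illustration, not part of the prototype:

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();

// Hypothetical ingest handler behind API Gateway: forwards the request
// payload into a Kinesis stream (stream name below is an assumption)
exports.handler = function(event, context) {
  var params = {
    Data: JSON.stringify(event),      // the incoming API Gateway payload
    PartitionKey: String(Date.now()), // naive key; use a real entity id in practice
    StreamName: 'my-ingest-stream'
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) { context.fail(err); }
    else { context.succeed({sequenceNumber: data.SequenceNumber}); }
  });
};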

Moving on, there are Kinesis design decisions such as how many real-time streams you need, as well as their capacity.  Concepts such as shards and partition keys need to be decided at this point, although they can be modified later as well.  For this prototype, I picked a few server endpoints with a single-shard Kinesis stream as shown below.  All the red lines were implemented as part of this prototype.

[Figure: conceptual real-time streaming architecture with Kinesis]

Once the initial conceptual architecture for ingestion is thought through, the next most important design decisions are about the applications that will consume the real-time stream.  As seen from this diagram, there are multiple ways in which apps can consume stream data – Lambda, the AWS Kinesis Client Library (KCL) and the AWS Kinesis API, as well as Apache Storm through connectors.  In my serverless philosophy of minimal infrastructure, I went ahead with Lambda functions since my application was simple.  However, for complex apps, the best enterprise option is to use KCL or even Storm to reduce dependency and lock-in on AWS.  Also, note that KCL practically requires that you be a Java shop, as the tooling for writing in other languages like JS and Ruby (wrappers around the MultiLangDaemon) is simply too complex and non-intuitive to use.

Big data lifecycle beyond real-time streaming

Big data has a lifecycle – real-time streaming and real-time applications alone are not enough.  Usually, in addition to being streamed, this data is also aggregated, archived and stored for further processing by downstream batch jobs.  This requires connectors from Kinesis to downstream systems such as S3, Redshift, Storm, Hadoop, EMR and others.  Check this paper out for further details.

Now that we have thought about the data architecture and the big picture, it is time to start building.  We followed three steps: create streams, build producers, and build consumers.

Create Streams

Streams can be created manually from the AWS management console or programmatically, as shown by the Node.js code snippet below.

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();

var params = {StreamName: streamName, ShardCount: 1};
kinesis.createStream(params, function(err, data) {
  // Treat "stream already exists" as success
  if (err && err.code !== 'ResourceInUseException') {
    callback(err);
    return;
  }
  callback();
});

Note that streams can take several minutes to be created, so it is essential to wait before using them, as shown below, by periodically checking whether the stream is ACTIVE.

function waitForActive() {
  kinesis.describeStream({StreamName: streamName}, function(err, data) {
    if (err) { callback(err); return; }
    if (data.StreamDescription.StreamStatus === 'ACTIVE') { callback(); }
    else {
      // Not ready yet; poll again in a few seconds
      setTimeout(waitForActive, 5000);
    }
  });
}

Helloworld Producer

This step can be done by invoking the Kinesis SDK’s kinesis.putRecord call.  Note that the partition key, stream name (created in the previous step) and data are required to be passed to this method as a JSON structure.

var recordParams = {
  Data: data, // can be just a JSON record that is pushed to Kinesis
  PartitionKey: partitionKey,
  StreamName: streamName
};
kinesis.putRecord(recordParams, function(err, rdata) {
  // ...
});

This should start sending data to the stream we just created.
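For quick testing from a laptop, a simple driver loop can keep pushing sample records; the payload shape, keys and interval below are assumptions for illustration:

// Hypothetical test driver: push a small random record every 100 ms
setInterval(function() {
  var recordParams = {
    Data: JSON.stringify({ts: Date.now(), value: Math.random()}),
    PartitionKey: 'key-' + Math.floor(Math.random() * 10),
    StreamName: streamName
  };
  kinesis.putRecord(recordParams, function(err) {
    if (err) console.error('put failed:', err);
  });
}, 100);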

Helloworld Lambda Consumer

Using the management console, a Lambda function can easily be written that listens to the stream as an event source and gets invoked to perform business analytics on a batch of records in real time.

exports.handler = function(event, context) {
  event.Records.forEach(function(record) {
    // Kinesis record payloads arrive base64-encoded
    var payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
    // do your processing here
  });
  context.succeed("Successfully processed " + event.Records.length + " records.");
};
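The event source wiring can also be scripted rather than clicked through the console; this sketch uses the aws-sdk createEventSourceMapping call, with a placeholder stream ARN and function name:

var lambda = new AWS.Lambda();

// Attach the Kinesis stream as an event source for the Lambda function
// (the ARN and function name below are placeholders)
lambda.createEventSourceMapping({
  EventSourceArn: 'arn:aws:kinesis:us-east-1:123456789012:stream/my-ingest-stream',
  FunctionName: 'myStreamConsumer',
  StartingPosition: 'TRIM_HORIZON', // start from the oldest available record
  BatchSize: 100                    // max records handed to each invocation
}, function(err, data) {
  if (err) console.error(err);
  else console.log('mapping UUID:', data.UUID);
});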

In less than a day, we had created a fully operational real-time streaming platform that can scale and run applications in real time.  Now, let us take a look at some key findings and takeaways.

Learnings

Kafka vs. Kinesis – Kafka requires a lot of management and operations effort to keep a cluster running at scale across datacenters: mirroring, keeping it secure and fault tolerant, and monitoring disk space allocation. Kinesis can be thought of as “Kafka as a service” where operations and management costs are almost zero. Kafka, on the other hand, has more features, such as ‘topics’, which Kinesis doesn’t provide, but this should not be the deciding factor until all the factors are considered. If your company has made a strategic decision to run on AWS, the obvious choice is Kinesis, as it has the advantage of ecosystem integration with Lambda, AWS data sources and AWS-hosted databases.  If you are mostly on-prem, or require a solution that needs to run both on-prem and as SaaS, then you may have no choice but to invest in Kafka.

Stream Processing Apps – Which is better, KCL vs. Lambda vs. API? – Lambda is best for simple use cases and is a serverless approach that requires no EC2 instances or complex management.  KCL, on the other hand, requires a lot more operational management, but provides a more extensive library for complex stream applications: worker processes, integration with other AWS services like DynamoDB, fault tolerance and load balancing of streaming data.  Go with KCL unless your processing is simple or only needs standard integrations with other AWS services, in which case Lambda is better.  Storm is another approach to consider for reducing lock-in, but I didn’t investigate it enough to know the challenges of using it.  Storm spouts are on my plate to evaluate next.

Streams – How many? – It is always difficult to decide how many streams you need. If you mix all the different types of data in one stream, not only could you hit scalability limits per stream, but your stream processing apps will have to read everything and then filter out the records they want to process. Kinesis doesn’t have server-side filtering; every application must read everything.  If you have independent data records requiring independent processing, keep the streams separate.
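Since each consumer sees the full stream, filtering has to happen in application code. A minimal sketch, assuming each record carries a ‘type’ field (an assumption about the record schema):

exports.handler = function(event, context) {
  event.Records.forEach(function(record) {
    var payload = JSON.parse(
      new Buffer(record.kinesis.data, 'base64').toString('ascii'));
    // Client-side filter: this app only cares about 'click' records
    // ('type' is an assumed field in the record schema)
    if (payload.type !== 'click') { return; }
    // process the click record here
  });
  context.succeed("Processed " + event.Records.length + " records.");
};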

Shards – How many? – The key decision points are record throughput and the read concurrency needed during processing; each shard supports roughly 1 MB/s (up to 1,000 records/s) of writes and 2 MB/s of reads.  The AWS management console provides a recommendation for the number of shards needed.  Of course, you don’t need to get this right when creating the stream, as resharding is possible later on.
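Resharding is done one shard at a time; here is a sketch of splitting a shard in two with splitShard, where the shard id is a placeholder and the new starting hash key assumes the shard covers the full hash range (as in a single-shard stream):

// Split a hot shard into two; shard id below is a placeholder.
// 2^127 is the midpoint of the full MD5 hash key range.
kinesis.splitShard({
  StreamName: streamName,
  ShardToSplit: 'shardId-000000000000',
  NewStartingHashKey: '170141183460469231731687303715884105728'
}, function(err, data) {
  if (err) console.error(err);
});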

Producer – small or big chunks? – It is better to minimize network calls when pushing data through the SDK into AWS, so some aggregation and chunking is needed at the client endpoints.  This needs to be traded off against the staleness of the data; holding on to data and aggregating it for a few seconds should be acceptable in most cases.
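One concrete way to do this is the batch putRecords API, which accepts up to 500 records per call; the local buffer and flush threshold in this sketch are assumptions:

var buffer = [];

// Buffer records locally, then flush the batch in one network call
function send(data, partitionKey) {
  buffer.push({Data: JSON.stringify(data), PartitionKey: partitionKey});
  if (buffer.length >= 100) { flush(); } // flush threshold is an assumption
}

function flush() {
  if (buffer.length === 0) { return; }
  var batch = buffer.splice(0, buffer.length);
  kinesis.putRecords({StreamName: streamName, Records: batch}, function(err, res) {
    if (err) { console.error(err); }
    else if (res.FailedRecordCount > 0) {
      console.warn(res.FailedRecordCount + ' records failed; retry them');
    }
  });
}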

Debugging – still tough – It was a bit of a challenge to debug Kinesis and Lambda, as the logs and CloudWatch metrics showed up with delays.  Better tooling is needed here.

Kinesis DevOps Automation – For production operations, automation of Kinesis will be needed.  Three prominent capabilities needed here are stream creation per region, a single pane of glass for multiple streams, and stream scaling automation.  Scaling streams is still manual and error-prone, but that is a good problem to have when the amount of ingested data is growing due to a larger number of customers.
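As a starting point for a single pane of glass, the SDK can enumerate streams in a region along with their status and shard counts (a sketch; pagination via HasMoreStreams is omitted for brevity):

// Minimal sketch: list every stream in the region with status and shard count
kinesis.listStreams({}, function(err, data) {
  if (err) { console.error(err); return; }
  data.StreamNames.forEach(function(name) {
    kinesis.describeStream({StreamName: name}, function(err, d) {
      if (err) { console.error(err); return; }
      console.log(name, d.StreamDescription.StreamStatus,
                  d.StreamDescription.Shards.length + ' shards');
    });
  });
});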

Final Word

In less than a day, I had a successful real-time stream running in which I could push large volumes of data records from my laptop or a REST endpoint through the AWS API Gateway into a Kinesis stream, with two Lambda applications reading this data and doing simple processing in real time.  AWS Kinesis together with Lambda provides a compelling real-time data processing platform.
