
From Data Lake to NoSQL Table

Nataliya Karpenko

Intro

Eyeview is a video advertising company that specializes in delivering relevant ads to users across their devices in real time. As with any data-driven company, the quality of our data determines the power of our algorithms and, ultimately, the effectiveness of our ads. Part of the challenge we face is maintaining data integrity across the many services that use it. In this post, we focus on one challenge related to audience data, a dataset that is critical for accurate ad targeting.

The Challenge of Many Hops and Data Integrity:

Audience data comes to us from external partners. It travels across five different services before it gets to its most important destination (Aerospike). There, it gets accessed during real-time bidding to match a user with an audience in a matter of milliseconds. The services consuming the data are:

  1. S3: distributed file storage provided by AWS
  2. Spark: our big data lake
  3. AWS Kinesis: data bus used to persist and share data via streams
  4. AWS KCL apps: data enrichment and dispatching services on top of Kinesis
  5. Aerospike: scalable key-value store used for real-time bidding

Each data hop presents its own challenges in checking integrity and ensuring that no data is lost. Here, we describe an issue we found in one of these interactions, the one between Spark and Kinesis, and the approach we took to solve it.

The Challenge of Dealing with AWS Kinesis:

Sending data to Kinesis is not straightforward because of the limits it imposes on the sender. The limits are expressed as the rate of batches sent, the rate of bytes sent, and the maximum number of entries and bytes per batch. Violating any of them causes Kinesis to reject the data with an exception. The effective limits can vary significantly with the capacity of the stream, which is expressed in units of parallelism called shards. To deal with this, we developed a component called Kinesis Buffered Client (KBC), which uses buffering and throttling to stay within the rate limits. KBC uses a common worker pool pattern based on Java's executor framework: a queue for requests and a pool of thread-based workers that take items off the queue and process them in the background.
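To make the pattern concrete, here is a minimal sketch of a buffered, throttled sender built on Java's executor framework. It is not the actual KBC code. The class name `BufferedSenderSketch`, its parameters, and the `sender` callback (a stand-in for a real Kinesis put call) are all hypothetical, and the limits passed in are illustrative values rather than real Kinesis quotas. Retries and error handling are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a buffered, throttled sender (not the real KBC). */
class BufferedSenderSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers;
    private final int maxEntriesPerBatch;
    private final int maxBytesPerBatch;
    private final long minMillisBetweenBatches;
    private final Consumer<List<String>> sender; // stand-in for a real put call
    private volatile boolean running = true;

    BufferedSenderSketch(int workerCount, int maxEntriesPerBatch, int maxBytesPerBatch,
                         long minMillisBetweenBatches, Consumer<List<String>> sender) {
        this.maxEntriesPerBatch = maxEntriesPerBatch;
        this.maxBytesPerBatch = maxBytesPerBatch;
        this.minMillisBetweenBatches = minMillisBetweenBatches;
        this.sender = sender;
        this.workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.submit(this::drainLoop);
        }
    }

    /** Buffers a record and returns immediately; workers send it later. */
    void put(String record) {
        queue.add(record);
    }

    /** Worker loop: build a batch within the limits, send it, then pause. */
    private void drainLoop() {
        String carry = null; // record that did not fit into the previous batch
        try {
            while (true) {
                List<String> batch = new ArrayList<>();
                int bytes = 0;
                String rec = (carry != null) ? carry : queue.poll(20, TimeUnit.MILLISECONDS);
                carry = null;
                while (rec != null) {
                    int size = rec.getBytes(StandardCharsets.UTF_8).length;
                    // An oversized single record is still sent alone in this sketch.
                    if (!batch.isEmpty() && bytes + size > maxBytesPerBatch) {
                        carry = rec;
                        break;
                    }
                    batch.add(rec);
                    bytes += size;
                    if (batch.size() == maxEntriesPerBatch) {
                        break;
                    }
                    rec = queue.poll();
                }
                if (!batch.isEmpty()) {
                    sender.accept(batch);
                    // Crude per-worker throttle; the aggregate rate scales with worker count.
                    Thread.sleep(minMillisBetweenBatches);
                } else if (!running) {
                    return; // nothing buffered and shutdown was requested
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting work and waits for buffered records to be sent. */
    void shutdownAndDrain() {
        running = false;
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BufferedSenderSketch client = new BufferedSenderSketch(
                2, 3, 1024, 10, batch -> System.out.println("sending " + batch));
        for (int i = 0; i < 10; i++) {
            client.put("record-" + i);
        }
        client.shutdownAndDrain();
    }
}
```

Callers only touch the queue, so `put` never blocks on the network, and the batch size, byte size, and pause between batches are enforced in one place. A production client would likely derive those limits from the stream's shard count and retry rejected batches, which this sketch does not attempt.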

Author Alex Belyansky
