---
title: "Building a Serverless Impression Tracking System from Scratch: Design Decisions and AWS Data Pipeline in Practice"
date: 2025-12-02
lang: en
---

Building a Serverless Impression Tracking System from Scratch: Design Decisions and AWS Data Pipeline in Practice

Introduction

I recently needed to design an impression tracking system to measure link impressions across partner websites. While the requirements seemed straightforward, the implementation surfaced several interesting engineering challenges. This post documents my design decisions and shares some lessons learned while building a data pipeline with AWS S3, Glue, and Athena.

Note: This system just went live. I’ll update this post as we gather more production experience and metrics.

Why Not Just Use the Main Application Database?

Before diving into the architecture, it’s worth explaining why I didn’t simply write events directly to our main application database.

The most straightforward approach would be: Client → API → Sidekiq Job → PostgreSQL. Simple, familiar, and uses existing infrastructure. However, this approach has significant scalability concerns:

  1. Database load: Impression events can generate extremely high traffic—potentially millions of writes per day. Even with async processing via Sidekiq, this volume would put substantial pressure on our primary database, competing with critical business transactions.

  2. Queue saturation: Sidekiq queues have finite capacity. A traffic spike in impression events could flood the queue, delaying other important background jobs like email delivery or payment processing.

  3. Operational risk: Coupling analytics data ingestion with the main application creates unnecessary risk. An analytics traffic surge shouldn’t be able to degrade core application performance.

  4. Cost inefficiency: Row-oriented databases like PostgreSQL aren’t optimized for analytics queries. Aggregating millions of events would be slow and resource-intensive.

This led me to a dedicated analytics pipeline—decoupled from the main application, optimized for high-volume writes and analytical queries.

1. Requirements Analysis

The Business Context

Our publishers integrate our content in two ways:

  1. Inline links: Publishers insert our links directly into their articles and content
  2. Embedded widgets: Publishers embed our widget via iframe, which contains multiple links

To track impressions across both scenarios, we deploy our JavaScript SDK in two places:

  • On the publisher’s page: to track inline links and monitor iframe visibility
  • Inside the iframe: to track links within the widget

These two SDK instances work together—the parent page SDK tells the iframe SDK when the iframe is actually visible to the user, enabling accurate impression tracking regardless of where links appear.

Core Requirements

The technical requirements boil down to:

  1. Frontend collection: Detect when specific links truly enter the user’s viewport
  2. Data transmission: Reliably send impression events to the backend
  3. Data storage: Store massive event data cost-effectively
  4. Data analysis: Support flexible SQL queries for business analytics

Simple enough on the surface, but “entering the viewport” deserves careful definition. Does a link count as an impression the moment it appears on the page? What if the user scrolls past it instantly? What if the link is inside an iframe that’s been scrolled out of the parent page’s viewport?

2. Client SDK Design

2.1 Defining a “Real Impression”

Not every appearance counts as a valid impression. The definition needs two key parameters:

  1. Visibility threshold: What percentage of the element must be visible?
  2. Duration threshold: How long must the element remain visible?

The reasoning behind tuning these:

  • A visibility threshold that is too high would miss many valid impressions (users may notice content even when it is only partially visible)
  • A duration that is too short would count “phantom impressions” during fast scrolling
  • A duration that is too long would miss genuine impressions during quick browsing

The exact values depend on your business requirements and can be tuned based on data analysis.
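
For concreteness, the two parameters can live as a pair of constants in the SDK. The values below are placeholders for illustration, not our production settings:

// Placeholder values for illustration only; tune against your own data.
const VISIBILITY_THRESHOLD = 0.5;    // at least 50% of the link must be visible
const DURATION_THRESHOLD_MS = 1000;  // and stay visible for at least 1 second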

2.2 The Power of IntersectionObserver

The naive approach to detecting element visibility is listening to scroll events and calculating positions manually. This has two problems: high performance overhead and complex calculation logic.

Modern browsers provide the IntersectionObserver API, which efficiently reports when an element enters or leaves the viewport and its visibility ratio. Basic usage:

const observer = new IntersectionObserver((entries) => {
  entries.forEach(entry => {
    if (entry.isIntersecting && entry.intersectionRatio >= VISIBILITY_THRESHOLD) {
      // Element meets visibility threshold
      startTimer(entry.target);
    } else {
      clearTimer(entry.target);
    }
  });
}, {
  threshold: [0, VISIBILITY_THRESHOLD]
});

observer.observe(linkElement);
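
The startTimer and clearTimer helpers referenced above are where the duration threshold comes in. A minimal sketch, assuming the DURATION_THRESHOLD_MS constant from earlier and a hypothetical recordImpression function that pushes the event into the send queue:

const timers = new Map();
const alreadyCounted = new WeakSet();

function startTimer(element) {
  if (alreadyCounted.has(element) || timers.has(element)) return;
  timers.set(element, setTimeout(() => {
    timers.delete(element);
    alreadyCounted.add(element);   // count each link at most once per page view
    recordImpression(element);     // hypothetical: queues the impression event
  }, DURATION_THRESHOLD_MS));
}

function clearTimer(element) {
  if (timers.has(element)) {
    clearTimeout(timers.get(element));
    timers.delete(element);
  }
}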

2.3 Coordinating Parent and iframe SDKs

This is where our dual-SDK architecture becomes essential. Remember: the SDK runs both on the publisher’s page and inside our iframe widget. For inline links on the publisher’s page, tracking is straightforward. But for links inside the iframe, we face a challenge: a link being visible within the iframe’s viewport doesn’t mean the user can see it—the iframe itself might be scrolled out of the parent page’s viewport.

The solution is cross-frame communication between the two SDK instances:

// Parent page SDK: Monitor iframe visibility and notify the iframe SDK
const iframeObserver = new IntersectionObserver((entries) => {
  entries.forEach(entry => {
    iframe.contentWindow.postMessage({
      type: 'visibility-update',
      isVisible: entry.isIntersecting,
      visibleRegion: entry.intersectionRect
    }, '*');
  });
}, { threshold: [0, 0.1, 0.5, 1.0] });

iframeObserver.observe(iframe);

// iframe SDK: Receive visibility info from parent SDK
window.addEventListener('message', (e) => {
  if (e.data.type === 'visibility-update') {
    parentVisibility = e.data;
    // Re-evaluate true visibility of internal links
    // A link is only "visible" if: visible in iframe viewport AND iframe visible in parent viewport
  }
});
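
To make that last comment concrete, here is one way the iframe SDK could combine the two signals. This is a sketch that assumes parentVisibility holds the latest message from the parent and that each link's own IntersectionObserver entry is at hand:

// A link inside the iframe counts as visible only when both conditions hold:
// it meets the threshold within the iframe's viewport, and the iframe itself
// is visible within the parent page's viewport.
function isTrulyVisible(entry) {
  const visibleInIframe =
    entry.isIntersecting && entry.intersectionRatio >= VISIBILITY_THRESHOLD;
  const iframeVisibleInParent = !!(parentVisibility && parentVisibility.isVisible);
  return visibleInIframe && iframeVisibleInParent;
}

A stricter version could also intersect visibleRegion with the link's bounding rectangle to handle a partially visible iframe.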

2.4 Data Transmission Strategy

Frequent requests increase server load and network overhead, but excessive delays risk losing data when users leave the page. My strategy:

  • Batching: flush when 20 events accumulate, or 2 seconds after the most recent event (see the sketch after the snippet below)
  • Maximum wait time: regardless of batch size, flush no later than 5 seconds after the first buffered event
  • Use sendBeacon on page unload: the browser queues the request and keeps trying to deliver it even after the page is closed

// sendBeacon is more reliable than fetch during page unload
window.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    const blob = new Blob([JSON.stringify(events)], { type: 'text/plain' });
    navigator.sendBeacon(apiUrl, blob);
  }
});
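
A sketch of the batching rules above (20 events, 2-second inactivity flush, 5-second cap from the first event), assuming a hypothetical sendBatch helper that POSTs the accumulated events:

const MAX_BATCH_SIZE = 20;
const INACTIVITY_FLUSH_MS = 2000;
const MAX_WAIT_MS = 5000;

let events = [];
let inactivityTimer = null;
let maxWaitTimer = null;

function enqueue(event) {
  events.push(event);
  if (events.length === 1) {
    // Hard cap: never hold the first event longer than MAX_WAIT_MS
    maxWaitTimer = setTimeout(flush, MAX_WAIT_MS);
  }
  // Debounce: flush after a quiet period following the most recent event
  clearTimeout(inactivityTimer);
  inactivityTimer = setTimeout(flush, INACTIVITY_FLUSH_MS);
  if (events.length >= MAX_BATCH_SIZE) flush();
}

function flush() {
  clearTimeout(inactivityTimer);
  clearTimeout(maxWaitTimer);
  if (events.length === 0) return;
  sendBatch(events);   // hypothetical: fetch(apiUrl, { method: 'POST', body: JSON.stringify(events) })
  events = [];
}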

2.5 Payload Compression

Impression events fire frequently, and payload size directly impacts bandwidth costs. To keep payloads small, I abbreviated field names:

// Full format
{ event_type: 'link_impression', timestamp: 1701234567890, link: '...', label: '...' }

// Compressed format
{ et: 'link_impression', ts: 1701234567890, l: '...', lb: '...' }

Additionally, shared fields like page URL and referrer are hoisted to the batch level to avoid redundant transmission.
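
Putting the two ideas together, a batch payload might look roughly like this (the field names p, r, and ev are illustrative, not a documented format):

// Shared context is sent once per batch; each event carries only what varies.
const batch = {
  p: location.href,        // page URL, hoisted to batch level
  r: document.referrer,    // referrer, hoisted to batch level
  ev: [
    { et: 'link_impression', ts: 1701234567890, l: 'https://example.com/a', lb: 'Read more' },
    { et: 'link_impression', ts: 1701234569120, l: 'https://example.com/b', lb: 'Shop now' }
  ]
};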

3. Backend Architecture Design

3.1 Why Columnar Storage for Analytics

Before discussing the architecture, let’s understand why columnar storage is ideal for analytics workloads.

Row-oriented vs Columnar Storage

Traditional databases like PostgreSQL store data row by row. When you query SELECT page, COUNT(*) FROM events GROUP BY page, the database must read entire rows even though you only need one column. For a table with 20 columns and 10 million rows, this is extremely wasteful.

Columnar databases store data column by column. The same query only reads the page column—potentially 20x less I/O. This becomes even more impactful with compression: similar values within a column compress much better than mixed values across a row.

Why Parquet?

Parquet is an open-source columnar file format that brings these benefits to file-based storage:

  • Columnar layout: Only read columns needed for your query
  • Efficient compression: SNAPPY compression typically achieves 3-5x compression ratios
  • Predicate pushdown: Skip reading data blocks that don’t match your filters
  • Schema embedded: Self-describing format, no external schema required for basic reads
  • Ecosystem support: Works with Spark, Athena, BigQuery, DuckDB, and virtually every analytics tool

For our use case, Parquet stored in S3 provides durability, scalability, and cost-effectiveness. S3’s pricing (roughly $0.023/GB/month for Standard) is orders of magnitude cheaper than database storage.

3.2 Serverless First

For event collection systems like this, traffic patterns are often highly variable—peak traffic can be 10x normal levels. The pay-per-use and auto-scaling characteristics of serverless architecture are ideal for this scenario.

The final architecture:

Client SDK → CloudFront → API Gateway → Lambda → Kinesis Firehose → S3 (Parquet)
                                                                         ↑
                                                                   Athena (queries)
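
The Lambda in this pipeline does little more than validate the batch and hand it to Firehose. A minimal sketch using the AWS SDK v3 Firehose client; the stream-name environment variable, the batch shape, and the field expansion are assumptions carried over from the SDK sections above, not production code:

import { FirehoseClient, PutRecordBatchCommand } from "@aws-sdk/client-firehose";

const firehose = new FirehoseClient({});

export const handler = async (event) => {
  const batch = JSON.parse(event.body);   // batch envelope sent by the client SDK

  // Expand the abbreviated field names back to the full Glue schema and
  // copy batch-level context (page URL) onto each event.
  const records = batch.ev.map((e) => ({
    Data: Buffer.from(JSON.stringify({
      event_type: e.et,
      timestamp: e.ts,
      link: e.l,
      page: batch.p,
    })),
  }));

  await firehose.send(new PutRecordBatchCommand({
    DeliveryStreamName: process.env.FIREHOSE_STREAM_NAME,  // assumed env var
    Records: records,
  }));

  return { statusCode: 204 };
};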

3.3 Why Kinesis Firehose

Initially, I considered having Lambda write directly to S3, but quickly identified the problems:

  • Writing one file per request creates massive numbers of small files, severely impacting query performance
  • Implementing buffer logic ourselves is both complex and error-prone

Firehose solves these problems elegantly:

  • Automatic buffering, configurable by size (e.g., 128MB) or time (e.g., 15 minutes)
  • Built-in Parquet format conversion
  • Automatic time-based partitioning when writing to S3

Understanding How Athena Queries Work

To appreciate why file size matters, let’s understand how Athena processes queries. When you run a query, Athena:

  1. Lists all files in the relevant S3 partitions
  2. Opens each file and reads the Parquet metadata (column statistics, row groups)
  3. Applies predicate pushdown to skip irrelevant row groups
  4. Reads only the required columns from matching row groups

Each file incurs overhead: S3 API calls, connection establishment, and metadata parsing. With thousands of tiny files (say, 100KB each), this overhead dominates query time. With fewer, larger files (say, 128MB each), the overhead becomes negligible relative to actual data scanning.

Buffer Settings: The Trade-offs

Firehose flushes data to S3 when either the size threshold or the time threshold is reached—whichever comes first.

The considerations:

  • Size threshold: Larger files generally improve query performance by reducing per-file overhead. However, extremely large files (over 256MB) can slow down queries because Athena can’t parallelize within a single file as effectively. The optimal size depends on your query patterns and data volume—this requires experimentation.

  • Time threshold: This ensures data doesn’t sit in the buffer too long during low-traffic periods. For near-real-time requirements, you might lower this to a few minutes, accepting smaller files as the trade-off.

  • Low traffic reality: During quiet hours, files will be smaller than your size threshold because the time threshold triggers first. This is acceptable—a few small files won’t significantly impact query performance.

The UTC Partition Caveat

One limitation: Firehose partitions by UTC timestamp only. The !{timestamp:yyyy} syntax uses the time when Firehose processes the record, not a timestamp from your data, and it’s always in UTC.

This means if your business operates in UTC+8, a query for “January 15th business day” might need to scan both January 14th and January 15th UTC partitions. In practice, this isn’t a major issue:

  • Athena’s per-query cost is based on data scanned, not partitions opened
  • Scanning an extra day’s partition typically adds minimal cost (a few extra MB to scan)
  • The alternative (custom partitioning logic) adds significant complexity

For our use case—daily batch aggregation—scanning 2-3 days of partitions instead of 1 is perfectly acceptable.
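
For the daily aggregation job, the UTC partitions covering a local business day can be computed up front. A small sketch for a UTC+8 business day; the function name is made up for illustration:

// Returns the UTC date partitions (usually two) that cover one UTC+8 business day.
function utcPartitionsForLocalDay(localDate, offsetHours = 8) {
  const startUtc = new Date(`${localDate}T00:00:00Z`);
  startUtc.setUTCHours(startUtc.getUTCHours() - offsetHours);   // local midnight in UTC
  const endUtc = new Date(`${localDate}T23:59:59Z`);
  endUtc.setUTCHours(endUtc.getUTCHours() - offsetHours);       // local end of day in UTC

  const toPartition = (d) => ({
    year: String(d.getUTCFullYear()),
    month: String(d.getUTCMonth() + 1).padStart(2, '0'),
    day: String(d.getUTCDate()).padStart(2, '0'),
  });
  return [toPartition(startUtc), toPartition(endUtc)];
}

// utcPartitionsForLocalDay('2024-01-15')
// → [{ year: '2024', month: '01', day: '14' }, { year: '2024', month: '01', day: '15' }]

The query then lists both partitions in its WHERE clause and filters the timestamp column for the exact local-day boundaries.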

Configuration example (using Pulumi):

const firehose = new aws.kinesis.FirehoseDeliveryStream("events-firehose", {
  destination: "extended_s3",
  extendedS3Configuration: {
    bucketArn: analyticsBucket.arn,
    roleArn: firehoseRole.arn,  // role Firehose assumes to write to S3 and read the Glue schema
    prefix: "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    bufferingSize: 128,       // 128 MB
    bufferingInterval: 900,   // 15 minutes
    dataFormatConversionConfiguration: {
      enabled: true,
      inputFormatConfiguration: {
        deserializer: { openXJsonSerDe: {} }   // incoming records are JSON
      },
      outputFormatConfiguration: {
        serializer: {
          parquetSerDe: { compression: "SNAPPY" }
        }
      },
      schemaConfiguration: {
        databaseName: glueDatabase.name,
        tableName: glueTable.name,
        roleArn: firehoseRole.arn
      }
    }
  }
});

3.4 Glue: More Than an ETL Tool

Many people think of AWS Glue as “an ETL tool,” but its Data Catalog functionality is the real gem.

Glue Data Catalog is essentially a metadata management service. You define table schemas, and services like Firehose, Athena, and Spark can all consume them.

const glueTable = new aws.glue.CatalogTable("events-table", {
  databaseName: glueDatabase.name,
  tableType: "EXTERNAL_TABLE",
  parameters: {
    "classification": "parquet",
    "parquet.compression": "SNAPPY"
  },
  storageDescriptor: {
    location: `s3://${bucket}/events/`,
    // The input/output formats and SerDe tell Athena how to read the Parquet files
    inputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    outputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    serDeInfo: {
      serializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    },
    columns: [
      { name: "event_type", type: "string" },
      { name: "page", type: "string" },
      { name: "timestamp", type: "bigint" },
      { name: "link", type: "string" },
      // ... other fields
    ]
  },
  partitionKeys: [
    { name: "year", type: "string" },
    { name: "month", type: "string" },
    { name: "day", type: "string" }
  ]
});

A crucial point is partitionKeys. By partitioning by date, queries that specify a date range only scan files in the relevant partitions rather than the entire dataset—the difference in cost and performance can be orders of magnitude.

3.5 Athena Query Practices

Athena is a serverless query engine based on Presto/Trino that charges by the amount of data scanned. Combined with Parquet format and proper partitioning, you can achieve very low query costs.

Some practical tips:

1. Handling New Partitions

Firehose automatically creates new directories by date, but Glue tables don’t automatically recognize new partitions. Two approaches:

-- Option 1: Manually add partition
ALTER TABLE events ADD PARTITION (year='2024', month='01', day='15')
LOCATION 's3://bucket/events/year=2024/month=01/day=15/';

-- Option 2: Auto-discover (simple but slow)
MSCK REPAIR TABLE events;

For production, I recommend a scheduled Lambda running option 1, or enabling Glue Crawler for auto-discovery.
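
A sketch of what that scheduled Lambda might look like, using the AWS SDK v3 Athena client. The database name, results bucket, table location, and the EventBridge-style daily schedule are placeholders; the real values depend on your setup:

import { AthenaClient, StartQueryExecutionCommand } from "@aws-sdk/client-athena";

const athena = new AthenaClient({});

// Runs once a day (e.g. on an EventBridge schedule) and registers tomorrow's partition in advance.
export const handler = async () => {
  const d = new Date();
  d.setUTCDate(d.getUTCDate() + 1);
  const year = String(d.getUTCFullYear());
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  const day = String(d.getUTCDate()).padStart(2, "0");

  const query =
    `ALTER TABLE events ADD IF NOT EXISTS PARTITION (year='${year}', month='${month}', day='${day}') ` +
    `LOCATION 's3://my-analytics-bucket/events/year=${year}/month=${month}/day=${day}/'`;

  await athena.send(new StartQueryExecutionCommand({
    QueryString: query,
    QueryExecutionContext: { Database: "analytics" },                    // placeholder database
    ResultConfiguration: { OutputLocation: "s3://my-athena-results/" },  // placeholder bucket
  }));
};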

2. Always Specify Partitions in Queries

-- Good: Only scans one day of data
SELECT COUNT(*) FROM events
WHERE year = '2024' AND month = '01' AND day = '15';

-- Bad: Full table scan, expensive and slow
SELECT COUNT(*) FROM events;

3. Leverage Parquet’s Columnar Storage

-- Good: Only reads required columns
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;

-- Bad: Reads all columns
SELECT * FROM events LIMIT 100;

3.6 Cost Optimization

As data grows, storage costs become significant. I configured S3 lifecycle rules:

new aws.s3.BucketLifecycleConfiguration("lifecycle", {
  bucket: analyticsBucket.id,
  rules: [{
    status: "Enabled",
    transitions: [
      { days: 90, storageClass: "STANDARD_IA" },     // Move to Infrequent Access after 90 days
      { days: 365, storageClass: "GLACIER" }          // Archive after 1 year
    ]
  }]
});

3.7 Why AWS Ecosystem (and Not ClickHouse)

A natural question is: why not use a dedicated analytics database like ClickHouse?

ClickHouse is an excellent columnar database with blazing-fast query performance. It’s particularly well-suited for real-time analytics dashboards where sub-second query latency matters. If I needed to build an interactive analytics UI with instant filtering and aggregation, ClickHouse would be a strong contender.

However, for this specific use case, the AWS S3 + Athena approach made more sense:

Trade-offs in favor of S3 + Athena:

  1. Batch processing is sufficient: We only need to run aggregation queries once daily to sync summary data back to our main database. We don’t need real-time dashboards or sub-second query latency.

  2. Minimal operational overhead: S3 and Athena are fully managed. No servers to provision, no clusters to scale, no backups to configure. ClickHouse (self-hosted or ClickHouse Cloud) requires more operational attention.

  3. Cost efficiency: For infrequent queries on large datasets, Athena’s pay-per-query model is extremely cost-effective. You only pay when you query ($5 per TB scanned), and with Parquet + partitioning, actual scan costs are minimal.

  4. Data durability: S3 provides 99.999999999% (11 nines) durability. Data is automatically replicated across availability zones.

  5. Ecosystem integration: The data in S3 can be queried by Athena today, but also by Spark, Presto, DuckDB, or even loaded into ClickHouse later. We’re not locked into a specific query engine.

When to consider ClickHouse instead:

  • Real-time analytics dashboards requiring sub-second queries
  • High query frequency (hundreds or thousands of queries per day)
  • Complex analytical queries that benefit from ClickHouse’s optimizations
  • Need for materialized views or real-time aggregations

The beauty of storing raw data in Parquet on S3 is that we can always add ClickHouse later if requirements change—just point it at the same data.

4. Lessons Learned

4.1 Firehose Parquet Conversion Requires Complete Schema

Initially, I tried to cut corners by defining only a few core fields. I discovered that when Firehose converts to Parquet, any JSON fields not in the schema are silently dropped.

Lesson: Either define all possible fields in the Glue table, or strictly filter payloads in Lambda.

4.2 Partition Fields Cannot Appear in Columns

In Glue table definitions, fields in partitionKeys cannot also appear in columns, or you’ll get an error. Partition field values come from the S3 path (e.g., year=2024), not the Parquet file contents.

4.3 IntersectionObserver Safari Compatibility

Safari’s IntersectionObserver support came relatively late, and some iOS versions have bugs. Implement proper feature detection and fall back to scroll listeners when necessary.
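
A sketch of that fallback, assuming a hypothetical throttle helper; note that it only approximates the check and skips the intersection-ratio logic:

if ('IntersectionObserver' in window) {
  observer.observe(linkElement);
} else {
  // Fallback: coarse visibility check on throttled scroll events.
  window.addEventListener('scroll', throttle(() => {
    const rect = linkElement.getBoundingClientRect();
    const visible =
      rect.top < window.innerHeight && rect.bottom > 0 &&
      rect.left < window.innerWidth && rect.right > 0;
    if (visible) startTimer(linkElement);
    else clearTimer(linkElement);
  }, 200), { passive: true });
}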

5. Conclusion

This system just went live, and I’ll continue to update this post as we gather more production data and insights. Early results are promising—the architecture handles traffic well and costs are minimal.

Key takeaways:

  1. Decouple analytics from your main database: High-volume event ingestion doesn’t belong in your transactional database. A separate pipeline protects your core application.
  2. Columnar storage matters: For analytics workloads, Parquet’s columnar format dramatically reduces I/O and storage costs compared to row-oriented formats.
  3. IntersectionObserver is the right approach for detecting element visibility, far outperforming manual position calculations.
  4. Kinesis Firehose handles data buffering and format conversion, which is much simpler than building it yourself.
  5. Glue Data Catalog bridges storage and querying; define schemas carefully.
  6. Athena charges by scan volume; partitioning and columnar storage are key to cost control.
  7. Choose the right tool for your access pattern: Real-time dashboards might need ClickHouse, but batch analytics works great with S3 + Athena at a fraction of the cost

The serverless, pay-per-use model means we only pay for what we use—no idle server costs. And by storing raw data in Parquet on S3, we retain flexibility to adopt different query engines as needs evolve.

I’ll update this post with production metrics and lessons learned as the system matures. If you’re building something similar, I’d love to hear about your experience.