Building an Impression Tracking System with AWS
TL;DR
- Main database can’t handle high-frequency writes, need a separate pipeline
- Parquet columnar storage + S3 = cheap and query-friendly
- Firehose auto-buffers, no need to batch files yourself
Why Not Just Write to the Main DB?
Recently needed to track link impressions across partner websites. First instinct: the usual route—Client → API → Sidekiq → PostgreSQL.
But this approach has problems:
Database pressure: Impression events can be massive—millions of writes per day. Even async processing via Sidekiq would compete with business transactions.
Queue saturation: Traffic spike in impressions floods the queue, and suddenly email delivery and payment processing are stuck waiting.
Slow analytics: PostgreSQL is row-oriented. Running GROUP BY on millions of records? Painfully slow.
So I built a dedicated analytics pipeline instead.
Business Context
Publishers use our content in two ways:
- Inline links: Embed our links directly in their articles
- Widgets: Embed our component via iframe, containing multiple links
To track both scenarios, the JavaScript SDK deploys in two places:
- Publisher’s page: track inline links + monitor iframe visibility
- Inside the iframe: track links within the widget
The two SDK instances coordinate—parent page tells the iframe when users can actually see it.
SDK Design
What Counts as “Actually Seen”
Does a link appearing on the page count as an impression? What if users scroll past it in a flash?
Two key parameters:
- Visibility threshold: What percentage of the element needs to be visible?
- Duration threshold: How long does it need to stay visible?
Set the visibility threshold too high and you miss valid impressions; set the duration too short and you count phantom impressions from fast scrolling. Tune both based on your business needs.
IntersectionObserver
For visibility detection, don’t use scroll events and manual position calculation—bad performance, complex logic.
Modern browsers have IntersectionObserver, efficient and hassle-free:
const observer = new IntersectionObserver((entries) => {
  entries.forEach(entry => {
    if (entry.isIntersecting && entry.intersectionRatio >= VISIBILITY_THRESHOLD) {
      startTimer(entry.target);   // visible enough: start counting dwell time
    } else {
      clearTimer(entry.target);   // dropped below the threshold: reset
    }
  });
}, {
  // Fire the callback when visibility crosses 0% or the configured ratio.
  threshold: [0, VISIBILITY_THRESHOLD]
});

observer.observe(linkElement);
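The startTimer/clearTimer helpers enforce the duration threshold. A minimal sketch of what they might look like (DURATION_THRESHOLD, recordImpression, and the count-once behavior are assumptions for illustration):

const DURATION_THRESHOLD = 1000; // ms the element must stay visible; tune per product
const timers = new Map();

function startTimer(element) {
  if (timers.has(element)) return; // a timer is already running for this element
  const id = setTimeout(() => {
    timers.delete(element);
    recordImpression(element);     // enqueue the event for batched sending
    observer.unobserve(element);   // count each link at most once per page view
  }, DURATION_THRESHOLD);
  timers.set(element, id);
}

function clearTimer(element) {
  const id = timers.get(element);
  if (id !== undefined) {
    clearTimeout(id);
    timers.delete(element);
  }
}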
Cross-iframe Communication
Links inside iframes have a problem: visible in the iframe viewport doesn’t mean users can see them—the iframe itself might be scrolled out of the parent page.
Solution: parent and child SDKs talk to each other:
// Parent page SDK
const iframeObserver = new IntersectionObserver((entries) => {
  entries.forEach(entry => {
    iframe.contentWindow.postMessage({
      type: 'visibility-update',
      isVisible: entry.isIntersecting,
      visibleRegion: entry.intersectionRect
    }, '*');
  });
}, { threshold: [0, 0.1, 0.5, 1.0] });
iframeObserver.observe(iframe);

// iframe SDK
window.addEventListener('message', (e) => {
  if (e.data.type === 'visibility-update') {
    parentVisibility = e.data;
    // Only truly visible if: iframe visible AND link visible within iframe
  }
});
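Inside the iframe, the parent's report gets combined with the widget's own observer before any timer starts. A rough sketch (widgetObserver, isTrulyVisible, and the timer helpers are assumed names; parentVisibility is the state set above):

// A link in the widget only counts as visible when the iframe is visible on the
// parent page AND the link itself is visible inside the iframe's viewport.
function isTrulyVisible(entry) {
  return parentVisibility && parentVisibility.isVisible &&
         entry.isIntersecting &&
         entry.intersectionRatio >= VISIBILITY_THRESHOLD;
}

const widgetObserver = new IntersectionObserver((entries) => {
  entries.forEach(entry => {
    if (isTrulyVisible(entry)) {
      startTimer(entry.target);
    } else {
      clearTimer(entry.target);
    }
  });
}, { threshold: [0, VISIBILITY_THRESHOLD] });

One detail worth handling: when a visibility-update message reports the iframe as hidden, clear any running timers too, otherwise half-seen links can still get counted.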
Transmission Strategy
Frequent requests waste bandwidth, but batch too long and you lose data when users leave. My approach:
- Batching: Send when 20 events accumulated or 2 seconds passed
- Max wait: First event in queue must be sent within 5 seconds
- sendBeacon on page close: Guarantees data delivery
window.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    // text/plain keeps this a "simple" request, so the beacon avoids a CORS preflight
    const blob = new Blob([JSON.stringify(events)], { type: 'text/plain' });
    navigator.sendBeacon(apiUrl, blob);
  }
});
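A minimal sketch of those batching rules (the 20-event, 2-second, and 5-second numbers are the ones listed above; enqueue, flush, buildBatchPayload, and apiUrl are assumed names):

const MAX_BATCH_SIZE = 20;    // send once 20 events have accumulated...
const BATCH_DELAY_MS = 2000;  // ...or 2 seconds after the most recent event
const MAX_WAIT_MS = 5000;     // ...but never let the oldest event wait more than 5 seconds

let events = [];
let batchTimer = null;
let firstEventAt = null;

function enqueue(event) {
  events.push(event);
  if (firstEventAt === null) firstEventAt = Date.now();

  if (events.length >= MAX_BATCH_SIZE || Date.now() - firstEventAt >= MAX_WAIT_MS) {
    flush();
    return;
  }
  // Debounce: restart the 2-second timer on every new event.
  clearTimeout(batchTimer);
  batchTimer = setTimeout(flush, BATCH_DELAY_MS);
}

function flush() {
  if (events.length === 0) return;
  clearTimeout(batchTimer);
  const batch = events;
  events = [];
  firstEventAt = null;
  // buildBatchPayload wraps events with batch-level shared fields (see Payload Compression below).
  fetch(apiUrl, { method: 'POST', body: JSON.stringify(buildBatchPayload(batch)), keepalive: true });
}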
Payload Compression
Abbreviate field names to save bandwidth:
// Full format
{ event_type: 'link_impression', timestamp: 1701234567890, link: '...', label: '...' }
// Compressed
{ et: 'link_impression', ts: 1701234567890, l: '...', lb: '...' }
Shared fields like page URL and referrer get hoisted to batch level.
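Put together, the buildBatchPayload helper referenced above might produce something like this (the batch-level keys p, r, and sv are illustrative, not a fixed format):

// Shared context appears once at the top level; per-event fields stay abbreviated.
function buildBatchPayload(batchEvents) {
  return {
    p: window.location.href,   // page URL, shared by every event in the batch
    r: document.referrer,      // referrer, also hoisted to batch level
    sv: '1.4.0',               // SDK version (hypothetical field)
    e: batchEvents             // e.g. [{ et: 'link_impression', ts: 1701234567890, l: '...', lb: '...' }]
  };
}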
Backend Architecture
Why Columnar Storage
PostgreSQL is row-oriented—querying SELECT page, COUNT(*) FROM events GROUP BY page reads entire rows even when you only need one column. For a 20-column, 10-million-row table, that’s hugely wasteful.
Columnar storage stores by column. Same query only reads the page column—potentially 20x less I/O. Compression is better too: similar values within a column compress more efficiently.
Parquet is an open-source columnar file format:
- Only reads needed columns
- SNAPPY compression typically 3-5x
- Predicate pushdown skips non-matching data blocks
- Self-describing with embedded schema
Parquet on S3 costs roughly $0.023/GB/month for Standard—way cheaper than database storage.
Overall Architecture
Client SDK → CloudFront → API Gateway → Lambda → Kinesis Firehose → S3 (Parquet)
                                                                         ↓
                                                                       Athena
Serverless, pay-per-use, auto-scales during traffic spikes.
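The Lambda in the middle does little more than validate the batch and forward events to Firehose. A minimal sketch with AWS SDK v3 (the stream-name environment variable and the batch-level p/r keys are assumptions):

import { FirehoseClient, PutRecordBatchCommand } from "@aws-sdk/client-firehose";

const firehose = new FirehoseClient({});
const STREAM = process.env.FIREHOSE_STREAM_NAME; // e.g. "events-firehose"

export const handler = async (event) => {
  const batch = JSON.parse(event.body); // { p, r, e: [...] } from the SDK

  // Expand batch-level fields back onto each event so every Parquet row is self-contained.
  const records = batch.e.map((e) => ({
    Data: Buffer.from(JSON.stringify({
      event_type: e.et,
      timestamp: e.ts,
      link: e.l,
      label: e.lb,
      page: batch.p,
      referrer: batch.r,
    }) + "\n"),
  }));

  // PutRecordBatch accepts up to 500 records per call; SDK batches stay well under that.
  await firehose.send(new PutRecordBatchCommand({
    DeliveryStreamName: STREAM,
    Records: records,
  }));

  return { statusCode: 204 };
};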
Why Firehose
Initially considered having Lambda write directly to S3. Problems became obvious fast:
- One file per request = massive small files = query performance tanks
- Implementing buffering yourself is complex and error-prone
Firehose solves this:
- Auto-buffers, flushes by size (e.g., 128MB) or time (e.g., 15 minutes)
- Built-in Parquet conversion
- Auto-partitions by time when writing to S3
How Athena queries work: List all files in partition → read Parquet metadata → predicate pushdown to skip irrelevant row groups → read only needed columns.
Each file carries fixed per-file overhead. With thousands of tiny files (in the 100KB range), that overhead dominates query time; with fewer, larger files (around 128MB), it becomes negligible.
Buffer configuration:
Firehose flushes when hitting size threshold or time threshold—whichever comes first.
- Size threshold: Larger files query faster, but above roughly 256MB they might slow down (Athena can't parallelize effectively within a single file)
- Time threshold: Ensures data doesn’t sit in buffer too long during low traffic
Low-traffic periods produce files smaller than the size threshold because the time threshold triggers first (at roughly 10 KB/s of incoming data, for example, 15 minutes of buffering yields files of only about 9 MB). A few small files like that don't significantly impact query performance.
UTC partition gotcha:
Firehose partitions by UTC only: the !{timestamp:yyyy} expressions are evaluated against each record's arrival (processing) time in UTC, not against timestamps inside your data.
If your business runs in UTC+8, querying a “January 15th business day” might need to scan both the January 14th and 15th UTC partitions. Not a big deal in practice: Athena charges by data scanned, and the extra partition adds minimal cost.
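If you want to be exact about it, a small helper can expand a local business day into the UTC partition values to include in the WHERE clause (a sketch assuming a fixed UTC+8 offset):

// Given a business day in UTC+8, return the UTC dates whose partitions can contain its events.
function utcPartitionsForLocalDay(localDay, offsetHours = 8) {
  const offset = String(offsetHours).padStart(2, '0');
  const start = new Date(`${localDay}T00:00:00+${offset}:00`);
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000 - 1);
  const toParts = (d) => ({
    year: String(d.getUTCFullYear()),
    month: String(d.getUTCMonth() + 1).padStart(2, '0'),
    day: String(d.getUTCDate()).padStart(2, '0'),
  });
  const a = toParts(start);
  const b = toParts(end);
  return a.day === b.day && a.month === b.month ? [a] : [a, b];
}

// utcPartitionsForLocalDay('2024-01-15')
//   → [{ year: '2024', month: '01', day: '14' }, { year: '2024', month: '01', day: '15' }]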
Configuration example (Pulumi):
const firehose = new aws.kinesis.FirehoseDeliveryStream("events-firehose", {
  destination: "extended_s3",
  extendedS3Configuration: {
    bucketArn: analyticsBucket.arn,
    roleArn: firehoseRole.arn,
    prefix: "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    bufferingSize: 128,      // flush at 128 MB...
    bufferingInterval: 900,  // ...or after 15 minutes, whichever comes first
    dataFormatConversionConfiguration: {
      enabled: true,
      inputFormatConfiguration: {
        deserializer: { openXJsonSerDe: {} }   // incoming records are JSON
      },
      outputFormatConfiguration: {
        serializer: {
          parquetSerDe: { compression: "SNAPPY" }
        }
      },
      schemaConfiguration: {
        databaseName: glueDatabase.name,
        tableName: glueTable.name,
        roleArn: firehoseRole.arn
      }
    }
  }
});
Glue Data Catalog
Many think Glue is just an ETL tool. The Data Catalog is actually the highlight—a metadata management service. Define the schema once, and Firehose, Athena, and Spark all use it.
const glueTable = new aws.glue.CatalogTable("events-table", {
  databaseName: glueDatabase.name,
  tableType: "EXTERNAL_TABLE",
  parameters: {
    "classification": "parquet",
    "parquet.compression": "SNAPPY"
  },
  storageDescriptor: {
    location: `s3://${bucket}/events/`,
    // Parquet input/output formats and SerDe so Athena can read the files Firehose writes
    inputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    outputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    serDeInfo: {
      serializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    },
    columns: [
      { name: "event_type", type: "string" },
      { name: "page", type: "string" },
      { name: "timestamp", type: "bigint" },
      { name: "link", type: "string" },
      // ...
    ]
  },
  partitionKeys: [
    { name: "year", type: "string" },
    { name: "month", type: "string" },
    { name: "day", type: "string" }
  ]
});
partitionKeys is crucial. Partition by date, and queries with date ranges only scan relevant partitions—cost and performance difference is orders of magnitude.
Athena Query Tips
Athena is serverless Presto, charges by data scanned.
Handling new partitions:
Firehose auto-creates new directories, but Glue table doesn’t auto-detect them. Two approaches:
-- Manual add
ALTER TABLE events ADD PARTITION (year='2024', month='01', day='15')
LOCATION 's3://bucket/events/year=2024/month=01/day=15/';
-- Auto-discover (simple but slow)
MSCK REPAIR TABLE events;
In production: use a scheduled Lambda for the manual add, or enable a Glue Crawler.
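A sketch of the scheduled-Lambda approach: run the ALTER TABLE statement through the Athena SDK once a day, registering the next day's partition ahead of time (the database name, bucket names, and the EventBridge schedule that triggers it are assumptions):

import { AthenaClient, StartQueryExecutionCommand } from "@aws-sdk/client-athena";

const athena = new AthenaClient({});

export const handler = async () => {
  const d = new Date(Date.now() + 24 * 60 * 60 * 1000); // tomorrow, UTC
  const year = String(d.getUTCFullYear());
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  const day = String(d.getUTCDate()).padStart(2, "0");

  await athena.send(new StartQueryExecutionCommand({
    QueryString: `
      ALTER TABLE events ADD IF NOT EXISTS
      PARTITION (year='${year}', month='${month}', day='${day}')
      LOCATION 's3://my-analytics-bucket/events/year=${year}/month=${month}/day=${day}/'`,
    QueryExecutionContext: { Database: "analytics" },                   // assumed Glue database
    ResultConfiguration: { OutputLocation: "s3://my-athena-results/" }, // assumed results bucket
  }));
};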
Always specify partitions in queries:
-- Good: scans one day
SELECT COUNT(*) FROM events
WHERE year = '2024' AND month = '01' AND day = '15';
-- Bad: full table scan, expensive and slow
SELECT COUNT(*) FROM events;
Only query needed columns:
-- Good
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;
-- Bad
SELECT * FROM events LIMIT 100;
Cost Optimization
Configured S3 lifecycle to auto-transition old data to cold storage:
new aws.s3.BucketLifecycleConfigurationV2("lifecycle", {
  bucket: analyticsBucket.id,
  rules: [{
    id: "tier-old-events",
    status: "Enabled",
    filter: {},   // apply to all objects in the bucket
    transitions: [
      { days: 90, storageClass: "STANDARD_IA" },  // move to IA after 90 days
      { days: 365, storageClass: "GLACIER" }      // archive after 1 year
    ]
  }]
});
Why Not ClickHouse
ClickHouse has fast queries, great for real-time dashboards. But for our use case:
- Batch processing is enough: Run aggregation once daily and sync the results back to the main DB (see the sketch after this list). Don't need sub-second latency
- Don’t want to operate it: S3 + Athena is fully managed, no servers, scaling, or backups to worry about
- Cost effective: Infrequent queries on large datasets, pay-per-query makes sense ($5 per TB scanned)
- No lock-in: Data sits on S3, use Athena today, switch to Spark, DuckDB, or even ClickHouse tomorrow
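For reference, a rough sketch of that daily job: run the aggregation in Athena, poll until it finishes, then upsert the results into the main database (the query, the daily_impressions table, and the hard-coded date are all assumptions; pagination of results is omitted):

import { AthenaClient, StartQueryExecutionCommand,
         GetQueryExecutionCommand, GetQueryResultsCommand } from "@aws-sdk/client-athena";
import { Client } from "pg";

const athena = new AthenaClient({});

async function runAthenaQuery(sql) {
  const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
    QueryString: sql,
    QueryExecutionContext: { Database: "analytics" },
    ResultConfiguration: { OutputLocation: "s3://my-athena-results/" },
  }));

  // Poll until the query succeeds or fails.
  for (;;) {
    const { QueryExecution } = await athena.send(new GetQueryExecutionCommand({ QueryExecutionId }));
    const state = QueryExecution.Status.State;
    if (state === "SUCCEEDED") break;
    if (state === "FAILED" || state === "CANCELLED") throw new Error(`Athena query ${state}`);
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }

  const { ResultSet } = await athena.send(new GetQueryResultsCommand({ QueryExecutionId }));
  return ResultSet.Rows.slice(1); // first row is the header
}

export const handler = async () => {
  // A real job would compute yesterday's date; fixed here for readability.
  const rows = await runAthenaQuery(`
    SELECT link, COUNT(*) AS impressions FROM events
    WHERE year = '2024' AND month = '01' AND day = '15'
    GROUP BY link`);

  const pg = new Client({ connectionString: process.env.DATABASE_URL });
  await pg.connect();
  try {
    for (const row of rows) {
      const [link, impressions] = row.Data.map((d) => d.VarCharValue);
      await pg.query(
        `INSERT INTO daily_impressions (day, link, impressions)
         VALUES ($1, $2, $3)
         ON CONFLICT (day, link) DO UPDATE SET impressions = EXCLUDED.impressions`,
        ['2024-01-15', link, Number(impressions)]);
    }
  } finally {
    await pg.end();
  }
};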
When to use ClickHouse:
- Real-time dashboards needing sub-second queries
- High query frequency (hundreds or thousands per day)
- Need materialized views or real-time aggregations
Raw data stored as Parquet on S3—can always add ClickHouse later.
Lessons Learned
Parquet Conversion Needs Complete Schema
Got lazy initially and only defined a few core fields. When Firehose converted records to Parquet, it silently dropped every field that wasn't in the schema.
Lesson: Either define all possible fields in Glue table, or strictly filter payloads in Lambda.
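If you go the filtering route, the Lambda can whitelist payload fields against the Glue schema before handing records to Firehose. A minimal sketch (ALLOWED_FIELDS mirrors the Glue columns; the names are assumptions):

// Keep only fields that exist in the Glue table; anything else would be silently
// dropped during Parquet conversion anyway, so drop it explicitly and visibly here.
const ALLOWED_FIELDS = ["event_type", "page", "timestamp", "link", "label"];

function toSchemaRecord(raw) {
  const record = {};
  for (const field of ALLOWED_FIELDS) {
    if (raw[field] !== undefined) record[field] = raw[field];
  }
  return record;
}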
Partition Fields Can’t Appear in Columns
In Glue table definitions, fields in partitionKeys can’t also appear in columns, or you get errors. Partition field values come from S3 path (e.g., year=2024), not Parquet file contents.
Safari’s IntersectionObserver
Safari’s IntersectionObserver support arrived late, and some iOS versions shipped buggy implementations. Implement feature detection and fall back to scroll listeners when needed.
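A minimal sketch of that fallback check (observeLinks and observeLinksWithScroll stand in for the two implementations):

// Prefer IntersectionObserver; fall back to a throttled scroll listener
// with getBoundingClientRect() on older Safari / iOS versions.
if ('IntersectionObserver' in window &&
    'IntersectionObserverEntry' in window &&
    'intersectionRatio' in window.IntersectionObserverEntry.prototype) {
  observeLinks();              // IntersectionObserver path shown earlier
} else {
  observeLinksWithScroll();    // scroll + getBoundingClientRect fallback
}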
Wrap Up
System just went live. Early results look good—architecture handles traffic, costs stay low.
Key takeaways:
- High-volume event ingestion doesn’t belong in main database—separate pipeline protects core app
- Analytics workloads benefit massively from columnar storage
- IntersectionObserver is the right way to detect visibility
- Firehose handles buffering and format conversion, way easier than DIY
- Glue Data Catalog bridges storage and query engines—define schemas carefully
- Athena charges by scan volume, partitioning and columnar storage are key to costs
- Real-time dashboards might need ClickHouse, batch analytics works fine with S3 + Athena
Serverless pay-per-use means no idle costs. Data in Parquet on S3 means flexibility to switch query engines later.
Will update with production metrics once it runs for a while.