Functional vs Non-Functional Requirements

Why did the Healthcare.gov launch in 2013 become one of the most expensive software failures in history? The website worked — users could fill out forms and submit applications. Every functional requirement was met. But the site crashed under load, response times exceeded 30 seconds, and security vulnerabilities leaked personal data.

[!CAUTION] Elite Insight: The Healthcare.gov Failure Patterns. The 2013 failure wasn’t just “too many users.” It was a failure of Distributed Coordination. The system had massive data silos across different government agencies, no centralized logging (SLIs), and no automated load testing (NFR verification). It was a “Monolithic Frontend” trying to talk to “Legacy Backends” without circuit breakers or timeout strategies.

[!IMPORTANT] In this lesson, you will master:

  1. The FR vs NFR Divide: Why “what the system does” is only half the battle — “how well it does it” determines success or failure.
  2. The SRE Trifecta (SLI → SLO → SLA): The 3-layer reliability contract used by Google, and the mnemonic “I.O.A” (Indicator → Objective → Agreement) to remember the hierarchy.
  3. Error Budgets: How top companies use math to decide when to ship features vs. when to fix reliability.

1. The “Golf Cart” Problem

You walk into a car dealership and ask for a car that “has 4 wheels, an engine, and drives forward.” The dealer hands you a golf cart. Technically, it meets your requirements. But you can’t drive it on the highway, it’s not safe in a crash, and it has zero storage.

In System Design:

  • Functional Requirements (FRs): “It has 4 wheels” (The Verb - What it does).
  • Non-Functional Requirements (NFRs): “It drives at 70mph” (The Adjective - How well it does it).

If you build a system that works (Features) but is slow, insecure, or fragile (NFRs), you have built a golf cart for a highway.


2. The Requirement Hierarchy

To design a world-class system, you must understand how high-level business goals trickle down into technical constraints.

Gold Standard: The Architecture Traceability Map

This diagram shows how a single “Business Need” creates a chain of requirements.

LEVEL 1: BUSINESS GOAL
"We need a Global Photo Sharing App (Instagram Clone)"

LEVEL 2: FUNCTIONAL REQUIREMENTS (Core Features)

  • "Users can upload photos"
  • "Users can follow others"

LEVEL 3: NON-FUNCTIONAL REQUIREMENTS (Quality Attributes)

  • Latency: Upload < 500ms
  • Reliability: 99.9% Durability
  • Scalability: 10k QPS (Writes)
  • Consistency: Eventual Consistency

💡 Decision Trace: Because we need Global Scale (Level 1), we choose Eventual Consistency (Level 3) to favor High Availability.
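The decomposition above can be sketched as plain data. This is an illustrative structure, not a real requirements format; the field names and the `trace` helper are made up for the example.

```python
# Illustrative sketch of the Architecture Traceability Map as plain data.
traceability_map = {
    "business_goal": "Global Photo Sharing App",
    "functional_requirements": [
        "Users can upload photos",
        "Users can follow others",
    ],
    "non_functional_requirements": {
        "latency": "upload completes in < 500 ms",
        "reliability": "99.9% durability",
        "scalability": "10k write QPS",
        "consistency": "eventual",
    },
}

def trace(nfr_key: str) -> str:
    """Trace a design decision back to the NFR that justifies it."""
    nfr = traceability_map["non_functional_requirements"][nfr_key]
    return f"Chosen because NFR '{nfr_key}' requires: {nfr}"

# e.g. justifying the choice of an eventually consistent datastore:
print(trace("consistency"))
```

The point of keeping requirements in one traceable structure is that every box you later draw can be justified by walking back up the map.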

3. Interactive: The Requirement Sorter Game

Can you distinguish between Functional (FR) and Non-Functional (NFR) requirements? Sort the requirements into the correct bucket!

[!TIP] Try it yourself: Click FR if the card describes a Feature, or NFR if it describes Quality/Performance.


4. Measuring Reliability: SLO vs SLA vs SLI

Before you can improve reliability, you must be able to measure it. Google’s SRE book defines the “Trifecta” of reliability as follows:

The SRE Trifecta (I.O.A)

Use the “I.O.A” mnemonic to remember the hierarchy of reliability.

1. SLI (Indicator)
The Thermometer

The raw, quantitative measurement of a specific behavior.

Example: "The latency of the /upload API is 150ms."

2. SLO (Objective)
The Thermostat

The target value or range for an SLI over a period of time.

Example: "99.9% of requests must be < 200ms over a 28-day window."

3. SLA (Agreement)
The Insurance Policy

The legal contract with customers detailing consequences if SLOs are missed.

Example: "If availability drops below 99.9%, we refund 10% of your bill."
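The first two layers can be seen in a few lines of code. This is a minimal sketch with made-up latency samples: the SLI is computed from raw measurements, then compared against the SLO target.

```python
# Illustrative latency samples for the /upload API, in milliseconds.
latencies_ms = [120, 150, 180, 95, 210, 160, 140, 3000, 130, 170]

SLO_THRESHOLD_MS = 200   # "requests must be < 200ms"
SLO_TARGET = 0.999       # "99.9% of requests"

# SLI (Indicator): the fraction of requests meeting the threshold.
good = sum(1 for latency in latencies_ms if latency < SLO_THRESHOLD_MS)
sli = good / len(latencies_ms)

# SLO (Objective): does the indicator meet the target?
slo_met = sli >= SLO_TARGET

print(f"SLI = {sli:.1%}, SLO met: {slo_met}")
```

The SLA is not code at all: it is the contractual consequence (refunds, credits) that kicks in when the SLO above is missed.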

A. Latency Percentiles: Beyond the Average

Senior engineers never use “Average Latency.” Averages hide the misery of your worst-performing users.

  • p50 (Median): 50% of users experience latency faster than this.
  • p90: 90% of users are faster. Only 1 in 10 is slower.
  • p99 (The Tail): The goal for Elite systems. 99% of users are faster.

[!IMPORTANT] Why p99 matters: If your photo app has a p99 latency of 2 seconds, it means 1 in every 100 users (which is 1 million users if you have 100M total) is having a terrible, sluggish experience. This is where users churn.
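A small sketch makes the gap concrete. With illustrative numbers (98 fast requests and 2 slow outliers), the average looks healthy while the p99 exposes the tail:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 98 fast requests plus 2 slow outliers (illustrative numbers).
latencies_ms = [100] * 98 + [4000, 5000]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"average = {avg:.0f} ms")                   # looks healthy
print(f"p50 = {percentile(latencies_ms, 50)} ms")  # median user is fine
print(f"p99 = {percentile(latencies_ms, 99)} ms")  # the tail is miserable
```

An average of 188 ms would pass most dashboards, yet 1 in 50 users here waits 4+ seconds.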

B. Error Budgets: The License to Fail

If your SLO is 99.9%, you have an Error Budget of 0.1%.

  • Budget > 0: You can ship new features, take risks, and deploy code.
  • Budget = 0: You STOP all feature development. The entire engineering team shifts to reliability, bug fixes, and infrastructure stabilization until the budget is restored.
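The budget math is simple enough to sketch. The request counts and failure numbers below are illustrative:

```python
# Error-budget check that gates feature deploys (illustrative numbers).
SLO = 0.999                  # 99.9% success target
total_requests = 1_000_000   # requests this window
failed_requests = 800        # failures observed so far

# The budget is the number of failures the SLO permits.
error_budget = (1 - SLO) * total_requests    # 1,000 allowed failures
budget_remaining = error_budget - failed_requests

if budget_remaining > 0:
    print(f"{budget_remaining:.0f} failures left in budget -> ship features")
else:
    print("Budget exhausted -> freeze features, fix reliability")
```

With 800 of 1,000 allowed failures already burned, you can still ship, but risky deploys should slow down before the freeze hits.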

Critical Distinction: Availability vs. Reliability

Many candidates confuse these terms.

| Metric | Definition | Example |
| --- | --- | --- |
| Availability | Is the system reachable? (Uptime) | “The site loads.” |
| Reliability | Does the system work correctly? (Success Rate) | “The site loads AND the payment processes correctly.” |

[!TIP] A system can be Available (returns HTTP 500 Errors instantly) but not Reliable (all requests fail).
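A sketch of the same response log measured both ways makes the distinction visible. The status codes are illustrative, and `None` stands for a request that never got a response at all:

```python
# One response log, two metrics (illustrative data).
responses = [200, 200, 500, 500, 500, 200, None, 500, 500, 200]

# Availability: did the system answer at all?
answered = [s for s in responses if s is not None]
availability = len(answered) / len(responses)

# Reliability: did it answer *correctly*?
successes = [s for s in answered if s < 500]
reliability = len(successes) / len(responses)

print(f"availability = {availability:.0%}")  # reachable almost every time
print(f"reliability  = {reliability:.0%}")   # but mostly serving errors
```

Here the system is 90% available yet only 40% reliable: exactly the “returns HTTP 500 instantly” trap.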

Interactive: The Error Budget Calculator

If your SLO is 99.9%, how many minutes of downtime can you afford before you must stop shipping features?

[!TIP] Try it yourself: Select a different Availability Target to see how little downtime you can actually afford per year.

Daily Budget
1m 26s
Weekly Budget
10m 5s
Yearly Budget
8h 45m 36s
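The calculator’s math is a one-liner: allowed downtime is (1 − target) × period. A minimal sketch:

```python
# Convert an availability target into an allowed-downtime budget.
def downtime_budget(target: float, period_seconds: int) -> float:
    """Seconds of downtime allowed per period at the given target."""
    return (1 - target) * period_seconds

DAY, WEEK, YEAR = 86_400, 604_800, 31_536_000  # seconds (365-day year)

for name, period in [("daily", DAY), ("weekly", WEEK), ("yearly", YEAR)]:
    minutes = downtime_budget(0.999, period) / 60
    print(f"{name}: {minutes:.1f} minutes of downtime allowed")
```

At 99.9%, a whole year buys you under nine hours of downtime; at 99.99% (“four nines”), it drops to under an hour.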

5. The Interview Gauntlet

  1. “Is ‘System must be written in Go’ a Functional Requirement?”
    • Ans: Neither. It is a technical constraint. FRs describe user features, NFRs describe system qualities (speed, uptime). The language choice is an implementation detail, not a requirement.
  2. “Why is the SLO always stricter than the SLA?”
    • Ans: The SLO is your internal target. You set it stricter (e.g., 99.95% internally against a 99.9% SLA) so you can detect issues and fix them before you break the legal contract (SLA) with your customers.
  3. “What happens to feature development when the Error Budget hits 0?”
    • Ans: All feature development MUST halt. The entire engineering team shifts focus to reliability, bug fixes, and infrastructure stabilization until the budget recovers.

6. Summary

  • FRs are Verbs (“User can upload a photo”). NFRs are Adjectives (“Upload completes in under 500ms”). Remember: “Verbs vs Adjectives.”
  • SLI → SLO → SLA — use the mnemonic “I.O.A”: the Indicator is the raw measurement, the Objective is your internal target, and the Agreement is the legal contract with your customers.
  • Availability ≠ Reliability: A system that returns HTTP 500 instantly is “available” but not “reliable.”
  • Every system starts with Business Goals (Level 1), which decompose into Features (Level 2), which are constrained by Quality Attributes (Level 3). This is the Architecture Traceability Map.

Staff Engineer Tip: In system design interviews, always state your NFRs before drawing boxes. Say: “Before I design any components, let me define our target: 99.9% availability, p99 latency under 200ms, and eventual consistency for reads.” This signals senior thinking and gives you a framework to justify every subsequent design choice. If the interviewer challenges a decision, you can trace it back to the NFR: “I chose Cassandra over PostgreSQL because our NFR requires horizontal scalability at the cost of strong consistency.”