Cloud Computing Fundamentals
1. What is cloud computing?
Cloud computing is the delivery of computing resources — servers, storage, databases, networking, software, and analytics — over the internet on a pay-as-you-go basis. Instead of owning and maintaining physical data centres, organisations rent access to these resources from cloud providers like AWS, Microsoft Azure, and Google Cloud. Cloud computing enables faster innovation, flexible resource scaling, economies of scale, and a shift from capital expenditure (CapEx) to operational expenditure (OpEx).
2. What are the three main cloud service models?
IaaS (Infrastructure as a Service) provides virtualised computing resources — virtual machines, storage, networking — giving maximum control. The provider manages physical infrastructure; the user manages OS, middleware, and applications. Examples: AWS EC2, Azure Virtual Machines. PaaS (Platform as a Service) provides a development and deployment environment without managing infrastructure. Examples: AWS Elastic Beanstalk, Google App Engine. SaaS (Software as a Service) delivers complete, ready-to-use applications over the internet. Examples: Gmail, Salesforce, Microsoft 365.
3. What are the cloud deployment models?
Public cloud: resources are owned and operated by a third-party provider (AWS, Azure, GCP) and shared across multiple customers. Lowest cost, most scalable. Private cloud: resources are dedicated to a single organisation, either on-premises or hosted by a provider. Greater control and security, higher cost. Hybrid cloud: combines public and private clouds, allowing data and applications to move between them. Common for keeping sensitive data on-premises while leveraging public cloud scalability. Multi-cloud: using services from multiple providers to avoid vendor lock-in or leverage best-of-breed services.
4. What are the key benefits of cloud computing?
Scalability (scale up or down on demand without hardware procurement), elasticity (automatically adjust resources to match workload), cost efficiency (pay only for what you use), high availability and disaster recovery (built-in redundancy across data centres), global reach (deploy worldwide in minutes), speed and agility (provision resources in seconds vs. weeks for physical hardware), managed services (provider handles patching, maintenance, backups), and security (major providers invest heavily in physical and cyber security, often exceeding what organisations can achieve independently).
5. What is the difference between vertical and horizontal scaling?
Vertical scaling (scaling up) increases the capacity of a single resource, for example by adding more CPU or RAM to an existing server. It is simple but is bounded by the largest available machine size and typically requires downtime. Horizontal scaling (scaling out) adds more instances of a resource, such as more servers in a pool. It is more resilient (the loss of one instance does not take down the service) and has a much higher practical ceiling, but it requires load balancing and works best with a stateless application design. Cloud-native applications are designed for horizontal scaling. AWS Auto Scaling groups and Azure scale sets automate horizontal scaling based on metrics like CPU utilisation.
6. What is a region and an availability zone?
A cloud region is a geographic location (e.g., US East, Europe West) containing multiple isolated data centre facilities. Deploying in multiple regions reduces latency for global users and satisfies data residency requirements. An Availability Zone (AZ) is one or more physically separate data centres within a region, each with independent power, cooling, and networking. AZs within a region are connected via low-latency links. Deploying across multiple AZs ensures high availability — if one AZ fails, the application remains available in others. Best practice: deploy critical workloads across at least two AZs.
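The benefit of spreading a workload across AZs can be sketched with simple probability. This is a rough illustration only: it assumes AZ failures are independent and uses a hypothetical per-AZ availability of 99%, which is not a published figure for any provider.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one zone is up, assuming independent failures."""
    return 1 - (1 - per_zone) ** zones


# Hypothetical per-AZ availability; real figures vary by provider and service.
PER_ZONE = 0.99

for zones in (1, 2, 3):
    print(zones, f"{combined_availability(PER_ZONE, zones):.6f}")
```

Under these assumptions, moving from one AZ to two cuts the expected unavailability by a factor of one hundred, which is why two AZs is the usual minimum for critical workloads.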
7. What is the shared responsibility model?
The shared responsibility model divides security responsibilities between the cloud provider and the customer. The provider is responsible for security "of" the cloud: physical infrastructure, hypervisor, networking, and managed services. The customer is responsible for security "in" the cloud: OS patching (for IaaS), application code, data encryption, identity and access management, and network configuration. The division shifts with service model — in SaaS, the provider handles almost everything; in IaaS, the customer manages more. Understanding this model is fundamental to cloud security and compliance.
8. What is a Content Delivery Network (CDN)?
A CDN is a geographically distributed network of edge servers that cache and deliver content (web pages, images, videos) from locations close to end users, reducing latency. When a user requests content, the CDN routes the request to the nearest edge server, serving cached content without hitting the origin server. CDNs improve performance globally, reduce bandwidth costs, and provide DDoS protection. AWS CloudFront, Azure CDN, and Cloudflare are popular CDNs. They are essential for static asset delivery, video streaming, and any application with a global user base.
9. What is a load balancer?
A load balancer distributes incoming traffic across multiple servers or instances so no single instance is overwhelmed, improving availability and fault tolerance. A Layer 4 load balancer (TCP/UDP) makes routing decisions based on network information (IP, port) without inspecting content — it is fast. A Layer 7 load balancer (HTTP/HTTPS) inspects application content (URL path, headers) to route requests intelligently, enabling path-based routing, SSL termination, and sticky sessions. AWS ALB (Application Load Balancer) and NLB (Network Load Balancer) and Azure Load Balancer serve these roles.
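The core idea of distributing requests and routing around unhealthy instances can be sketched in a few lines. This is a toy round-robin balancer, not how ALB or NLB are implemented; the class and method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        # One full turn of the ring visits every backend once.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_backend() for _ in range(4)])
lb.mark_down("app-2")  # e.g. after a failed health check
print([lb.next_backend() for _ in range(3)])
```

A real load balancer discovers health through periodic checks rather than explicit `mark_down` calls, and usually supports other algorithms such as least connections.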
10. What is serverless computing?
Serverless computing allows developers to run code without provisioning or managing servers. The cloud provider automatically allocates resources, scales to zero when not in use (no idle compute costs), and scales out automatically as demand rises. AWS Lambda, Azure Functions, and Google Cloud Functions are the major serverless offerings. Functions are triggered by events (HTTP requests, file uploads, queue messages). Serverless is ideal for event-driven, intermittent workloads. Limitations include cold starts (extra latency when a new execution environment has to be initialised), execution time limits, and debugging complexity.
AWS Core Services
11. What is Amazon EC2?
Amazon EC2 (Elastic Compute Cloud) provides resizable virtual machines in the cloud. Instance types are categorised by use case: general purpose (t3, m5), compute optimised (c5), memory optimised (r5), and GPU (p3, g4). Instances run within a VPC, use security groups as firewalls, and attach EBS volumes for persistent storage. EC2 pricing models include On-Demand (pay for what you run with no commitment, billed per second or per hour depending on the operating system), Reserved Instances (1 or 3 year commitment with a significant discount), Spot Instances (spare capacity at up to 90% discount; AWS sets the Spot price and can reclaim the instance at short notice, and the older bidding model is no longer used), and Savings Plans (flexible, commitment-based discounts).
12. What is Amazon S3?
Amazon S3 (Simple Storage Service) is an object storage service providing virtually unlimited storage with 99.999999999% durability. Objects are stored in buckets and accessed via URL. S3 storage classes optimise cost vs. access frequency: S3 Standard (frequent access), Intelligent-Tiering (automatic cost optimisation), Standard-IA (infrequent access), and Glacier (archival). S3 is used for data lakes, static website hosting, backup and archival, ML training data storage, and application asset storage. Bucket policies, ACLs, and Block Public Access settings control who can access data.
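To give a feel for what eleven nines of durability means, the arithmetic below treats the figure as an annual per-object survival probability. That reading is a simplification, and the object count is a hypothetical example.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, to avoid floating-point rounding.
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the simplified model."""
    return object_count * (1 - ANNUAL_DURABILITY)


losses = expected_annual_losses(10_000_000)  # hypothetical bucket of ten million objects
print("expected losses per year:", losses)
print("years per single expected loss:", 1 / losses)
```

In this model a bucket with ten million objects would expect to lose one object roughly every ten thousand years.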
13. What is a VPC?
Amazon VPC (Virtual Private Cloud) is a logically isolated section of the AWS cloud where you launch AWS resources in a virtual network you define. You control IP address ranges, subnets, route tables, internet gateways, and NAT gateways. Public subnets have internet access via an internet gateway; private subnets do not. Security groups act as stateful virtual firewalls at the instance level. Network ACLs (NACLs) are stateless firewalls at the subnet level. VPCs are the foundational networking layer for AWS security architecture and must be designed carefully before deploying any production workloads.
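Planning the address ranges is usually the first step in VPC design. The sketch below uses Python's standard `ipaddress` module to carve a hypothetical 10.0.0.0/16 VPC range into /24 subnets; the CIDR and the public/private split are illustrative assumptions, and nothing here calls AWS.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")    # hypothetical VPC CIDR
subnets = list(vpc.subnets(new_prefix=24))   # every /24 inside the VPC range

# Illustrative layout: two public and two private subnets (e.g. one pair per AZ).
public = subnets[0:2]
private = subnets[2:4]

print("public: ", [str(s) for s in public])
print("private:", [str(s) for s in private])

# AWS reserves five addresses in each subnet, so fewer are usable than the raw count.
print("usable per /24:", subnets[0].num_addresses - 5)
```

Whether a subnet is actually public or private is decided by its route table (a route to an internet gateway or not), not by the address range itself.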
14. What is IAM in AWS?
IAM (Identity and Access Management) controls who can access AWS resources and what actions they can perform. Core components: Users (individual identities), Groups (collections of users sharing policies), Roles (identities assumed by AWS services or applications — no long-term credentials), and Policies (JSON documents defining allowed or denied actions on specific resources). Best practices: grant only the permissions needed (least privilege), enable MFA for all human users, use roles instead of access keys on EC2 instances, and regularly review permissions using IAM Access Analyzer.
15. What is Amazon RDS?
Amazon RDS (Relational Database Service) is a managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Amazon Aurora. RDS handles provisioning, patching, backups, monitoring, and replication. Multi-AZ deployments provide high availability through synchronous replication to a standby instance that automatically takes over during an outage. Read replicas offload read traffic. Amazon Aurora is AWS's MySQL- and PostgreSQL-compatible database built for the cloud; AWS claims up to five times the throughput of standard MySQL, with storage that scales automatically up to at least 128 TiB per cluster.
16. What is Amazon DynamoDB?
DynamoDB is a fully managed, serverless key-value and document NoSQL database designed for single-digit millisecond performance at any scale. It is schemaless (except for the primary key), automatically replicates across multiple AZs, and scales throughput and storage seamlessly. Global tables provide multi-region, multi-master replication. DynamoDB Streams enable event-driven processing of data changes. It is ideal for gaming leaderboards, shopping carts, IoT event storage, and any application requiring consistent, high-performance reads and writes at massive scale.
17. What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service optimised for complex SQL queries on large datasets. It uses columnar storage, data compression, and massively parallel processing (MPP) across a cluster of nodes to deliver fast query performance. Redshift Spectrum allows querying data directly in S3 without loading it. Redshift Serverless eliminates cluster management. It integrates with BI tools (Tableau, Power BI, Looker) and is the analytics layer in many cloud data warehouse architectures alongside S3 data lakes.
18. What is Amazon SQS and SNS?
SQS (Simple Queue Service) is a fully managed message queue that decouples components of distributed applications. Producers send messages to a queue; consumers poll and process them independently. Standard queues offer at-least-once delivery with best-effort ordering, so consumers should be idempotent; FIFO queues preserve order within a message group and provide exactly-once processing through deduplication. SNS (Simple Notification Service) is a fully managed pub/sub messaging service where publishers send messages to topics and multiple subscribers (SQS, Lambda, HTTP, email) each receive a copy. Together, SQS and SNS enable event-driven, loosely coupled architectures where components communicate asynchronously.
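The fan-out pattern (one topic, several queues, each consumer working at its own pace) can be sketched in memory. The `Topic` and `Queue` classes are hypothetical stand-ins, not the SNS or SQS APIs, and they do not model retries, visibility timeouts, or delivery guarantees.

```python
from collections import deque


class Queue:
    """In-memory stand-in for a message queue."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """In-memory stand-in for a pub/sub topic that copies messages to subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, body):
        for queue in self._subscribers:
            queue.send(body)


orders = Topic()
billing, shipping = Queue("billing"), Queue("shipping")
orders.subscribe(billing)
orders.subscribe(shipping)

orders.publish({"order_id": 1})
print(billing.receive(), shipping.receive())
```

The publisher never references the consumers directly, so a new subscriber can be added without changing the producing service.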
19. What is AWS Lambda?
AWS Lambda is a serverless compute service that runs code in response to events without provisioning servers. You upload code (Python, Node.js, Java, Go, among others) and Lambda handles scaling, availability, and patching. Pricing is per invocation and duration; there is no charge when idle. Lambda functions can run up to 15 minutes and use up to 10GB memory. Triggers include HTTP requests via API Gateway, S3 uploads, DynamoDB stream changes, SQS messages, and EventBridge schedules. Cold start latency (often in the range of 100ms to 1s, and longer for some runtimes and package sizes) is a limitation for latency-sensitive workloads.
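A minimal sketch of a Python handler for an S3 upload trigger is shown below. The `lambda_handler(event, context)` signature and the `Records[].s3` layout follow the documented S3 event format; the bucket name, object key, and response shape are hypothetical, and the sample event is trimmed to the fields the code reads.

```python
import json
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the S3 URIs of the objects named in an S3 event notification."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"objects": uploaded})}


# Hypothetical, trimmed-down event for local testing (no AWS call is made).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "reports/2024+q1.csv"}}}
    ]
}
print(lambda_handler(sample_event, None))
```

Because the handler is a plain function, it can be exercised locally with a sample event before it is deployed.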
20. What is Amazon CloudWatch?
Amazon CloudWatch is the monitoring and observability service for AWS. It collects metrics from AWS resources (EC2 CPU, RDS connections, Lambda errors), custom application metrics, and logs. CloudWatch Alarms trigger notifications (SNS) or automated actions (Auto Scaling) when metrics breach thresholds. CloudWatch Logs Insights enables querying log data with a purpose-built, pipe-based query language. CloudWatch Dashboards provide visual monitoring. It is usually the first tool to check when debugging AWS resource issues and is essential for production operations.
Cloud Architecture
21. What is a microservices architecture?
Microservices architecture structures an application as a collection of small, independently deployable services, each responsible for a specific business capability and communicating via APIs (REST, gRPC) or message queues. It contrasts with monolithic architecture where all functionality is deployed as a single unit. Microservices enable independent deployment, scaling, and fault isolation — a failure in one service does not crash the entire application. Challenges include distributed system complexity, network latency, data consistency across services, and operational overhead for monitoring and service discovery.
22. What is Infrastructure as Code (IaC)?
IaC is the practice of provisioning and managing cloud infrastructure through machine-readable configuration files rather than manual processes. Benefits include version control for infrastructure, reproducibility across environments, and automated provisioning. AWS CloudFormation uses JSON/YAML templates to provision AWS resources. Terraform (HashiCorp) uses HCL and is cloud-agnostic — it supports AWS, Azure, GCP, and others. IaC is a DevOps best practice that treats infrastructure the same as application code, enabling CI/CD pipelines for infrastructure changes.
23. What is Docker and how does it differ from a virtual machine?
Docker packages applications and their dependencies into portable containers. A container includes the application code, runtime, system libraries, and configuration in an isolated environment. Unlike virtual machines, containers share the host OS kernel — making them lightweight (megabytes vs. gigabytes), fast to start (seconds vs. minutes), and more resource-efficient. VMs provide stronger isolation (separate kernel). Docker uses a Dockerfile to define the image. Container images are stored in registries (Docker Hub, Amazon ECR, Azure ACR).
24. What is Kubernetes?
Kubernetes (K8s) is the industry-standard open-source container orchestration platform that automates deployment, scaling, networking, and management of containerised applications. Core objects include Pods (the smallest deployable unit), Deployments (manage replica pods with rolling updates), Services (stable network endpoint for a set of pods), ConfigMaps and Secrets (externalised configuration), and HorizontalPodAutoscaler (auto-scaling). AWS EKS, Azure AKS, and GCP GKE are managed Kubernetes services. kubectl is the CLI for interacting with clusters.
25. What is CI/CD?
CI/CD (Continuous Integration / Continuous Delivery or Deployment) automates the software delivery pipeline. Continuous Integration automatically builds and tests code whenever developers commit, catching issues early. Continuous Delivery automates deployment to staging with manual approval for production. Continuous Deployment automatically deploys every passing build to production. AWS CodePipeline, GitHub Actions, GitLab CI, Jenkins, and Azure DevOps implement CI/CD pipelines. CI/CD reduces manual effort, accelerates release cycles, and improves code quality by making the delivery process repeatable.
26. What is a data lake vs. a data warehouse?
A data lake stores raw, unprocessed data in its native format (structured, semi-structured, unstructured) at low cost — typically on object storage like S3 or Azure Data Lake Storage Gen2. Structure is applied when data is queried (schema-on-read). A data warehouse stores structured, processed, and modelled data optimised for analytical queries — schema is defined upfront (schema-on-write). Data lakes are flexible and cheap but require more effort to query. Data warehouses are fast and governed but less flexible. The Lakehouse architecture (Delta Lake, Apache Iceberg) combines both paradigms.
27. What is auto-scaling?
Auto-scaling automatically adjusts the number of compute resources in response to demand, ensuring performance during peaks while minimising cost during quiet periods. AWS Auto Scaling Groups define minimum, maximum, and desired instance counts with scaling policies based on CloudWatch metrics. Target tracking policies maintain a target metric value (e.g., 70% CPU utilisation). Scheduled scaling handles predictable traffic patterns. Predictive scaling uses ML to pre-scale before anticipated demand. Container auto-scaling in Kubernetes uses the Horizontal Pod Autoscaler (HPA).
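The sizing step behind metric-based scaling can be sketched with the proportional rule documented for the Kubernetes HPA (desired = ceil(current × metric / target)), clamped to the group's minimum and maximum. Using the same rule to illustrate AWS target tracking is an assumption made here for simplicity; AWS does not publish its policy in this form.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Proportional scaling rule, clamped to the configured bounds."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    proposed = math.ceil(current * metric / target)
    return max(minimum, min(maximum, proposed))


# Hypothetical group: 4 instances, 70% CPU target, bounds of 2 to 10 instances.
print(desired_capacity(4, metric=90, target=70, minimum=2, maximum=10))   # scale out
print(desired_capacity(4, metric=35, target=70, minimum=2, maximum=10))   # scale in
print(desired_capacity(8, metric=140, target=70, minimum=2, maximum=10))  # capped at max
```

Real autoscalers add cooldowns and tolerance bands around this calculation so that small metric fluctuations do not cause constant resizing.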
28. What is a CDN and how does it improve performance?
A CDN (Content Delivery Network) caches content at edge locations geographically close to users so requests are served from the nearest edge rather than the origin server, which can substantially reduce latency. For dynamic content, CDNs optimise routing through their own network backbone. Amazon CloudFront operates several hundred edge locations worldwide. CDN cache behaviour is controlled via TTL (time to live) settings and cache invalidation. CDNs also absorb DDoS traffic at the edge before it reaches the origin. For any application with global users or large static assets, a CDN is essential for performance and cost.
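TTL-based caching and invalidation at a single edge location can be sketched as follows. `EdgeCache` is a hypothetical toy, not a CDN API; time is passed in explicitly so the behaviour is easy to follow.

```python
class EdgeCache:
    """Toy edge cache: serves stored content until its TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._store = {}          # path -> (body, expires_at)
        self.origin_hits = 0

    def get(self, path, now):
        entry = self._store.get(path)
        if entry is not None and now < entry[1]:
            return entry[0]       # cache hit: the origin is not contacted
        body = self._fetch(path)  # cache miss or expired entry
        self.origin_hits += 1
        self._store[path] = (body, now + self._ttl)
        return body

    def invalidate(self, path):
        self._store.pop(path, None)


edge = EdgeCache(lambda path: f"content of {path}", ttl_seconds=60)
edge.get("/logo.png", now=0)    # miss, fetched from origin
edge.get("/logo.png", now=30)   # hit, served from the edge
edge.get("/logo.png", now=61)   # TTL expired, fetched again
print("origin requests:", edge.origin_hits)
```

A longer TTL lowers origin load but delays the visibility of updates, which is why explicit invalidation exists alongside it.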
29. What is event-driven architecture?
Event-driven architecture structures applications around the production, detection, and reaction to events — state changes that trigger actions. Components communicate asynchronously via an event bus or message broker. In AWS: producers emit events to EventBridge or SNS, and consumers (Lambda, SQS, Step Functions) process them independently. Event-driven architecture enables loose coupling, high scalability, and resilience. Patterns include event sourcing (storing all state changes as events) and CQRS (Command Query Responsibility Segregation). It is the dominant pattern for serverless and microservices architectures.
30. What is a multi-cloud strategy?
A multi-cloud strategy uses services from two or more cloud providers (e.g., AWS for primary workloads, Azure for Microsoft 365 integration, GCP for BigQuery analytics) to avoid vendor lock-in, optimise costs, improve resilience, and leverage each provider's strengths. Challenges include managing multiple consoles and billing, inconsistent security policies, data egress costs between clouds, and the overhead of maintaining cloud-agnostic infrastructure code. Kubernetes, Terraform, and open standards (OpenTelemetry, CloudEvents) reduce multi-cloud friction.
Cloud Security
31. What is encryption at rest and in transit?
Encryption at rest protects data stored on disk, typically using AES-256 with keys managed by AWS KMS (either AWS managed or customer managed). Encryption in transit protects data moving between services using TLS/HTTPS. AWS Certificate Manager provisions and manages SSL/TLS certificates. Best practice: enable encryption for all data stores (S3 default encryption, RDS encrypted instances) and enforce TLS, for example with bucket policies that deny requests where aws:SecureTransport is false and by exposing only HTTPS listeners on load balancers. Many compliance frameworks (PCI-DSS, HIPAA, GDPR) require or strongly expect both forms of encryption. Never store unencrypted sensitive data in cloud storage.
32. What is AWS KMS?
AWS KMS (Key Management Service) is a managed service for creating and controlling cryptographic keys. KMS integrates natively with most AWS services (S3, EBS, RDS, Lambda) for envelope encryption: data is encrypted with a data encryption key (DEK), and the DEK is itself encrypted by a KMS key (formerly called a Customer Master Key, or CMK). KMS provides key rotation, audit logging via CloudTrail, and access control via key policies and IAM policies. Customer managed keys give more control than AWS managed keys. Where dedicated hardware security modules (HSMs) are required, AWS CloudHSM provides FIPS 140-2 Level 3 validated key storage.
33. What is a security group vs. a NACL?
A Security Group is a stateful virtual firewall attached to an EC2 instance or other resource. Stateful means return traffic is automatically allowed for established connections. Rules only allow traffic; there are no explicit deny rules. A Network ACL (NACL) is a stateless, subnet-level firewall. Stateless means return traffic must be explicitly allowed. NACLs support both allow and deny rules, evaluated in order by rule number. Security groups are the primary network security control; NACLs add a subnet-level defence-in-depth layer. A newly created security group allows all outbound traffic and no inbound traffic; a VPC's default security group additionally allows inbound traffic from other members of the same group.
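The difference in rule evaluation can be sketched in a few lines: NACL rules are checked in ascending rule-number order and the first match decides, while a security group is simply a set of allows. The data model below is a hypothetical simplification that matches on port only and does not model statefulness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    port: int
    action: str  # "allow" or "deny"


def nacl_decision(rules, port):
    """First matching rule in ascending rule-number order decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action
    return "deny"  # nothing matched: the implicit final rule denies


def security_group_allows(allowed_ports, port):
    """Security groups hold only allow rules; anything not listed is refused."""
    return port in allowed_ports


rules = [NaclRule(200, 22, "allow"), NaclRule(100, 22, "deny"), NaclRule(300, 443, "allow")]
print(nacl_decision(rules, 22))    # rule 100 is evaluated before rule 200
print(nacl_decision(rules, 443))
print(security_group_allows({443}, 22))
```

Because order matters for NACLs, leaving gaps between rule numbers makes it easier to insert a deny ahead of an existing allow later.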
34. What is the principle of least privilege in cloud IAM?
Least privilege means granting identities only the exact permissions required for their specific function. Implemented by: using specific resource ARNs instead of wildcards in policies, using conditions (IP range, MFA required) to restrict access, avoiding the AdministratorAccess managed policy, using IAM Access Analyzer to identify unused permissions, regularly reviewing and revoking stale permissions, preferring roles (temporary credentials) over users (long-term credentials), and using Service Control Policies (SCPs) in AWS Organizations to enforce guardrails across all accounts.
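A least-privilege policy and the way it is evaluated can be sketched as below. The policy document uses the standard IAM JSON structure; the bucket name is hypothetical, and `is_allowed` is a deliberately simplified evaluator (explicit deny wins, otherwise an allow is required, otherwise implicit deny) that ignores conditions, principals, and the other policy types real IAM evaluation considers.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: read access to one bucket, except a restricted prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": ["arn:aws:s3:::example-reports/*"]},
        {"Effect": "Deny",
         "Action": ["s3:*"],
         "Resource": ["arn:aws:s3:::example-reports/restricted/*"]},
    ],
}


def is_allowed(policy, action, resource):
    """Simplified evaluation: explicit deny > explicit allow > implicit deny."""
    allowed = False
    for statement in policy["Statement"]:
        matches = (any(fnmatchcase(action, a) for a in statement["Action"])
                   and any(fnmatchcase(resource, r) for r in statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-reports/2024/q1.csv"))
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-reports/restricted/pay.csv"))
print(is_allowed(policy, "s3:PutObject", "arn:aws:s3:::example-reports/2024/q1.csv"))
```

Naming a specific action and a specific resource ARN, rather than `*`, is what keeps the blast radius of a leaked credential small.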
35. What is Amazon GuardDuty?
Amazon GuardDuty is a threat detection service that continuously monitors AWS accounts for malicious activity using machine learning, anomaly detection, and threat intelligence. It analyses CloudTrail events, VPC Flow Logs, and DNS logs without requiring agents. GuardDuty detects threats such as EC2 instances communicating with known command-and-control servers, unusual API calls from foreign IPs, credential compromise, and cryptocurrency mining. Findings are categorised by severity and can trigger automated responses via Lambda. It requires no configuration beyond enabling it in the AWS console.
DevOps & Advanced
36. What is Terraform?
Terraform is an Infrastructure as Code tool by HashiCorp for defining cloud infrastructure in HCL (HashiCorp Configuration Language) configuration files. It was open source until 2023 and is now distributed under the Business Source License; OpenTofu is an open-source fork. It supports over 3,000 providers (AWS, Azure, GCP, Kubernetes). Workflow: terraform init (initialise, download providers), terraform plan (preview changes), terraform apply (create or update infrastructure), terraform destroy (tear down). Terraform state tracks current infrastructure. Remote state in S3 with locking (commonly via DynamoDB) enables team collaboration. Modules enable reusable, composable infrastructure components shared across teams.
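The idea behind a plan step, comparing the declared configuration with recorded state and listing what would change, can be sketched conceptually. This is not Terraform's algorithm or data model; the resource names and attributes are hypothetical.

```python
def plan(current: dict, desired: dict) -> dict:
    """Diff recorded state against desired configuration, by resource name."""
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(name for name in desired.keys() & current.keys()
                         if desired[name] != current[name]),
        "destroy": sorted(current.keys() - desired.keys()),
    }


# Hypothetical state (what exists) and configuration (what is declared).
state = {
    "bucket.logs": {"versioning": False},
    "instance.web": {"type": "t3.micro"},
}
config = {
    "bucket.logs": {"versioning": True},
    "instance.api": {"type": "t3.small"},
}

print(plan(state, config))
```

Reviewing a diff like this before applying it is what makes infrastructure changes auditable in the same way as code changes.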
37. What is the difference between blue-green and canary deployments?
Blue-green deployment maintains two identical production environments (blue = current, green = new version). Traffic switches entirely from blue to green at cutover — rollback is instant by switching back. Zero downtime but doubles infrastructure cost during deployment. Canary deployment gradually routes a small percentage of traffic (e.g., 5%) to the new version, monitors metrics, and incrementally increases traffic if stable. Canary is more sophisticated and limits blast radius of bad deployments. Feature flags provide a software-level canary approach without infrastructure changes.
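One common way to split canary traffic is to hash a stable identifier into a bucket from 0 to 99 and send the lowest buckets to the new version. The sketch below assumes a hypothetical `user_id` string as that identifier; managed load balancers and service meshes offer weighted routing that achieves a similar result without application code.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below the threshold to the canary."""
    return "canary" if bucket_for(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 25, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(f"canary at {percent}%: observed share {share:.1%}")
```

Because the bucket depends only on the identifier, a user who sees the canary at 5% keeps seeing it as the percentage is raised, which avoids flip-flopping between versions.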
38. What is observability in cloud systems?
Observability is the ability to understand the internal state of a system from its external outputs. It comprises three pillars: Metrics (numerical measurements over time — CPU, request count, error rate), Logs (discrete events with timestamps for debugging), and Traces (end-to-end tracking of a request across multiple services in a distributed system). Tools include CloudWatch and Datadog (metrics), ELK Stack/Splunk (logs), and AWS X-Ray/OpenTelemetry (tracing). High observability reduces MTTR (Mean Time to Recovery) when incidents occur.
39. What is FinOps and cloud cost optimisation?
FinOps is the practice of managing cloud costs through cross-functional collaboration between engineering, finance, and business teams. Key optimisation strategies include rightsizing (matching instance types to actual utilisation), using Reserved Instances or Savings Plans for predictable workloads (roughly 30-70% savings), Spot Instances for fault-tolerant workloads (up to 90% savings), scheduling non-production resources off during nights and weekends, implementing auto-scaling, eliminating idle resources, and using Graviton (ARM-based) instances, which are typically priced up to about 20% lower than comparable x86 instances. AWS Cost Explorer, Cost and Usage Reports, and Trusted Advisor support cost analysis.
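The saving from scheduling non-production resources is straightforward arithmetic. The sketch below assumes cost is proportional to running hours and uses a hypothetical schedule of 12 hours a day on weekdays; storage and other always-on charges are ignored.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of compute cost saved versus running around the clock."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Hypothetical dev environment: on for 12 hours a day, Monday to Friday.
print(f"{scheduled_savings(12, 5):.1%} saved")
```

For this schedule the environment runs 60 of 168 hours, so close to two thirds of the compute cost disappears without any change to instance sizes.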
40. What is a disaster recovery strategy?
Disaster recovery (DR) restores IT operations after a catastrophic failure. Key metrics: RTO (Recovery Time Objective) is the maximum acceptable downtime; RPO (Recovery Point Objective) is the maximum acceptable data loss, measured as time. Common DR strategies in increasing order of cost and complexity: Backup and Restore (hours to days RTO), Pilot Light (minimal infrastructure in DR region, scales up during disaster), Warm Standby (scaled-down production replica), and Active-Active (full production in multiple regions, near-zero RTO/RPO). AWS Elastic Disaster Recovery (the successor to CloudEndure Disaster Recovery) and Azure Site Recovery automate DR replication.
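Choosing a strategy comes down to comparing what each option can deliver against the agreed RTO and RPO. The sketch below uses hypothetical figures for each strategy; real values depend on data volume, automation, and testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    data_loss_minutes: float   # worst-case gap since the last backup or replica sync
    recovery_minutes: float    # time to restore service


def meets_objectives(option: DrOption, rpo_minutes: float, rto_minutes: float) -> bool:
    return option.data_loss_minutes <= rpo_minutes and option.recovery_minutes <= rto_minutes


# Hypothetical figures, listed in increasing order of cost.
options = [
    DrOption("Backup and Restore", data_loss_minutes=24 * 60, recovery_minutes=12 * 60),
    DrOption("Pilot Light", data_loss_minutes=15, recovery_minutes=90),
    DrOption("Warm Standby", data_loss_minutes=5, recovery_minutes=20),
    DrOption("Active-Active", data_loss_minutes=0, recovery_minutes=1),
]

# Hypothetical business requirement: lose at most 15 minutes of data, be back within 2 hours.
cheapest = next(o.name for o in options if meets_objectives(o, rpo_minutes=15, rto_minutes=120))
print(cheapest)
```

Picking the first option that satisfies both objectives avoids paying for a stricter strategy than the business actually requires.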
41. What is Amazon Redshift Spectrum?
Redshift Spectrum allows Redshift to query data stored directly in S3 without first loading it into Redshift tables. Queries are distributed across a shared fleet of Redshift Spectrum nodes (potentially thousands, depending on the query) that read data in parallel from S3. This enables querying petabytes of data in open formats (Parquet, ORC, JSON, CSV) without the cost of loading and storing all data in Redshift. Spectrum is ideal for infrequently accessed historical data, data lake queries, and hybrid architectures where recent data is in Redshift and historical data stays in S3. Performance improves significantly when data is in columnar format (Parquet/ORC).
42. What is Amazon Kinesis?
Amazon Kinesis is a family of services for real-time data streaming. Kinesis Data Streams is a durable, scalable data stream that retains records for 24 hours by default and up to 365 days when configured, allowing multiple consumers to process the same stream. Amazon Data Firehose (formerly Kinesis Data Firehose) automatically loads streaming data into S3, Redshift, or OpenSearch (formerly Elasticsearch) and is the simplest option for data lake ingestion. Kinesis Data Analytics, now offered as Amazon Managed Service for Apache Flink, runs Apache Flink applications (and historically SQL queries) on live streams for real-time analytics. Kinesis Video Streams ingests, stores, and processes video streams. Together they form a complete real-time data pipeline platform on AWS.
43. What is AWS Glue?
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that makes it easy to move and transform data between data stores. Glue Data Catalog is a central metadata repository (database and table definitions) that integrates with Athena, Redshift Spectrum, and EMR. Glue Crawlers automatically discover the schema of data in S3 and update the catalog. Glue ETL Jobs run PySpark or Python scripts to transform data at scale. Glue Studio provides a visual drag-and-drop interface for building ETL pipelines. Glue is the primary managed ETL tool in AWS data lake architectures.
44. What is Amazon Athena?
Amazon Athena is an interactive, serverless query service that allows you to analyse data directly in S3 using standard SQL without loading it into a database. You pay only for the queries you run (per TB scanned). Performance and cost are optimised by partitioning data (filtering to specific partitions avoids scanning irrelevant data), using columnar formats (Parquet/ORC scan less data than CSV), and using compression. Athena integrates with AWS Glue Data Catalog for schema management and with QuickSight for visualisation. It is the standard tool for ad-hoc data lake queries on AWS.
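Because billing is per TB scanned, the effect of partitioning and columnar formats can be estimated with simple arithmetic. The price of 5.00 per TB, the treatment of 1 TB as 2**40 bytes, and the reduction factors are all assumptions for illustration; check current pricing and measure real scan sizes before relying on such numbers.

```python
BYTES_PER_TB = 2 ** 40   # assumption: binary terabyte
PRICE_PER_TB = 5.00      # assumption: placeholder price, not a quoted rate


def query_cost(bytes_scanned: float, price_per_tb: float = PRICE_PER_TB) -> float:
    """Cost of one query given how many bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


full_scan = 1 * BYTES_PER_TB          # unpartitioned CSV: the whole dataset is read
one_partition = full_scan / 30        # hypothetical: daily partitions, query needs 1 of 30 days
columnar = one_partition * 0.10       # hypothetical: query reads 10% of the columns

for label, size in [("full CSV scan", full_scan),
                    ("partition pruning", one_partition),
                    ("pruning + columnar", columnar)]:
    print(f"{label:20s} {query_cost(size):.4f}")
```

Under these assumptions the same question costs a few hundred times less once the data is partitioned and stored in a columnar format.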
45. What is the CAP theorem?
The CAP theorem states that a distributed system can only guarantee two of three properties simultaneously: Consistency (all nodes see the same data at the same time), Availability (every request receives a response), and Partition Tolerance (the system continues operating despite network partitions). Since network partitions are inevitable in distributed systems, real-world systems must choose between Consistency + Partition Tolerance (CP — HBase, ZooKeeper) or Availability + Partition Tolerance (AP — Cassandra, DynamoDB). Understanding CAP guides the choice of database and architecture for distributed cloud applications.
46. What is a cloud audit trail?
A cloud audit trail records the actions taken by users and services (API calls, console logins, resource changes), providing a chronological log for security forensics, compliance, and operational troubleshooting. AWS CloudTrail records API activity (who, what, when, from where) and can deliver it to S3; management events are logged by default, while data events such as S3 object reads must be enabled explicitly. CloudTrail Insights detects unusual API activity patterns. Azure Monitor Activity Log serves the same purpose. Audit trails are required or expected by compliance frameworks like PCI-DSS, HIPAA, SOC 2, and ISO 27001. Immutable logs (S3 with Object Lock) prevent tampering with evidence.
47. What is Amazon ECS vs EKS?
Both run containers on AWS but differ in orchestration. ECS (Elastic Container Service) is AWS's proprietary container orchestration service, simpler to operate and deeply integrated with AWS services (IAM, CloudWatch, ALB). Fargate launch type removes the need to manage EC2 instances. EKS (Elastic Kubernetes Service) is a managed Kubernetes service, using the industry-standard Kubernetes API. EKS offers full Kubernetes compatibility and is preferred when portability across clouds matters or when your team already knows Kubernetes. ECS is simpler for AWS-only workloads; EKS is preferred for Kubernetes-native teams.
48. What is object storage and when is it preferred over block storage?
Object storage stores data as objects (unique ID, data, metadata) in a flat namespace, accessed via HTTP APIs. It scales to virtually unlimited capacity at low cost. Examples: AWS S3, Azure Blob Storage. Block storage stores data in fixed-size blocks like a traditional hard drive; it is attached to a server as a raw volume, and the operating system places a file system on top of it for low-latency access. Examples: AWS EBS, Azure Managed Disks, used as persistent volumes for VMs. Object storage is preferred for large-scale unstructured data (files, images, backups, data lakes) and global access. Block storage is required for databases, OS volumes, and workloads needing low-latency random I/O.
49. What is a managed service vs. a self-managed service?
A managed service is one where the provider handles operational tasks such as provisioning, patching, backups, scaling, high availability, and monitoring. Examples: RDS (vs. running MySQL on EC2), Amazon MSK (Kafka), Amazon OpenSearch Service. Self-managed means deploying and operating the software on VMs yourself, with full control but full operational responsibility. Managed services trade customisation for operational simplicity. They are preferred by most teams because they let engineers focus on application logic rather than infrastructure operations. The cost premium is often justified by reduced operational overhead and improved reliability.
50. How do you design a highly available and fault-tolerant cloud architecture?
Key principles: deploy across multiple Availability Zones to survive single AZ failures; use load balancers to distribute traffic and route around failed instances; implement auto-scaling to replace failed instances automatically; use managed services (RDS Multi-AZ, DynamoDB Global Tables) with built-in HA; decouple components with message queues (SQS) so failures do not cascade; design stateless application tiers so any instance can serve any request; implement health checks, circuit breakers, and retry logic with exponential backoff; back up data across regions for disaster recovery; and test failure scenarios regularly with chaos engineering (AWS Fault Injection Simulator).
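Of the principles above, retry logic with exponential backoff is the easiest to show in code. The sketch below adds random jitter (a common refinement) and takes the sleep and random functions as parameters so it can be tested without waiting; the function name and defaults are hypothetical.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.1, max_delay=5.0,
          sleep=time.sleep, rng=random.random):
    """Call operation, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(ceiling * rng())


calls = {"count": 0}


def flaky_call():
    """Hypothetical dependency that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry(flaky_call, sleep=lambda seconds: None))
```

In production code the `except` clause would normally be narrowed to errors known to be transient, so that permanent failures are not retried.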