If you are facing high response times or application crashes while dealing with large datasets, this blog is for you. We will explore different strategies for optimizing queries and for scaling a database that holds large datasets.
1. Analyze Before You Optimize: Understanding the Query Execution Plan
Before optimizing a query, we should analyze its plan using the query plan analysis tools the database provides, such as:
- EXPLAIN: available in MySQL, PostgreSQL, and most other modern databases; it shows the plan the database engine will use to execute the query.
- EXPLAIN ANALYZE: actually runs the query and returns the execution plan together with actual row counts and execution time.
EXPLAIN ANALYZE SELECT e.name FROM employees e WHERE e.salary > 50000 AND e.name LIKE '%son%';
Output:
Seq Scan on employees e  (cost=0.00..14500.00 rows=12000 width=32) (actual time=0.012..98.456 rows=11500 loops=1)
  Filter: ((salary > 50000) AND (name ~~ '%son%'::text))
  Rows Removed by Filter: 988500
Planning Time: 0.189 ms
Execution Time: 98.776 ms
Bottlenecks Identified:
- Seq Scan on a large table
- Rows Removed by Filter: nearly 1 million — means the scan is expensive
- Execution Time is high (98ms) due to full table scan and string pattern matching
It is worth running these analysis tools again after optimizing, to confirm where the performance was gained and where further improvement is possible, so that we keep writing well-optimized queries in the future.
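For instance, after adding a suitable index (covered in the next section), you might re-run the same statement and compare plans; a minimal sketch:

EXPLAIN ANALYZE SELECT e.name FROM employees e WHERE e.salary > 50000 AND e.name LIKE '%son%';

If the optimization worked, the plan should show an index or bitmap scan instead of the sequential scan, fewer rows removed by the filter, and a lower execution time.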
2. Indexing: For Faster Lookups
If the execution plan shows that your query is doing a sequential scan, adding an index is often the fix. The most commonly used index types are:
B+ Tree Indexes:
B+ tree indexes (the default in most relational databases) are well suited to equality lookups, sorting, and range queries, because they keep keys in sorted order.
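A minimal sketch for the salary filter from the earlier example (the index name is illustrative):

-- B-tree index on salary; CREATE INDEX builds a B-tree by default in PostgreSQL and MySQL
CREATE INDEX idx_employees_salary ON employees (salary);

Whether the planner actually uses it depends on how selective the predicate is, so re-check the plan with EXPLAIN ANALYZE after creating it.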
GIN (Generalized Inverted Index):
GIN indexes excel when dealing with columns containing multiple values, such as:
- JSONB columns
- Array columns
- Full-text search fields
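A minimal sketch, assuming a hypothetical documents table with a JSONB attributes column:

-- GIN index over a JSONB column so containment queries (@>) can use the index
CREATE INDEX idx_documents_attributes ON documents USING GIN (attributes);
-- Containment query that can be answered via the GIN index
SELECT id FROM documents WHERE attributes @> '{"status": "active"}';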
3. Query Rewriting:
Sometimes rewriting a query in a different form returns the same results with a lower response time. Here are some common cases:
Use Joins Instead of Subqueries:
Instead of this:
SELECT name FROM employees WHERE department_id IN (SELECT id FROM departments WHERE location = 'New York');
Try this:
SELECT e.name FROM employees e JOIN departments d ON e.department_id = d.id WHERE d.location = 'New York';
Reasons to Prefer JOIN:
1. Better Performance on Large Datasets:
- Most modern databases optimize JOINs more efficiently than subqueries.
- IN (SELECT ...) can create a temporary result set in memory and compare every value from the outer query to this set, which can be expensive for large datasets.
2. Index Utilization:
- With a JOIN, indexes on the join keys (department_id, id) are often used more effectively by the query planner than with IN (see the sketch after this list).
3. Readability:
- JOINs provide more transparent relationships between tables, which is helpful for both debugging and understanding the logic.
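A minimal sketch of the supporting index, assuming departments.id is already the primary key (the index name is illustrative):

-- Index on the join key so the planner can use an index scan or an efficient hash/merge join
CREATE INDEX idx_employees_department_id ON employees (department_id);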
Use EXISTS Instead of IN for Related Data
Instead of this:
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE region = 'West');
Try this:
SELECT o.* FROM orders o WHERE EXISTS ( SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = 'West' );
Reasons to Prefer EXISTS:
1. Efficiency with Large Inner Queries:
- EXISTS stops as soon as it finds one matching row, which makes it more efficient when the subquery returns many rows.
- IN must evaluate all possible values, which can be slow if the inner query returns a large result set.
2. NULL Handling:
- The negated form, NOT IN, behaves surprisingly if the subquery returns any NULL value: a single NULL makes the whole condition evaluate to unknown, so the query returns no rows at all.
- EXISTS and NOT EXISTS are immune to this issue because they check for the existence of rows rather than comparing specific values (see the sketch after this list).
3. Optimized Short-Circuiting:
- EXISTS often leverages short-circuiting, meaning the DB engine will exit as soon as it finds a match, reducing workload
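To illustrate the NULL point, a hedged sketch of the anti-join case, where the difference actually bites:

-- Risky: if the subquery returns any NULL (for example, when selecting a nullable column),
-- NOT IN evaluates to unknown and this query returns no rows at all
SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM customers WHERE region = 'West');

-- Safe: NOT EXISTS only checks whether a matching row exists, so NULLs cannot poison the result
SELECT o.* FROM orders o WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = 'West');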
4. Use Temporary Tables or Common Table Expressions (CTEs):
When dealing with large datasets or complex queries, breaking down the logic into manageable parts can significantly improve performance and readability. That's where Temporary Tables and Common Table Expressions (CTEs) come in.
WITH high_salary_employees AS (
  SELECT * FROM employees WHERE salary > 50000
)
SELECT name FROM high_salary_employees WHERE name LIKE '%son%';
Temporary tables allow you to store intermediate results and reuse them in subsequent steps of your query or process. They're especially helpful when:
- You're performing expensive calculations or aggregations
- You need to reuse the same dataset multiple times
- You want to materialize a filtered subset of a large table
CREATE TEMP TABLE high_salary_employees AS
  SELECT * FROM employees WHERE salary > 50000;

SELECT name FROM high_salary_employees WHERE name LIKE '%son%';
5. Table Partitioning: Divide and Conquer
Splitting a database table into smaller physical parts is called partitioning.
Scenario:
You manage a PostgreSQL table named loan_payments that stores all EMI payments for mortgage loans. This table has over 100 million rows, growing daily as borrowers pay their EMIs.
A common query is:
SELECT * FROM loan_payments WHERE payment_date BETWEEN '2024-01-01' AND '2024-01-31';
This query is frequently run for reporting, compliance, or monthly summaries — but it becomes slower over time as the table grows.
Reason to Use Partitioning:
Partitioning improves performance and manageability by dividing a large table into smaller, more manageable parts, without changing the way users query it.
Why Use It:
1. Improved Query Performance:
- If partitioned by payment_date, the query only scans the relevant partitions (e.g., January 2024), not the entire table.
- This is known as partition pruning.
2. Faster Maintenance & Archival:
- You can drop or archive old partitions (e.g., payments older than 5 years) without locking or rewriting the entire table.
3. Parallelism:
- Many databases can scan partitions in parallel, boosting read speed for large reports.
4. Better Index Management:
- Indexes stay smaller and more efficient when applied at the partition level.
Partitioning strategies include:
1. Range-Based Partitioning:
Rows are assigned to partitions based on a range of values in the partition key. The example below splits an orders table into two yearly partitions.
CREATE TABLE orders (
  order_id INT,
  order_date DATE,
  customer_id INT
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF orders FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
2. Hash-Based Partitioning:
Hash-based partitioning applies a hash function to a column's value (often the primary key or another suitable field) to determine which partition a given row is stored in. It is useful when you want to distribute data evenly across partitions and improve performance, particularly for large tables.
CREATE TABLE transactions (
  transaction_id SERIAL PRIMARY KEY,
  user_id INT,
  amount DECIMAL,
  transaction_date DATE
) PARTITION BY HASH (transaction_id);
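In PostgreSQL the hash partitions themselves must then be created explicitly; a minimal sketch with four partitions (the modulus and partition names are illustrative):

-- hash(transaction_id) modulo 4 decides which partition each row lands in
CREATE TABLE transactions_p0 PARTITION OF transactions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE transactions_p1 PARTITION OF transactions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE transactions_p2 PARTITION OF transactions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE transactions_p3 PARTITION OF transactions FOR VALUES WITH (MODULUS 4, REMAINDER 3);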
6. Vertical Scaling:
If you've implemented the above strategies but still face CPU bottlenecks or memory constraints, hardware upgrades may be necessary:
- Increase RAM: Allows more data to be cached in memory, reducing disk I/O (see the configuration sketch after this list)
- Upgrade CPU: Provides more processing power for complex queries
- Switch to SSDs: Dramatically faster read/write speeds compared to HDDs
- Optimize Storage Configuration: RAID configurations for improved I/O performance
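Note that extra RAM only helps if the database is configured to use it. A minimal sketch of the relevant PostgreSQL settings (the values are illustrative and depend on your workload; shared_buffers requires a restart to take effect):

-- Let PostgreSQL cache more pages in shared memory (commonly around 25% of RAM)
ALTER SYSTEM SET shared_buffers = '8GB';
-- Give sorts and hash joins more working memory per operation
ALTER SYSTEM SET work_mem = '64MB';
-- Tell the planner how much OS file-system cache is likely available
ALTER SYSTEM SET effective_cache_size = '24GB';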
Remember that vertical scaling has physical and financial limits. At some point, you'll need to consider horizontal scaling.
7. Horizontal Scaling:
Vertical scaling has limits in terms of hardware capacity and cost, which makes horizontal scaling essential for long-term scalability.
You horizontally scale a database by splitting the data across multiple servers. This usually involves sharding.
Benefits:
- Increased capacity (storage + throughput)
- Better fault tolerance (if one node goes down, others still work)
- Can handle huge traffic loads and data volumes
Challenges:
- Data distribution logic: You need to decide where each piece of data goes
- Joins and transactions across shards become harder
- You might need custom app-level logic or middleware
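How the distribution logic looks depends heavily on your stack. As one concrete option, the Citus extension for PostgreSQL lets you declare a shard key and routes queries for you; a minimal sketch, assuming Citus is installed (the shard key choice is illustrative):

-- Distribute the orders table across worker nodes, sharded by customer_id
SELECT create_distributed_table('orders', 'customer_id');
-- Queries that filter on the shard key can be routed to a single shard
SELECT * FROM orders WHERE customer_id = 42;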
8. Caching and Materialized Views:
a. Caching:
Caching means storing frequently accessed data in memory so it does not have to be fetched from a slower source (like the database) on every request. An in-memory cache such as Redis serves that hot data directly from RAM, avoiding the I/O overhead of reading from secondary storage. Keep in mind that RAM is expensive, so cache selectively.
b. Materialized Views:
A materialized view is the saved result of a query, physically stored in the database, so it is not recomputed every time it is read.
Suppose you want a dashboard showing the total number of active mortgage loans and their average outstanding balance over the last 6 months, where the data does not need to be real-time and only updates once per day.
Without a Materialized View:
Every time a user opens the dashboard, the backend runs a heavy query like:
SELECT COUNT(*) AS active_loans, AVG(outstanding_balance) FROM mortgage_loans WHERE loan_status = 'active' AND disbursement_date > NOW() - INTERVAL '6 months';
- This query scans a huge table of loan records.
- Performance is slow, especially during peak hours.
- You're wasting compute to get the same result repeatedly throughout the day.
With a Materialized View:
You create a materialized view like this:
CREATE MATERIALIZED VIEW active_loan_summary AS
SELECT COUNT(*) AS active_loans, AVG(outstanding_balance) AS avg_balance
FROM mortgage_loans
WHERE loan_status = 'active' AND disbursement_date > NOW() - INTERVAL '6 months';

Then, schedule a daily refresh:

REFRESH MATERIALIZED VIEW active_loan_summary;
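The dashboard then reads the precomputed row instead of scanning the loans table:

SELECT active_loans, avg_balance FROM active_loan_summary;

If the refresh must not block readers, PostgreSQL also offers REFRESH MATERIALIZED VIEW CONCURRENTLY, which requires a unique index on the view.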
Conclusion: Finding the Right Mix
Database optimization is rarely about applying a single technique, but rather implementing a combination of strategies tailored to your specific workload and requirements. Start with these steps:
- Analyze your current query performance
- Index appropriately based on your query patterns
- Rewrite queries for efficiency
- Partition large tables
- Scale vertically when necessary
- Implement caching for frequently accessed data
- Consider horizontal scaling for sustained growth
Remember that optimization is an ongoing process. As your data and access patterns evolve, your optimization strategies should too.