Quiz Solutions
Day 1: Python Fundamentals & Essentials
Multiple choice, single answer - Python Essentials
What is the main benefit of using virtual environments in Python?
B) They allow isolated project dependencies without conflicts
Explanation: Virtual environments create isolated Python environments where different projects can have different package versions without conflicts.
Which of the following is NOT a basic Python data structure?
C) Array (built-in)
Explanation: Python has lists, tuples, dictionaries, and sets as built-in data structures. Arrays are not a basic built-in type (though NumPy provides them).
What is the purpose of the if __name__ == "__main__": statement?
D) Both A and C
Explanation: This idiom checks if the script is run directly vs. imported as a module, preventing module code from executing during imports.
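A minimal sketch of the idiom, assuming a file that is sometimes run directly and sometimes imported (the function name `main` is a common convention, not a requirement):

```python
def main():
    """Entry point; called only when the file is executed as a script."""
    print("Running as a script")
    return 0


if __name__ == "__main__":
    # True for `python this_file.py`, False for `import this_file`
    main()
```

When another module imports this file, `main` is defined but not called.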
How does string slicing work in Python?
B) s[start:end] excludes the end index
Explanation: Python’s slice notation uses [start:end) semantics where the end index is excluded.
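The half-open behaviour can be checked with a short example (the string and variable names are illustrative only):

```python
s = "abcdefgh"

first_three = s[0:3]   # indices 0, 1, 2
middle = s[2:5]        # indices 2, 3, 4
tail = s[5:]           # omitted end means "through the last character"

print(first_three, middle, tail)

# Because the end index is excluded, adjacent slices neither overlap nor leave gaps
assert s[:3] + s[3:] == s
# For in-range indices, the length of s[start:end] is end - start
assert len(s[2:5]) == 5 - 2
```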
Which keyword is used to create a function in Python?
B) def
Explanation: The def keyword defines a new function in Python.
Multiple choice, single answer - Virtual Environments & HPC Modules
What does a virtual environment provide?
B) A sandboxed directory for project-specific packages
Explanation: Virtual environments allow you to create isolated environments with separate package installations.
What is the purpose of venv module?
B) It creates lightweight isolated Python environments
Explanation: The venv module is Python’s built-in tool for creating virtual environments.
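One way to confirm that an environment created with venv is active is to compare `sys.prefix` with `sys.base_prefix`. A minimal sketch (the helper name `in_virtualenv` is introduced here for illustration):

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```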
What is an HPC module system used for?
B) Managing access to compilers, libraries, and software versions
Explanation: Module systems (like Lmod) manage the environment, making different software versions available without conflicts.
When you load an HPC module with module load gcc/11.2, what happens?
B) It adds the GCC compiler to your current shell environment
Explanation: Loading a module modifies environment variables (PATH, LD_LIBRARY_PATH, etc.) for the current session.
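The effect on environment variables can be imitated in plain Python. This is only a simulation of the idea, not how Lmod is implemented; the install path `/opt/apps/gcc/11.2` and the helper `prepend_path` are hypothetical:

```python
import os


def prepend_path(env, variable, directory):
    """Return a copy of env with directory placed first in a PATH-like variable."""
    updated = dict(env)
    current = updated.get(variable, "")
    updated[variable] = directory + (os.pathsep + current if current else "")
    return updated


# Hypothetical install location; real clusters define this in their modulefiles
gcc_root = "/opt/apps/gcc/11.2"

env = {"PATH": "/usr/bin"}
env = prepend_path(env, "PATH", os.path.join(gcc_root, "bin"))
env = prepend_path(env, "LD_LIBRARY_PATH", os.path.join(gcc_root, "lib64"))

for name, value in env.items():
    print(f"{name}={value}")
```

Because the new directory comes first, the shell finds that version of the compiler before any system-wide one.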
Multiple choice, single answer - Benchmarking & Profiling
What is the main purpose of benchmarking code?
B) To measure execution time and performance characteristics
Explanation: Benchmarking quantifies performance to identify bottlenecks and compare implementations.
Which Python module is commonly used for timing code execution?
C) timeit
Explanation: The timeit module is designed specifically for precise timing of small code snippets.
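A small sketch of timeit in use, comparing two ways of building a list (the two functions are illustrative, not part of the quiz):

```python
import timeit


def squares_loop(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def squares_comprehension(n):
    return [i * i for i in range(n)]


# repeat() returns one total time per repetition; taking the minimum
# reduces the influence of other activity on the machine.
loop_time = min(timeit.repeat(lambda: squares_loop(1000), number=200, repeat=3))
comp_time = min(timeit.repeat(lambda: squares_comprehension(1000), number=200, repeat=3))

print(f"loop:          {loop_time:.4f}s for 200 calls")
print(f"comprehension: {comp_time:.4f}s for 200 calls")
```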
What does profiling reveal about your code?
B) Which functions consume the most CPU and memory
Explanation: Profiling tools like cProfile show where your code spends time and resources.
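A minimal cProfile sketch follows. Note that cProfile reports call counts and time only; measuring memory needs a separate tool such as the standard-library tracemalloc. The two functions are placeholders for real workload code:

```python
import cProfile
import io
import pstats


def build_squares(n):
    return [i * i for i in range(n)]


def total_of_squares(n):
    return sum(build_squares(n))


profiler = cProfile.Profile()
profiler.enable()
total = total_of_squares(200_000)
profiler.disable()

# Sort by cumulative time and print the five most expensive entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```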
What is the difference between wall-clock time and CPU time?
A) Wall-clock time includes I/O waits; CPU time is actual computation
Explanation: Wall-clock time is real elapsed time; CPU time excludes I/O and system waiting periods.
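The difference shows up when a program waits. In this sketch, `time.sleep` stands in for an I/O wait; the durations are arbitrary:

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time of this process (user + system)

time.sleep(0.2)                                  # waiting: advances wall-clock time only
checksum = sum(i * i for i in range(200_000))    # computing: advances both

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"Wall-clock time: {wall_elapsed:.3f}s")
print(f"CPU time:        {cpu_elapsed:.3f}s")
```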
Multiple choice, single answer - NumPy
Why are NumPy arrays more efficient than Python lists for numerical computations?
B) They store homogeneous data in contiguous memory blocks
Explanation: NumPy arrays use contiguous memory and fixed types, enabling vectorized operations and cache efficiency.
What does vectorization mean in NumPy?
A) Replacing loops with whole-array operations
Explanation: Vectorization uses built-in NumPy operations on entire arrays rather than Python loops, improving performance.
What is the shape of a 3×4×2 NumPy array?
C) (3, 4, 2)
Explanation: Shape is a tuple of axis lengths: axis 0 has length 3, axis 1 has length 4, and axis 2 has length 2. Total elements = 3×4×2 = 24.
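The same facts can be read directly from an array's attributes (the array of zeros is just a convenient example):

```python
import numpy as np

arr = np.zeros((3, 4, 2))

print(arr.shape)   # tuple of axis lengths
print(arr.ndim)    # number of axes
print(arr.size)    # total number of elements

# Indexing the first axis drops it, leaving a 4x2 sub-array
print(arr[0].shape)
```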
Which operation is most efficient in NumPy?
B) result = arr ** 2 (vectorized)
Explanation: Vectorized operations are implemented in compiled C code and much faster than Python loops.
Day 2: High-Performance Computing Techniques
Multiple choice, single answer - Cython
What is Cython primarily used for?
B) Compiling Python code to C for performance improvement
Explanation: Cython translates Python code to C, achieving significant speedups and allowing C-level optimizations.
What type hint syntax does Cython use to optimize code?
B) C-style type declarations like cdef int x
Explanation: Cython uses cdef for C-style variable declarations to enable static typing and optimization.
Multiple choice, single answer - Dask
What is Dask designed for?
B) Parallel computing with out-of-core data processing
Explanation: Dask extends NumPy/Pandas syntax for parallel and larger-than-memory computations.
How does Dask differ from NumPy?
B) Dask handles larger-than-memory datasets with lazy evaluation
Explanation: Dask uses lazy evaluation and distributed computing for datasets that don’t fit in RAM.
What is lazy evaluation in Dask?
B) Computations are deferred until explicitly requested
Explanation: Dask builds a task graph but only executes computation when .compute() is called.
Multiple choice, single answer - Numba
What does Numba do with Python functions?
D) Both B and C
Explanation: Numba uses LLVM to compile Python functions to machine code, enabling both speedup and parallelization.
Which decorator enables Numba JIT compilation?
B) @jit
Explanation: The @jit decorator from Numba enables just-in-time compilation of Python functions.
Multiple choice, single answer - SLURM & HPC Scheduling
What is SLURM?
B) A job scheduler and resource manager for HPC clusters
Explanation: SLURM (Simple Linux Utility for Resource Management) allocates and schedules jobs on HPC systems.
What information does an sbatch script typically specify?
A) Job name, number of tasks, time limit, and computation commands
Explanation: SBATCH scripts contain resource requests (#SBATCH directives) and the commands to execute.
What does the #SBATCH directive do in a submission script?
B) It specifies job parameters for the resource manager
Explanation: #SBATCH lines look like comments to the shell, which ignores them, but sbatch parses them to configure the job's resources.
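To show how directives and commands fit together in one file, here is a sketch that assembles a submission script as text. The helper `render_sbatch` and the script name `my_script.py` are hypothetical; only the `--job-name`, `--ntasks`, and `--time` options come from SLURM itself:

```python
def render_sbatch(job_name, ntasks, time_limit, commands):
    """Build the text of a submission script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        # Directives: comments to bash, configuration to sbatch
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
    ]
    # Commands: ordinary shell lines executed when the job starts
    lines.extend(commands)
    return "\n".join(lines) + "\n"


script = render_sbatch("python_job", 1, "00:10:00", ["python my_script.py"])
print(script)
```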
Multiple choice, single answer - Containerization
What is containerization used for in HPC?
B) Creating reproducible, isolated computing environments
Explanation: Containers package software and dependencies for consistent execution across systems.
What is Apptainer (formerly Singularity)?
B) A container platform for HPC environments
Explanation: Apptainer is designed specifically for HPC, allowing unprivileged containers on shared systems.
What is the advantage of using containers for reproducibility?
B) Same dependencies and environment can be deployed anywhere
Explanation: Containers ensure computational reproducibility by locking all dependencies and configurations.
Coding Challenges - Solutions
Python Essentials & Performance Measurement
Challenge 1: List vs NumPy Performance Comparison
import time
import numpy as np
# Python list approach
n = 1_000_000
lst = list(range(n))
start = time.time()
result_list = [x * 2 for x in lst]
time_list = time.time() - start
# NumPy approach
arr = np.arange(n)
start = time.time()
result_numpy = arr * 2
time_numpy = time.time() - start
speedup = time_list / time_numpy
print(f"List time: {time_list:.4f}s")
print(f"NumPy time: {time_numpy:.4f}s")
print(f"Speedup: {speedup:.1f}x")
Expected Result: NumPy is typically 10-100x faster depending on the operation.
Challenge 2: Requirements File Generator
def requirements_from_dict(packages):
    """
    Convert dictionary of packages to requirements.txt format.

    Args:
        packages (dict): {'package_name': 'version'} or {'package_name': None}

    Returns:
        str: Formatted requirements string
    """
    lines = []
    for package, version in packages.items():
        if version:
            lines.append(f"{package}=={version}")
        else:
            lines.append(package)
    return '\n'.join(lines)

# Example usage
packages = {
    'numpy': '1.24.0',
    'scipy': None,
    'matplotlib': '3.7.1'
}
print(requirements_from_dict(packages))
Expected Output:
numpy==1.24.0
scipy
matplotlib==3.7.1
Challenge 3: Simple Profiler
import time
from functools import wraps
class SimpleProfiler:
    def __init__(self):
        # Maps function name -> accumulated elapsed time in seconds
        self.times = {}

    def time_function(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            if func.__name__ not in self.times:
                self.times[func.__name__] = 0
            self.times[func.__name__] += elapsed
            return result
        return wrapper

    def report(self):
        # Slowest functions first
        sorted_times = sorted(self.times.items(), key=lambda x: x[1], reverse=True)
        print("Function Profiling Results:")
        for func_name, elapsed in sorted_times:
            print(f" {func_name}: {elapsed:.4f}s")

# Usage
profiler = SimpleProfiler()

@profiler.time_function
def slow_function():
    return sum([i**2 for i in range(1000000)])

@profiler.time_function
def fast_function():
    return sum(i**2 for i in range(100000))

slow_function()
fast_function()
profiler.report()
Cython Optimization
Challenge 3: Fibonacci with Cython
Python version (fibonacci.py):
def fib_python(n):
    if n <= 1:
        return n
    return fib_python(n-1) + fib_python(n-2)

import time
start = time.time()
result = fib_python(30)
print(f"Python: {result} in {time.time()-start:.2f}s")
Cython version (fibonacci.pyx):
def fib_cython(int n):
    cdef int a, b, i
    if n <= 1:
        return n
    a, b = 0, 1
    for i in range(2, n+1):
        a, b = b, a + b
    return b
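Expected speedup: 100-1000x over the recursive Python version. Note that the Cython version is iterative, so most of the gain comes from replacing exponential-time recursion with a linear-time loop; static typing with cdef adds a further constant-factor improvement. Because a and b are C ints, results overflow for n above 46 where int is 32 bits.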
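To separate the effect of compilation from the effect of the algorithm, one option is to time a pure-Python loop with the same structure as fib_cython. A minimal sketch (`fib_iterative` is a name introduced here, not part of the original solution):

```python
import time


def fib_iterative(n):
    """Iterative Fibonacci in pure Python, mirroring the Cython loop."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b


start = time.perf_counter()
result = fib_iterative(30)
elapsed = time.perf_counter() - start
print(f"Iterative Python: {result} in {elapsed:.6f}s")
```

Comparing fib_cython against this function, rather than against fib_python, isolates what static typing and compilation contribute.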
Dask Parallel Processing
Challenge 4: Dask DataFrame Aggregation
import dask.dataframe as dd
import pandas as pd
import numpy as np
# Create a small synthetic dataset in memory (1,000,000 rows), split into 10 partitions
dask_df = dd.from_pandas(
    pd.DataFrame({
        'group': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
        'value': np.random.randn(1000000)
    }), npartitions=10
)
# Compute aggregation
result = dask_df.groupby('group')['value'].mean().compute()
print(result)
Numba JIT Compilation
Challenge 5: Monte Carlo Pi Estimation
import numpy as np
from numba import jit
import time
# Regular Python
def monte_carlo_pi_python(n):
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x**2 + y**2 <= 1:
            inside += 1
    return 4 * inside / n

# Numba JIT
@jit(nopython=True)
def monte_carlo_pi_numba(n):
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x**2 + y**2 <= 1:
            inside += 1
    return 4 * inside / n
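# Benchmark
n = 10_000_000
start = time.time()
pi_python = monte_carlo_pi_python(n)
time_python = time.time() - start
# Warm-up call so that JIT compilation time is not included in the measurement
monte_carlo_pi_numba(1_000)
start = time.time()
pi_numba = monte_carlo_pi_numba(n)
time_numba = time.time() - start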
print(f"Python: π ≈ {pi_python:.4f} in {time_python:.2f}s")
print(f"Numba: π ≈ {pi_numba:.4f} in {time_numba:.4f}s")
print(f"Speedup: {time_python/time_numba:.1f}x")
Expected speedup: 50-200x for Numba vs Python.
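For comparison, the same estimate can also be vectorized with NumPy alone, at the cost of holding all samples in memory. This is a sketch of an alternative, not part of the original solution; the function name, the seed, and the sample count are illustrative:

```python
import math

import numpy as np


def monte_carlo_pi_numpy(n, seed=0):
    """Estimate pi by sampling n points in the unit square at once."""
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    y = rng.random(n)
    # Count the points that fall inside the quarter circle of radius 1
    inside = np.count_nonzero(x * x + y * y <= 1.0)
    return 4 * inside / n


pi_numpy = monte_carlo_pi_numpy(1_000_000)
print(f"NumPy: pi is approximately {pi_numpy:.4f} (error {abs(pi_numpy - math.pi):.4f})")
```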
SLURM Job Submission
Challenge 6: SBATCH Script Template
#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --output=job_%j.log
#SBATCH --error=job_%j.err
# Load modules
module load python/3.11
module load gcc/11.2
# Create and activate virtual environment if needed
python -m venv job_env
source job_env/bin/activate
pip install numpy scipy matplotlib
# Run Python script with timing
echo "Job started at: $(date)"
python -c "
import numpy as np
import time
start = time.time()
arr = np.random.randn(100000000)
result = np.sum(arr ** 2)
elapsed = time.time() - start
print(f'Computation time: {elapsed:.2f}s')
print(f'Result: {result:.2e}')
"
echo "Job ended at: $(date)"
To submit: sbatch script.sbatch
To check status: squeue -u $USER
To view output: cat job_JOBID.log