Quiz Solutions

Day 1: Python Fundamentals & Essentials

Multiple choice, single answer - Python Essentials

What is the main benefit of using virtual environments in Python?

  • B) They allow isolated project dependencies without conflicts

Explanation: Virtual environments create isolated Python environments where different projects can have different package versions without conflicts.

Which of the following is NOT a basic Python data structure?

  • C) Array (built-in)

Explanation: Python has lists, tuples, dictionaries, and sets as built-in data structures. Arrays are not a basic built-in type (though NumPy provides them).
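
As a quick illustration, the four built-in containers can be written as literals, while an array type has to be imported first (here from the standard-library array module; NumPy is the other common source). The variable names are made up for this sketch.

```python
from array import array  # not a built-in name: it must be imported

numbers = [1, 2, 3]              # list: ordered, mutable
point = (1.0, 2.0)               # tuple: ordered, immutable
ages = {"ada": 36, "alan": 41}   # dict: key -> value mapping
unique = {1, 2, 2, 3}            # set: duplicates are dropped

typed = array("i", [1, 2, 3])    # typed array of C ints from the stdlib

print(type(numbers).__name__, type(point).__name__,
      type(ages).__name__, type(unique).__name__)
print(len(unique))
```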

What is the purpose of the if __name__ == "__main__": statement?

  • D) Both A and C

Explanation: This idiom checks if the script is run directly vs. imported as a module, preventing module code from executing during imports.
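
A minimal sketch of the idiom, assuming a hypothetical file named greet.py: the function can be imported and reused, while the call to main() runs only when the file is executed directly.

```python
def greet(name):
    """Return a greeting; safe to import and reuse elsewhere."""
    return f"Hello, {name}!"


def main():
    print(greet("world"))


if __name__ == "__main__":
    # Runs only when the file is executed directly (python greet.py),
    # not when another module does `import greet`.
    main()
```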

How does string slicing work in Python?

  • B) s[start:end] excludes the end index

Explanation: Python’s slice notation uses [start:end) semantics where the end index is excluded.
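
The half-open convention can be checked directly. The example string is arbitrary.

```python
s = "benchmark"

print(s[0:5])              # characters at indices 0..4
print(s[5:])               # from index 5 to the end
print(s[:5] + s[5:] == s)  # the two halves rejoin without overlap
print(len(s[2:6]))         # slice length is end - start
```

Because the end index is excluded, s[:k] and s[k:] always split a string into two pieces with nothing lost or duplicated.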

Which keyword is used to create a function in Python?

  • B) def

Explanation: The def keyword defines a new function in Python.

Multiple choice, single answer - Virtual Environments & HPC Modules

What does a virtual environment provide?

  • B) A sandboxed directory for project-specific packages

Explanation: Virtual environments allow you to create isolated environments with separate package installations.

What is the purpose of the venv module?

  • B) It creates lightweight isolated Python environments

Explanation: The venv module is Python’s built-in tool for creating virtual environments.
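
The module can also be driven from Python itself. The sketch below creates an environment in a throwaway temporary directory (the name demo_env is an assumption for illustration) and checks for the pyvenv.cfg file that marks a virtual environment.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: a temp directory instead of a real project folder.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo_env"
    # with_pip=False keeps the sketch fast; use True for a usable environment.
    venv.create(env_dir, with_pip=False)
    has_config = (env_dir / "pyvenv.cfg").exists()
    print("pyvenv.cfg created:", has_config)
```

On the command line the usual equivalent is python -m venv demo_env, which also installs pip by default.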

What is an HPC module system used for?

  • B) Managing access to compilers, libraries, and software versions

Explanation: Module systems (like Lmod) manage the environment, making different software versions available without conflicts.

When you load an HPC module with module load gcc/11.2, what happens?

  • B) It adds the GCC compiler to your current shell environment

Explanation: Loading a module modifies environment variables (PATH, LD_LIBRARY_PATH, etc.) for the current session.
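
To make the mechanism concrete, the following sketch imitates the effect of a module load on a plain dictionary. It is not how Lmod is implemented, and the install prefix /opt/gcc/11.2 is a hypothetical path; real clusters choose their own layout.

```python
import os


def prepend_path(env, variable, directory):
    """Return a copy of env with directory placed first in a path-like variable."""
    updated = dict(env)
    current = updated.get(variable, "")
    updated[variable] = directory + (os.pathsep + current if current else "")
    return updated


# Hypothetical install prefix, for illustration only.
prefix = "/opt/gcc/11.2"
env = {"PATH": "/usr/bin"}
env = prepend_path(env, "PATH", prefix + "/bin")
env = prepend_path(env, "LD_LIBRARY_PATH", prefix + "/lib64")

print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

Because the new directory comes first, the shell finds that gcc before any system-wide version. The change lives only in the current session's environment.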

Multiple choice, single answer - Benchmarking & Profiling

What is the main purpose of benchmarking code?

  • B) To measure execution time and performance characteristics

Explanation: Benchmarking quantifies performance to identify bottlenecks and compare implementations.

Which Python module is commonly used for timing code execution?

  • C) timeit

Explanation: The timeit module is designed specifically for precise timing of small code snippets.
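
A short sketch of typical usage: timeit.repeat runs a statement many times and returns one total per repetition. The two statements compared here are illustrative choices.

```python
import timeit

# Two ways to build the same list.
loop_times = timeit.repeat(
    "result = []\nfor i in range(1000):\n    result.append(i * 2)",
    number=1000, repeat=5,
)
comp_times = timeit.repeat(
    "result = [i * 2 for i in range(1000)]",
    number=1000, repeat=5,
)

# The minimum is the usual summary: it is least affected by background load.
print(f"loop:          {min(loop_times):.4f}s per 1000 runs")
print(f"comprehension: {min(comp_times):.4f}s per 1000 runs")
```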

What does profiling reveal about your code?

  • B) Which functions consume the most CPU and memory

Explanation: Profiling tools like cProfile show where your code spends time and resources.
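
As a minimal sketch, cProfile can be wrapped around any call and its report sorted by cumulative time. The functions work and slow_square_sum are made up for this example. Note that cProfile reports time only; memory usage needs a separate tool such as the standard-library tracemalloc.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    return sum(i * i for i in range(n))


def work():
    return [slow_square_sum(50_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the report in a string so it can be printed or saved.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```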

What is the difference between wall-clock time and CPU time?

  • A) Wall-clock time includes I/O waits; CPU time is actual computation

Explanation: Wall-clock time is real elapsed time; CPU time excludes I/O and system waiting periods.
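
The difference shows up as soon as a program waits. In this sketch, time.perf_counter measures wall-clock time and time.process_time measures CPU time of the current process; the sleep stands in for an I/O wait.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time used by this process

time.sleep(0.2)                              # waiting: uses almost no CPU
total = sum(i * i for i in range(200_000))   # computing: uses CPU

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall-clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

The wall-clock figure includes the sleep, while the CPU figure covers little more than the summation.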

Multiple choice, single answer - NumPy

Why are NumPy arrays more efficient than Python lists for numerical computations?

  • B) They store homogeneous data in contiguous memory blocks

Explanation: NumPy arrays use contiguous memory and fixed types, enabling vectorized operations and cache efficiency.

What does vectorization mean in NumPy?

  • A) Replacing loops with whole-array operations

Explanation: Vectorization uses built-in NumPy operations on entire arrays rather than Python loops, improving performance.

What is the shape of a 3×4×2 NumPy array?

  • C) (3, 4, 2)

Explanation: Shape is a tuple giving the size along each axis: 3 along axis 0, 4 along axis 1, and 2 along axis 2. Total elements = 3×4×2 = 24.
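
The shape, the number of axes, and the element count can all be read off the array. The array of zeros is just a placeholder.

```python
import numpy as np

arr = np.zeros((3, 4, 2))

print(arr.shape)     # size along each axis
print(arr.ndim)      # number of axes
print(arr.size)      # total element count, the product of the shape
print(arr[0].shape)  # indexing the first axis leaves a 4x2 block
```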

Which operation is most efficient in NumPy?

  • B) result = arr ** 2 (vectorized)

Explanation: Vectorized operations are implemented in compiled C code and much faster than Python loops.

Day 2: High-Performance Computing Techniques

Multiple choice, single answer - Cython

What is Cython primarily used for?

  • B) Compiling Python code to C for performance improvement

Explanation: Cython translates Python code to C, achieving significant speedups and allowing C-level optimizations.

What type hint syntax does Cython use to optimize code?

  • B) C-style type declarations like cdef int x

Explanation: Cython uses cdef for C-style variable declarations to enable static typing and optimization.

Multiple choice, single answer - Dask

What is Dask designed for?

  • B) Parallel computing with out-of-core data processing

Explanation: Dask extends NumPy/Pandas syntax for parallel and larger-than-memory computations.

How does Dask differ from NumPy?

  • B) Dask handles larger-than-memory datasets with lazy evaluation

Explanation: Dask uses lazy evaluation and distributed computing for datasets that don’t fit in RAM.

What is lazy evaluation in Dask?

  • B) Computations are deferred until explicitly requested

Explanation: Dask builds a task graph but only executes computation when .compute() is called.

Multiple choice, single answer - Numba

What does Numba do with Python functions?

  • D) Both B and C

Explanation: Numba uses LLVM to compile Python functions to machine code, enabling both speedup and parallelization.

Which decorator enables Numba JIT compilation?

  • B) @jit

Explanation: The @jit decorator from Numba enables just-in-time compilation of Python functions.

Multiple choice, single answer - SLURM & HPC Scheduling

What is SLURM?

  • B) A job scheduler and resource manager for HPC clusters

Explanation: SLURM (Simple Linux Utility for Resource Management) allocates and schedules jobs on HPC systems.

What information does an sbatch script typically specify?

  • A) Job name, number of tasks, time limit, and computation commands

Explanation: SBATCH scripts contain resource requests (#SBATCH directives) and the commands to execute.

What does the #SBATCH directive do in a submission script?

  • B) It specifies job parameters for the resource manager

Explanation: #SBATCH lines look like ordinary comments to the shell, but sbatch parses them to configure job resources. They must appear before the first executable command in the script.

Multiple choice, single answer - Containerization

What is containerization used for in HPC?

  • B) Creating reproducible, isolated computing environments

Explanation: Containers package software and dependencies for consistent execution across systems.

What is Apptainer (formerly Singularity)?

  • B) A container platform for HPC environments

Explanation: Apptainer is designed specifically for HPC, allowing unprivileged containers on shared systems.

What is the advantage of using containers for reproducibility?

  • B) Same dependencies and environment can be deployed anywhere

Explanation: Containers ensure computational reproducibility by locking all dependencies and configurations.

Coding Challenges - Solutions

Python Essentials & Performance Measurement

Challenge 1: List vs NumPy Performance Comparison

import time
import numpy as np

# Python list approach
n = 1_000_000
lst = list(range(n))
start = time.time()
result_list = [x * 2 for x in lst]
time_list = time.time() - start

# NumPy approach
arr = np.arange(n)
start = time.time()
result_numpy = arr * 2
time_numpy = time.time() - start

speedup = time_list / time_numpy
print(f"List time: {time_list:.4f}s")
print(f"NumPy time: {time_numpy:.4f}s")
print(f"Speedup: {speedup:.1f}x")

Expected Result: NumPy is typically 10-100x faster depending on the operation.

Challenge 2: Requirements File Generator

def requirements_from_dict(packages):
    """
    Convert dictionary of packages to requirements.txt format.
    
    Args:
        packages (dict): {'package_name': 'version'} or {'package_name': None}
    
    Returns:
        str: Formatted requirements string
    """
    lines = []
    for package, version in packages.items():
        if version:
            lines.append(f"{package}=={version}")
        else:
            lines.append(package)
    return '\n'.join(lines)

# Example usage
packages = {
    'numpy': '1.24.0',
    'scipy': None,
    'matplotlib': '3.7.1'
}
print(requirements_from_dict(packages))

Expected Output:

numpy==1.24.0
scipy
matplotlib==3.7.1

Challenge 3: Simple Profiler

import time
from functools import wraps

class SimpleProfiler:
    def __init__(self):
        self.times = {}
    
    def time_function(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            
            if func.__name__ not in self.times:
                self.times[func.__name__] = 0
            self.times[func.__name__] += elapsed
            return result
        return wrapper
    
    def report(self):
        sorted_times = sorted(self.times.items(), key=lambda x: x[1], reverse=True)
        print("Function Profiling Results:")
        for func_name, elapsed in sorted_times:
            print(f"  {func_name}: {elapsed:.4f}s")

# Usage
profiler = SimpleProfiler()

@profiler.time_function
def slow_function():
    return sum([i**2 for i in range(1000000)])

@profiler.time_function
def fast_function():
    return sum(i**2 for i in range(100000))

slow_function()
fast_function()
profiler.report()

Cython Optimization

Challenge 3: Fibonacci with Cython

Python version (fibonacci.py):

def fib_python(n):
    if n <= 1:
        return n
    return fib_python(n-1) + fib_python(n-2)

import time
start = time.time()
result = fib_python(30)
print(f"Python: {result} in {time.time()-start:.2f}s")

Cython version (fibonacci.pyx):

def fib_cython(int n):
    cdef int a, b, i
    if n <= 1:
        return n
    a, b = 0, 1
    for i in range(2, n+1):
        a, b = b, a + b
    return b

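Expected speedup: 100-1000x or more over the recursive Python version. Note that fib_cython is iterative, so much of the gain comes from the algorithm change (linear instead of exponential time) rather than from static typing alone. With a 32-bit C int, the result overflows for n > 46.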
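
Since the Cython solution uses a loop while the Python baseline uses recursion, a plain-Python loop is a useful second baseline. The sketch below (function names are chosen for this example) shows how much of the difference is due to the algorithm alone.

```python
import time


def fib_recursive(n):
    if n <= 1:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Same loop as the Cython version, but in plain Python."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b


for func in (fib_recursive, fib_iterative):
    start = time.perf_counter()
    value = func(25)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}(25) = {value} in {elapsed:.6f}s")
```

Comparing fib_iterative against the compiled fib_cython then isolates the effect of compilation and C types.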

Dask Parallel Processing

Challenge 4: Dask DataFrame Aggregation

import dask.dataframe as dd
import pandas as pd
import numpy as np

# Create a synthetic dataset (1 million rows, built in memory)
# and split it into 10 partitions
dask_df = dd.from_pandas(
    pd.DataFrame({
        'group': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
        'value': np.random.randn(1000000)
    }), npartitions=10
)

# Compute aggregation
result = dask_df.groupby('group')['value'].mean().compute()
print(result)

Numba JIT Compilation

Challenge 5: Monte Carlo Pi Estimation

import numpy as np
from numba import jit
import time

# Regular Python
def monte_carlo_pi_python(n):
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x**2 + y**2 <= 1:
            inside += 1
    return 4 * inside / n

# Numba JIT
@jit(nopython=True)
def monte_carlo_pi_numba(n):
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x**2 + y**2 <= 1:
            inside += 1
    return 4 * inside / n

# Benchmark
n = 10_000_000
start = time.time()
pi_python = monte_carlo_pi_python(n)
time_python = time.time() - start

# Warm-up call: triggers JIT compilation so it is not included in the timing
monte_carlo_pi_numba(10)

start = time.time()
pi_numba = monte_carlo_pi_numba(n)
time_numba = time.time() - start

print(f"Python: π ≈ {pi_python:.4f} in {time_python:.2f}s")
print(f"Numba:  π ≈ {pi_numba:.4f} in {time_numba:.4f}s")
print(f"Speedup: {time_python/time_numba:.1f}x")

Expected speedup: 50-200x for Numba vs Python.

SLURM Job Submission

Challenge 6: SBATCH Script Template

#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --output=job_%j.log
#SBATCH --error=job_%j.err

# Load modules
module load python/3.11
module load gcc/11.2

# Create the virtual environment only if it does not exist yet, then activate it
if [ ! -d job_env ]; then
    python -m venv job_env
    job_env/bin/pip install numpy scipy matplotlib
fi
source job_env/bin/activate

# Run Python script with timing
echo "Job started at: $(date)"
python -c "
import numpy as np
import time

start = time.time()
arr = np.random.randn(100000000)
result = np.sum(arr ** 2)
elapsed = time.time() - start

print(f'Computation time: {elapsed:.2f}s')
print(f'Result: {result:.2e}')
"
echo "Job ended at: $(date)"

To submit: sbatch script.sbatch
To check status: squeue -u $USER
To view output: cat job_JOBID.log