InterviewQAs

Python File Handling

Download as PDF
All questions in this page are included
Preparing…
Download PDF
PFH
Python File Handling

Python file handling becomes significantly more important once applications move beyond toy examples and start processing logs, reports, exports, uploads, or transactional data. Real-world systems often deal with partial writes, encoding inconsistencies, concurrent access, and large files that cannot be fully loaded into memory safely.

Experienced developers typically focus less on simply opening and closing files and more on reliability concerns such as atomic writes, buffering strategies, exception handling, file locking, and cross-platform compatibility. Small mistakes in file operations can quietly corrupt data or create difficult-to-debug production failures.

Modern Python applications frequently combine file handling with APIs, ETL pipelines, cloud synchronization, audit logging, and streaming data ingestion. In these scenarios, understanding how context managers, binary mode, temporary files, generators, and chunk-based processing work together becomes essential for building scalable solutions.

Encoding management is another area where practical expertise matters. Production files often arrive from multiple operating systems, legacy enterprise tools, or external vendors using inconsistent encodings and line endings. Robust implementations anticipate these variations instead of assuming UTF-8 text with clean formatting.

Efficient file handling also directly affects application performance. Reading multi-gigabyte files line by line, minimizing disk I/O, preventing descriptor leaks, and choosing the right serialization format can dramatically reduce memory usage and improve throughput in data-intensive systems.

Question 01

Why is using a context manager with file operations considered safer in production systems compared to manually opening and closing files?

MEDIUM

Using a context manager through the "with" statement guarantees that file resources are released correctly even if an exception occurs midway through processing. In production systems, unexpected failures such as malformed input, permission issues, or interrupted writes are common. Without automatic cleanup, applications can leave file descriptors open, eventually exhausting operating system limits.

A common issue in long-running ETL or logging services is descriptor leakage caused by forgotten close() calls. This may not appear during testing because the process lifetime is short, but under sustained traffic the application can fail with "Too many open files" errors. Context managers eliminate that entire category of mistakes.

Another advantage is readability and maintainability. When developers see a "with open(...)" block, they immediately understand the file scope and lifecycle. That reduces ambiguity during code reviews and makes nested operations involving temporary files, compressed streams, or network-mounted storage easier to reason about.

Question 02

Which file mode is most appropriate when appending new log entries to an existing UTF-8 text log file without overwriting previous content?

EASY
  • A open('app.log', 'w')
  • B open('app.log', 'a', encoding='utf-8')
  • C open('app.log', 'r+')
  • D open('app.log', 'x')

Append mode adds new content to the end of the file while preserving existing data. This is the standard approach for application logs, audit trails, and incremental exports where historical information must remain intact.

Using write mode ('w') would truncate the file immediately, destroying existing log history. Exclusive creation mode ('x') fails if the file already exists, while 'r+' requires careful pointer management and is unnecessarily risky for straightforward append-only logging scenarios.

Question 03

Write a Python program that reads a very large text file efficiently and counts how many lines contain the word 'ERROR' without loading the entire file into memory.

MEDIUM

This implementation processes the file line by line instead of calling read() or readlines(), both of which could consume excessive memory when working with multi-gigabyte log files. Python file objects are iterable, making streaming-style processing efficient and straightforward.

The errors='ignore' parameter is practical in enterprise environments where log files may contain malformed characters from external systems or mixed encodings. In monitoring pipelines and operational tooling, partial processing is often preferable to failing the entire job because of one corrupted line.

# Python
from pathlib import Path


def count_error_lines(file_path):
    error_count = 0

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        for line_number, line in enumerate(file, start=1):
            if 'ERROR' in line:
                error_count += 1

    return error_count


log_file = Path('application.log')
count = count_error_lines(log_file)

print(f'Total ERROR lines: {count}')
Question 04

What problems can occur when multiple processes write to the same file simultaneously, and how can Python applications reduce the risk of corruption?

HARD

Concurrent writes can produce interleaved or partially overwritten data because file operations are not automatically atomic across processes. For example, two services writing JSON lines into the same audit file may overlap writes, resulting in malformed records that downstream systems cannot parse.

Another issue involves race conditions around file rotation or truncation. One process may attempt to write while another renames or archives the file, causing lost records or inconsistent state. These failures are difficult to reproduce because they depend heavily on timing and operating system behavior.

Python applications typically reduce these risks using file locks, append-only strategies, transactional temp-file replacement patterns, or centralized logging systems. On Linux, advisory locks through modules such as fcntl are common, while distributed environments often avoid shared-file writes entirely and instead stream events into databases, queues, or log aggregation platforms.

Question 05

Which statements about binary file handling in Python are correct?

MEDIUM
  • A Binary mode avoids automatic newline translation.
  • B Binary mode is required for working with images or PDFs.
  • C Binary mode automatically converts bytes into strings.
  • D Binary reads always return Unicode text.

Binary mode works directly with raw bytes and avoids text-processing behavior such as newline conversion and character decoding. This is critical for non-text formats like images, ZIP archives, PDFs, and encrypted payloads where even small byte modifications can corrupt the file.

A common production mistake is opening binary content in text mode, especially when handling uploads or API attachments. Developers may not notice issues immediately because small files sometimes appear readable, but checksum mismatches and corruption typically surface later during validation or downstream processing.

Question 06

Write a Python script that safely writes user report data into a file using a temporary file replacement strategy.

EASY

This pattern minimizes the risk of partially written files. Instead of writing directly to the target file, the application writes to a temporary file first and then atomically replaces the destination. If the process crashes midway, the original file remains untouched.

Atomic replacement strategies are heavily used in configuration management systems, backup utilities, and scheduled reporting jobs. They help maintain consistency when applications are interrupted by server restarts, deployment rollbacks, or infrastructure failures.

# Python
import os
import tempfile


def safe_write_report(final_path, content):
    directory = os.path.dirname(final_path) or '.'

    with tempfile.NamedTemporaryFile(
        mode='w',
        encoding='utf-8',
        dir=directory,
        delete=False
    ) as temp_file:
        temp_file.write(content)
        temp_name = temp_file.name

    os.replace(temp_name, final_path)


report_content = 'Daily revenue report generated successfully.\n'
safe_write_report('daily_report.txt', report_content)

print('Report written safely.')
Question 07

How should Python applications handle encoding inconsistencies when processing files from multiple external systems?

MEDIUM

Applications that ingest files from vendors, legacy software, or regional systems should never assume consistent encoding. Although UTF-8 has become the dominant standard, many enterprise exports still use encodings such as ISO-8859-1, Windows-1252, or UTF-16. Failing to account for this often results in UnicodeDecodeError exceptions during production processing.

A practical approach is to standardize incoming files into a canonical internal encoding such as UTF-8 while logging encoding anomalies for operational visibility. Developers also commonly use fallback decoding strategies or libraries that detect probable encodings when source systems are unreliable.

Another important detail is preserving data integrity during error handling. Silently dropping characters may hide corruption, especially in financial or healthcare systems. Many teams therefore prefer explicit replacement markers, quarantine workflows, or validation reports instead of ignoring decoding issues entirely.

Question 08

Which practices improve performance and reliability when processing multi-gigabyte files in Python?

HARD
  • A Process data in chunks instead of loading entire files into memory.
  • B Use generators or iterators for streaming workflows.
  • C Always call readlines() for better performance.
  • D Use buffered writes for repeated output operations.

Large-scale file processing is primarily about controlling memory pressure and reducing unnecessary I/O operations. Chunk-based reading and iterator-driven pipelines allow applications to process datasets much larger than available RAM while maintaining predictable resource usage.

Buffered writes also improve throughput because the operating system can batch disk operations more efficiently. In contrast, readlines() can become dangerous with very large files because it attempts to load all lines into memory at once, potentially causing memory exhaustion or container restarts in production environments.

Question 09

Write a Python program that copies a large binary file in chunks while displaying progress information.

HARD

The program reads and writes the file incrementally using fixed-size chunks. This prevents excessive memory usage and allows the application to handle very large binary files such as virtual machine images, backups, or video archives efficiently.

Chunk-based copying is common in infrastructure tooling, cloud migration utilities, and backup platforms. The progress output is especially useful for operational visibility because large transfers may run for several minutes or hours depending on storage performance and network latency.

# Python
from pathlib import Path


def copy_large_file(source_path, destination_path, chunk_size=1024 * 1024):
    source = Path(source_path)
    destination = Path(destination_path)

    total_size = source.stat().st_size
    copied = 0

    with open(source, 'rb') as src, open(destination, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)

            if not chunk:
                break

            dst.write(chunk)
            copied += len(chunk)

            progress = (copied / total_size) * 100
            print(f'Copied: {progress:.2f}%')

    print('File copy completed.')


copy_large_file('backup.iso', 'backup_copy.iso')
Question 10

Write a Python script that monitors a log file and continuously prints new lines as they are appended, similar to the Unix tail -f command.

HARD

The script moves the file pointer to the end of the file using seek(0, 2), then continuously checks for newly appended content. This approach is lightweight and commonly used for real-time monitoring tools, operational dashboards, and debugging utilities.

In production systems, variations of this pattern are often integrated with alerting pipelines, centralized logging systems, or security monitoring workflows. One practical consideration is handling log rotation, where the original file may be renamed and recreated by external processes.

# Python
import time


def follow_log(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        file.seek(0, 2)

        while True:
            line = file.readline()

            if not line:
                time.sleep(0.5)
                continue

            print(line.strip())


follow_log('server.log')
Question 11

What is the difference between text mode and binary mode when opening files in Python?

EASY

Text mode interprets file content as strings, automatically handling newline characters and encoding. Binary mode treats the file content as raw bytes without conversion or newline translation.

Text mode is suitable for files with readable characters, like logs or CSVs, while binary mode is essential for files like images, PDFs, or encrypted data where exact byte representation is crucial.

Question 12

Which methods are safe for appending lines to a file without overwriting existing content?

MEDIUM
  • A file.write() in 'a' mode
  • B file.writelines() in 'w' mode
  • C file.write() in 'r+' mode after seeking to the end
  • D file.write() in 'x' mode

Append mode ('a') automatically positions the file pointer at the end, preventing overwrites.

In 'r+' mode, manually seeking to the end before writing achieves the same append effect. 'w' mode truncates the file, and 'x' mode fails if the file exists.

Question 13

Write a Python program to read a CSV file and sum the values of a specific numeric column efficiently.

MEDIUM

The code uses DictReader for readability and handles large files line by line to conserve memory.

Exceptions handle missing or malformed data gracefully, which is common when processing real-world CSV files from multiple sources.

# Python
import csv


def sum_column(file_path, column_name):
    total = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                total += float(row[column_name])
            except (ValueError, KeyError):
                continue
    return total

print(sum_column('sales.csv', 'Revenue'))
Question 14

Explain the role of file buffering in Python and when you might want to disable it.

HARD

Buffering temporarily stores file data in memory before writing to disk or after reading from disk. It improves performance by reducing the number of system calls for I/O.

In real-time applications, logs, or streaming data, you may disable buffering (buffering=0 or 1) to ensure immediate disk writes. Conversely, in batch processing, buffering improves throughput and reduces CPU overhead.

Question 15

Which Python techniques help ensure atomic file writes?

HARD
  • A Writing to a temporary file and renaming it
  • B Using file.flush() after each write
  • C Using os.replace() to replace files atomically
  • D Opening the file in 'r' mode

Atomic operations prevent partially written files in case of crashes. Writing to a temp file and renaming, or using os.replace(), ensures either the full new content exists or the old content remains intact.

File flush improves in-memory write visibility but does not protect against crashes, and 'r' mode cannot write.

Question 16

Write Python code to read a large text file in reverse order, line by line.

MEDIUM

This approach reads the file backward to avoid loading the entire content into memory, suitable for large logs where the most recent lines are processed first.

Byte-level operations ensure correctness regardless of line length and help maintain performance for multi-gigabyte files.

# Python
import os


def read_file_reverse(file_path):
    with open(file_path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = bytearray()

        while position >= 0:
            f.seek(position)
            char = f.read(1)
            if char == b'\n' and buffer:
                print(buffer[::-1].decode('utf-8', errors='ignore'))
                buffer = bytearray()
            else:
                buffer.append(char[0])
            position -= 1
        if buffer:
            print(buffer[::-1].decode('utf-8', errors='ignore'))
Question 17

When reading and writing JSON files in Python, which considerations are important?

MEDIUM
  • A Always open in text mode with correct encoding
  • B Use json.load() and json.dump() for parsing and writing
  • C Binary mode is required for JSON
  • D Consider atomic writes to avoid partial file corruption

JSON is a text format, so files should be opened in text mode. Incorrect encoding can result in Unicode errors.

Using json.load() and json.dump() ensures valid parsing and writing, while atomic writes prevent partially written JSON files in case of process interruption.

Question 18

What is the purpose of the 'seek' and 'tell' methods in Python file objects?

EASY

The seek() method moves the file pointer to a specific byte offset within the file, allowing non-sequential reads or writes.

The tell() method returns the current file pointer position. Together, they help implement features like resuming reading, random access to files, or efficiently skipping large sections of data.

Question 19

Write a Python function that monitors a directory for new files and prints the filename when added.

HARD

The script continuously scans the directory and identifies new files by comparing the current set with the previously seen set. This is a simple polling mechanism suitable for monitoring log directories or upload folders.

For production systems, this can be enhanced with OS-specific watchers (like inotify on Linux) for better performance and lower CPU usage.

# Python
import time
from pathlib import Path


def watch_directory(directory_path):
    seen = set(Path(directory_path).iterdir())
    while True:
        current = set(Path(directory_path).iterdir())
        new_files = current - seen
        for file in new_files:
            print(f'New file detected: {file.name}')
        seen = current
        time.sleep(1)

watch_directory('./logs')
Question 20

Write Python code to safely read a gzip-compressed text file line by line.

MEDIUM

gzip.open() in text mode ('rt') allows reading compressed files line by line, preserving memory efficiency.

This approach is useful in log processing, analytics pipelines, and ETL systems where compressed archives must be read without decompressing entire files to disk.

# Python
import gzip


def read_gzip(file_path):
    with gzip.open(file_path, 'rt', encoding='utf-8', errors='ignore') as f:
        for line in f:
            print(line.strip())

read_gzip('logs.gz')
Question 21

Why is pathlib considered more maintainable than using raw string paths with the os module in modern Python applications?

MEDIUM

The pathlib module provides an object-oriented interface for filesystem paths, making file-related code easier to read and less error-prone. Instead of manually concatenating path strings with slashes or backslashes, developers work with Path objects that automatically handle platform-specific separators.

In larger applications, pathlib improves maintainability because path operations become more expressive. Methods such as exists(), suffix, mkdir(), and glob() make the intent of the code clearer during reviews and debugging.

Another practical advantage is cross-platform compatibility. Developers working across Linux, macOS, and Windows environments avoid many path-formatting issues because pathlib abstracts operating system differences behind a consistent API.

Question 22

Which operations are commonly used to reduce memory consumption when processing very large files?

MEDIUM
  • A Iterating over the file object directly
  • B Using read() to load the entire file immediately
  • C Processing fixed-size chunks
  • D Using generators for transformation pipelines

Efficient file processing avoids loading entire datasets into memory unnecessarily. Iterating over files line by line or reading fixed-size chunks keeps memory usage predictable even for multi-gigabyte files.

Generators are especially useful in ETL pipelines because they allow records to flow through transformations incrementally. In contrast, read() can become dangerous for very large files because it allocates memory for the complete file contents at once.

Question 23

Write a Python script that merges multiple text log files into a single output file while preserving line order within each file.

MEDIUM

The implementation streams lines from each input file directly into the destination file without loading entire logs into memory. This approach scales efficiently when processing operational logs or exported reports.

The errors='ignore' parameter helps prevent processing failures caused by malformed characters from external systems. In production environments, merged logs are often used for centralized troubleshooting or audit analysis.

# Python
from pathlib import Path


def merge_logs(input_files, output_file):
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for file_name in input_files:
            with open(file_name, 'r', encoding='utf-8', errors='ignore') as infile:
                for line in infile:
                    outfile.write(line)


log_files = ['app1.log', 'app2.log', 'app3.log']
merge_logs(log_files, 'merged.log')

print('Logs merged successfully.')
Question 24

What risks are associated with writing directly to configuration files in production systems?

HARD

Writing directly to configuration files introduces the risk of partial writes if the process crashes, the disk fills up, or the server restarts during the operation. A partially written configuration file may render an application unable to start correctly after deployment.

Another issue involves concurrent modification. If multiple services or deployment scripts update the same configuration file simultaneously, changes may overwrite each other or produce invalid content structures.

Production systems typically mitigate these risks using temporary-file replacement patterns, backup copies, schema validation, and versioned configuration management. Many teams also validate generated configuration files before replacing active ones to avoid introducing startup failures.

Question 25

Which statement about the flush() method in Python file handling is correct?

EASY
  • A flush() permanently closes the file
  • B flush() forces buffered data to be written to the operating system
  • C flush() deletes file content after writing
  • D flush() automatically changes the file mode to append

The flush() method pushes buffered data from the Python process to the operating system layer so that recently written content becomes visible sooner.

This is useful in long-running processes such as logging systems or streaming applications where waiting for automatic buffer flushing may delay visibility of critical events.

Question 26

Write a Python function that splits a very large text file into smaller files containing a fixed number of lines each.

HARD

The function processes the source file incrementally, ensuring memory usage remains low regardless of the original file size. Each output file contains a fixed number of lines, making large datasets easier to archive or distribute.

This pattern is commonly used in ETL pipelines, batch processing systems, and cloud uploads where downstream systems enforce file size or record count limits.

# Python
from pathlib import Path


def split_file(source_file, lines_per_file=1000):
    file_count = 1
    current_line = 0
    output = None

    with open(source_file, 'r', encoding='utf-8', errors='ignore') as infile:
        for line in infile:
            if current_line % lines_per_file == 0:
                if output:
                    output.close()

                output_name = f'split_{file_count}.txt'
                output = open(output_name, 'w', encoding='utf-8')
                file_count += 1

            output.write(line)
            current_line += 1

    if output:
        output.close()


split_file('large_dataset.txt', 5000)
print('File split completed.')
Question 27

Which situations typically require opening a file in binary mode?

HARD
  • A Working with image files
  • B Reading encrypted payloads
  • C Processing plain UTF-8 configuration files
  • D Handling compressed archives

Binary mode preserves the exact byte structure of a file without newline translation or text decoding. This is essential for images, archives, encrypted payloads, and other non-text formats.

Opening binary content in text mode can silently alter bytes or trigger decoding errors, which may corrupt files or invalidate checksums during transfers and backups.

Question 28

How do temporary files improve reliability in data-processing applications?

MEDIUM

Temporary files provide a safe intermediate workspace during operations such as report generation, file transformation, and data exports. Instead of modifying the original file directly, the application writes the updated content separately and replaces the original only after the operation succeeds.

This strategy reduces the chance of leaving corrupted or incomplete files behind if a process crashes midway. It also simplifies rollback and troubleshooting because the original content remains untouched until validation completes.

Temporary-file workflows are heavily used in deployment systems, backup utilities, and ETL pipelines where reliability matters more than raw write speed. They are particularly important when multiple systems depend on the same shared files.

Question 29

Write Python code that calculates the size of every file in a directory and prints the results in descending order.

MEDIUM

The script uses pathlib to iterate through directory contents and collect file sizes using stat(). Sorting the results in descending order helps identify unusually large files quickly.

Operational teams commonly use similar utilities to troubleshoot disk usage issues, monitor log growth, and locate oversized export files consuming storage unexpectedly.

# Python
from pathlib import Path


def list_file_sizes(directory):
    files = []

    for path in Path(directory).iterdir():
        if path.is_file():
            files.append((path.name, path.stat().st_size))

    files.sort(key=lambda item: item[1], reverse=True)

    for name, size in files:
        print(f'{name}: {size} bytes')


list_file_sizes('./data')
Question 30

Write a Python script that continuously backs up a text file whenever its content changes.

HARD

The script monitors the file modification timestamp and creates an updated backup whenever the source file changes. Using copy2() preserves metadata such as timestamps and permissions.

This lightweight monitoring pattern is useful for protecting locally edited documents, configuration files, or generated reports in environments where full backup infrastructure is unavailable.

# Python
import shutil
import time
from pathlib import Path


def monitor_and_backup(source_file, backup_file):
    source = Path(source_file)
    previous_modified = None

    while True:
        current_modified = source.stat().st_mtime

        if previous_modified is None:
            previous_modified = current_modified

        if current_modified != previous_modified:
            shutil.copy2(source, backup_file)
            previous_modified = current_modified
            print('Backup updated.')

        time.sleep(2)


monitor_and_backup('notes.txt', 'notes_backup.txt')