Last Updated: March 13, 2026
AI systems constantly interact with data. Whether you are loading datasets, processing logs, storing model outputs, or reading configuration files, the ability to work effectively with files and structured data is essential.
Python provides simple and powerful tools for reading and writing files, as well as handling common data formats such as JSON, CSV, and text files. These capabilities form the foundation of many AI workflows, including data preprocessing, experiment tracking, and model pipelines.
In this chapter, you will learn how to read, write, and manipulate files in Python, along with practical techniques for working with common data formats.
If you have been writing Python for a while, you might be used to os.path.join(), os.path.exists(), and friends. They work, but they are clunky: every path operation requires importing os and calling a function that takes a string, and the result is just another string, not an object you can work with.
pathlib, introduced in Python 3.4, replaces all of that with Path objects. Paths become first-class citizens with methods and operators instead of bare strings passed through utility functions.
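A short sketch of the difference (the file names here are illustrative):

```python
from pathlib import Path

# Build a path with the / operator instead of os.path.join()
data_dir = Path("data")
doc_path = data_dir / "documents" / "report.pdf"

print(doc_path)  # data/documents/report.pdf (backslashes on Windows)
```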
That / operator is not division. pathlib overloads it for path joining, which reads much more naturally than os.path.join("data", "documents"). It also works across operating systems, so you do not need to worry about backslashes on Windows.
Every Path object gives you easy access to parts of the path:
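For example, with an illustrative path:

```python
from pathlib import Path

p = Path("data/documents/report.pdf")

print(p.name)    # report.pdf
print(p.stem)    # report
print(p.suffix)  # .pdf
print(p.parent)  # data/documents (data\documents on Windows)
```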
These are properties, not method calls, so there are no parentheses. You can use .stem and .suffix when processing batches of files, for example, to convert report.pdf into report.txt after extraction.
For small files, pathlib gives you one-liner convenience methods to work with files:
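A minimal sketch, using a throwaway temp directory so it is self-contained:

```python
from pathlib import Path
import tempfile

tmp = Path(tempfile.mkdtemp())
note = tmp / "note.txt"

note.write_text("hello from pathlib", encoding="utf-8")  # open + write + close
text = note.read_text(encoding="utf-8")                  # open + read + close
print(text)  # hello from pathlib
```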
These methods handle opening and closing the file for you. No need for open(), no need for with statements. For quick reads and writes of small files, this is the cleanest approach.
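Creating directories follows the same object style; a short sketch using a throwaway temp directory:

```python
from pathlib import Path
import tempfile

base = Path(tempfile.mkdtemp())
out_dir = base / "outputs" / "run_01"

# parents=True creates intermediate directories; exist_ok=True tolerates reruns
out_dir.mkdir(parents=True, exist_ok=True)
print(out_dir.exists())  # True
```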
The parents=True flag is the equivalent of mkdir -p in the shell. Without it, Python raises an error if the parent directory does not exist. The exist_ok=True flag prevents errors if the directory is already there.
When you need to find files matching a pattern, pathlib has .glob() and .rglob() built in:
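A self-contained sketch with a small throwaway directory tree:

```python
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.pdf").write_text("x")
(root / "sub" / "b.pdf").write_text("y")

top_level = list(root.glob("*.pdf"))    # only a.pdf
everywhere = list(root.rglob("*.pdf"))  # a.pdf and sub/b.pdf
print(len(top_level), len(everywhere))  # 1 2
```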
The difference between .glob() and .rglob() is that .rglob() searches subdirectories recursively. This is the method you will reach for when building document loaders that need to ingest every PDF in a nested folder structure.
pathlib's .read_text() and .write_text() are great for small files, but when you need more control, like reading line by line, appending to a file, or working with binary data, you need open(). And whenever you use open(), you should use a context manager.
Here is the naive approach to reading a file:
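A sketch of that pattern (the file is a throwaway temp file so the example runs on its own):

```python
from pathlib import Path
import tempfile

path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_text("some text")

# Fragile: if read() raises, close() never runs and the file handle leaks
f = open(path)
text = f.read()
f.close()
print(text)  # some text
```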
This works until it does not. If f.read() raises an exception, f.close() never runs. The file handle leaks. Do that enough times in a long-running process, and you hit the operating system's limit on open file descriptors, which crashes your program with a cryptic error.
The with statement fixes this by guaranteeing cleanup:
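The same read, rewritten with a context manager (again using a throwaway temp file):

```python
from pathlib import Path
import tempfile

path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_text("some text")

# The file is closed when the block exits, even if an exception is raised
with open(path) as f:
    text = f.read()

print(f.closed)  # True
```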
No matter what happens inside the with block, whether the code succeeds, raises an exception, or hits a return statement, the file gets closed. This guarantee is what makes context managers essential for any resource that needs cleanup.
You will encounter situations where you want the same guarantee for your own resources, like a database connection, a temporary directory, or a timer. Python's contextlib module makes this straightforward:
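A minimal sketch of a custom timing context manager (the timer helper is hypothetical, built with contextlib.contextmanager):

```python
from contextlib import contextmanager
import time

@contextmanager
def timer(results: dict, label: str):
    # Code before yield runs when entering the with block
    start = time.perf_counter()
    try:
        yield
    finally:
        # Code after yield runs when exiting, even on an exception
        results[label] = time.perf_counter() - start

results = {}
with timer(results, "sleep"):
    time.sleep(0.05)

print(results)  # e.g. {'sleep': 0.05...}
```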
The yield statement is the key. Everything before yield runs when entering the with block, and everything after yield runs when exiting. You will see this pattern in AI frameworks too, for example, when managing GPU memory or tracking experiment runs.
JSON is the lingua franca of AI engineering. LLM API responses come back as JSON. Configuration files are JSON. Evaluation datasets are often JSON or JSONL (one JSON object per line). You will work with JSON constantly.
Python's json module has four core functions, and the naming convention may look confusing initially:
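All four in one sketch (the record is illustrative):

```python
import json
from pathlib import Path
import tempfile

record = {"model": "example-model", "score": 0.92}

# Strings: dumps / loads
s = json.dumps(record)
roundtrip = json.loads(s)

# Files: dump / load
path = Path(tempfile.mkdtemp()) / "record.json"
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded)  # {'model': 'example-model', 'score': 0.92}
```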
The mnemonic: loads and dumps work with strings. load and dump work with files.
Raw JSON dumps are unreadable for humans:
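Compare the default output with indent=2 (the config dict is illustrative):

```python
import json

config = {"model": "example-model", "temperature": 0.7, "tags": ["eval", "v2"]}

print(json.dumps(config))            # one long line, no whitespace to spare
print(json.dumps(config, indent=2))  # nested, human-readable layout
```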
Use indent=2 when writing config files or saving evaluation results that humans will read. Skip it for data that only machines consume, since the extra whitespace adds up at scale.
When working with multilingual data or special characters, you will occasionally hit encoding problems. Always specify UTF-8 explicitly:
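A sketch with a Japanese string (the record is illustrative):

```python
import json
from pathlib import Path
import tempfile

record = {"text": "日本語のテキスト"}
path = Path(tempfile.mkdtemp()) / "record.json"

# ensure_ascii=False keeps the characters readable; always pair it with UTF-8
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

print(path.read_text(encoding="utf-8"))  # {"text": "日本語のテキスト"}
```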
The ensure_ascii=False flag tells json.dumps to write Unicode characters directly instead of escaping them as \uXXXX. This makes the output file readable for humans who speak those languages.
In AI workflows, you will often encounter JSONL format, where each line is a separate JSON object. This is the standard format for fine-tuning datasets, evaluation sets, and log files:
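A self-contained sketch of writing and reading JSONL (the records are illustrative):

```python
import json
from pathlib import Path
import tempfile

records = [
    {"prompt": "What is RAG?", "completion": "Retrieval-augmented generation."},
    {"prompt": "Define embedding.", "completion": "A vector representation."},
]
path = Path(tempfile.mkdtemp()) / "data.jsonl"

# Write: one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read: parse one line at a time, never the whole file at once
loaded = []
with open(path, encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))

print(len(loaded))  # 2
```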
JSONL has a practical advantage over regular JSON arrays: you can append new records without loading the entire file into memory, and you can process records one at a time. That matters when your dataset has millions of rows.
CSV files show up less often than JSON in AI work, but they are still common for tabular datasets, evaluation metrics, and exported spreadsheet data.
The csv module's DictReader is far more pleasant to use than the basic reader, because it gives you dictionaries keyed by column names instead of positional lists:
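A self-contained sketch using an inline CSV string (real code would pass an open file instead of StringIO):

```python
import csv
from io import StringIO

raw = "model,accuracy\nbaseline,0.81\nfinetuned,0.93\n"

rows = list(csv.DictReader(StringIO(raw)))
print(rows[0])  # {'model': 'baseline', 'accuracy': '0.81'}

# Every value is a string, so cast before doing math
best = max(rows, key=lambda row: float(row["accuracy"]))
print(best["model"])  # finetuned
```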
One thing to watch: DictReader returns all values as strings, even numbers. You will need to cast them yourself (float(row['accuracy'])) if you need to do math. Libraries like pandas handle this automatically, but for simple use cases, the built-in csv module is lighter and has zero dependencies.
Almost every AI application needs to call external APIs like LLM providers, embedding services, vector databases, and data sources. The requests library is the standard way to do this in Python.
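A sketch of the basic call pattern; the URL, payload, and function name are placeholders, not a real provider's API:

```python
import requests

def query_api(url: str, payload: dict) -> dict:
    # json= serializes the dict to JSON and sets the Content-Type header
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
    return response.json()

# Usage (requires a live endpoint):
# result = query_api("https://api.example.com/v1/generate", {"prompt": "Hello"})
```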
Notice the json= parameter in the POST request. This is a convenience feature that automatically serializes your dictionary to JSON and sets the Content-Type header. You could also use data=json.dumps(payload), but json= is cleaner.
Every response object gives you several ways to access the data:
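As an offline illustration, here is a Response built by hand (setting the private `_content` attribute, a trick mocking libraries also use; in real code the object comes back from requests.get or requests.post):

```python
import requests

response = requests.models.Response()
response.status_code = 200
response.encoding = "utf-8"
response._content = b'{"status": "ok"}'
response.headers["Content-Type"] = "application/json"

print(response.status_code)   # 200
print(response.text)          # body decoded as a string
print(response.json())        # body parsed as JSON: {'status': 'ok'}
response.raise_for_status()   # no exception, since the status is 2xx
```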
The .raise_for_status() method is important for production code. Without it, a failed API call silently returns a response with an error status code, and you might not notice until much later when your code tries to use the missing data.
Every HTTP request should have a timeout. Without one, a misbehaving server can hang your program indefinitely:
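A sketch using a (connect, read) timeout tuple; the function name and endpoint are placeholders:

```python
import requests

def call_llm(url: str, payload: dict) -> dict:
    # timeout=(connect, read): fail fast if the server is unreachable,
    # but allow up to 60 seconds for a long generation to come back
    response = requests.post(url, json=payload, timeout=(5, 60))
    response.raise_for_status()
    return response.json()
```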
In AI applications, LLM API calls can legitimately take 30 to 60 seconds for long responses, so set your read timeout accordingly. But the connect timeout should be short, 5 to 10 seconds, because if the server is not reachable, waiting longer will not help.
requests is battle-tested and excellent for synchronous code. But if you are building an application that needs to make multiple API calls concurrently, like calling an LLM and a vector database at the same time, you will want async support.
That is where httpx comes in.
The synchronous API is nearly identical to requests, so switching is painless. The real payoff comes with async support:
We will dive deep into async Python in the next chapter. For now, just know that httpx exists and that it is the library you will graduate to when you need concurrency. Many AI SDKs (like Anthropic's Python client) are built on top of httpx internally.
PDFs are one of the most common document types you will encounter in RAG pipelines. Company reports, research papers, legal documents, technical manuals. Extracting clean text from them is a recurring task.
PyMuPDF (imported as fitz) is fast and handles most PDFs well:
The page-by-page extraction is deliberate. In RAG systems, you almost always want to know which page a chunk of text came from, so you can cite sources. Dumping the entire PDF into one string loses that information.
Not all PDFs contain selectable text. Scanned documents are essentially images wrapped in a PDF container, and get_text() returns an empty string for those. For scanned PDFs, you need OCR (Optical Character Recognition).
Libraries like pytesseract or cloud services like Google Cloud Vision can handle this, but that is beyond our scope here. The key thing to know is: always check if your extraction returned meaningful text before feeding it into a pipeline.
Web scraping is how you build training datasets, gather reference material, and enrich your RAG knowledge base with online content. BeautifulSoup is the go-to library for parsing HTML in Python.
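A self-contained sketch using an inline HTML snippet (a real pipeline would fetch the HTML with requests first):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")  # target the content element, skip the nav
text = article.get_text(separator="\n", strip=True)
print(text)
```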
The get_text(separator="\n", strip=True) call is particularly useful. It extracts all text from an element and its children, joining them with newlines and stripping whitespace. This gives you clean, readable text from messy HTML, which is exactly what you want before feeding content to an LLM or embedding model.
A common pattern in AI data collection is to start with a list of URLs, scrape each page for its text content, and then save the results as a JSONL file that your pipeline can ingest later. The combination of requests + BeautifulSoup + json + pathlib covers that entire workflow.
Everything we have covered so far loads entire files into memory. That is fine for a 50KB JSON config or a 2MB PDF, but AI datasets can be enormous. A JSONL file with millions of training examples might be 10GB. Reading that into memory all at once will crash your program.
The simplest approach for large text files is to read them line by line:
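A self-contained sketch; the small generated JSONL file stands in for a multi-gigabyte dataset:

```python
import json
import tempfile
from pathlib import Path

# Create a small JSONL file to stream over
path = Path(tempfile.mkdtemp()) / "big.jsonl"
with open(path, "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

count = 0
with open(path, encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)  # only one line is in memory at a time
        count += 1
        if line_number % 500 == 0:
            print(f"processed {line_number} records")

print(count)  # 1000
```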
The key insight here is that Python's file iterator reads one line at a time. The entire file is never in memory. The enumerate call lets you track progress, which is a sanity saver when processing millions of records.
For binary files or when you need to process data in fixed-size chunks:
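A sketch of the chunked-read generator, demonstrated on a small throwaway binary file:

```python
import tempfile
from pathlib import Path

def read_chunks(path, chunk_size=1024 * 1024):
    # Yield fixed-size byte chunks; only one chunk is in memory at a time
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Demo: 2500 bytes read in 1 KB chunks
path = Path(tempfile.mkdtemp()) / "blob.bin"
path.write_bytes(b"x" * 2500)

sizes = [len(chunk) for chunk in read_chunks(path, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```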
This generator pattern is memory-efficient because it yields one chunk at a time. You will see this pattern in model downloading, dataset streaming, and file upload utilities throughout the AI ecosystem.
Now that you have all the building blocks, let's see how they fit together in a realistic AI workflow. The following diagram shows a typical data ingestion pipeline, the kind you would build for a RAG system or a fine-tuning dataset.
Every source type has its own parser, but they all converge on the same validation and chunking step. The output is a uniform format (usually JSONL) that downstream components, like embedding models or LLM fine-tuning scripts, can consume without caring where the data originally came from.
Here is a simplified version of that pipeline in code:
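This is a minimal sketch using only the standard library; the parser functions are deliberately stubbed placeholders, and the demo files are throwaway:

```python
import json
import tempfile
from pathlib import Path

def parse_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

def parse_json(path: Path) -> str:
    # Flatten a JSON record back to text; real parsers are format-specific
    return json.dumps(json.loads(path.read_text(encoding="utf-8")))

PARSERS = {".txt": parse_text, ".json": parse_json}

def ingest(source_dir: Path, output_path: Path) -> int:
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in source_dir.rglob("*"):        # find files
            parser = PARSERS.get(path.suffix)
            if parser is None:
                continue                           # skip unsupported types
            text = parser(path)                    # parse
            if not text.strip():
                continue                           # validation: drop empty text
            record = {"source": str(path), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")  # JSONL out
            count += 1
    return count

# Demo on throwaway directories
src = Path(tempfile.mkdtemp())
(src / "a.txt").write_text("hello")
(src / "b.json").write_text('{"k": 1}')
out = Path(tempfile.mkdtemp()) / "corpus.jsonl"

print(ingest(src, out))  # 2
```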
This is a starting point. A production pipeline would add error handling, logging, deduplication, and text chunking. But the core pattern, find files, parse them, normalize the output, write to a standard format, stays the same.