Last Updated: March 13, 2026
AI systems constantly interact with data. Whether you are loading datasets, processing logs, storing model outputs, or reading configuration files, the ability to work effectively with files and structured data is essential.
Python provides simple and powerful tools for reading and writing files, as well as handling common data formats such as JSON, CSV, and text files. These capabilities form the foundation of many AI workflows, including data preprocessing, experiment tracking, and model pipelines.
In this chapter, you will learn how to read, write, and manipulate files in Python, along with practical techniques for working with common data formats.
If you have been writing Python for a while, you might be used to os.path.join(), os.path.exists(), and friends. They work, but they are clunky: every path operation requires importing os and calling a function that takes a string, and the result is just another string, not an object you can work with.
pathlib, introduced in Python 3.4, replaces all of that with Path objects. Paths become first-class citizens with methods and operators instead of bare strings passed through utility functions.
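A short sketch of the difference (the file names here are illustrative):

```python
from pathlib import Path

# Build a path with the / operator instead of os.path.join()
data_dir = Path("data")
doc_path = data_dir / "documents" / "report.pdf"

print(doc_path)  # data/documents/report.pdf (backslashes on Windows)
```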
That / operator is not division. pathlib overloads it for path joining, which reads much more naturally than os.path.join("data", "documents"). It also works across operating systems, so you do not need to worry about backslashes on Windows.
Every Path object gives you easy access to parts of the path:
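For example, with an illustrative path:

```python
from pathlib import Path

p = Path("data/documents/report.pdf")

print(p.name)    # report.pdf
print(p.stem)    # report
print(p.suffix)  # .pdf
print(p.parent)  # data/documents (data\documents on Windows)
```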
These are properties, not method calls, so there are no parentheses. You can use .stem and .suffix when processing batches of files, for example, to convert report.pdf into report.txt after extraction.
For small files, pathlib gives you one-liner convenience methods to work with files:
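A minimal sketch, using a throwaway temp directory so it is self-contained:

```python
from pathlib import Path
import tempfile

tmp = Path(tempfile.mkdtemp())
note = tmp / "note.txt"

note.write_text("hello from pathlib", encoding="utf-8")  # open + write + close
text = note.read_text(encoding="utf-8")                  # open + read + close
print(text)  # hello from pathlib
```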
These methods handle opening and closing the file for you. No need for open(), no need for with statements. For quick reads and writes of small files, this is the cleanest approach.
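Creating directories follows the same object style; a short sketch using a throwaway temp directory:

```python
from pathlib import Path
import tempfile

base = Path(tempfile.mkdtemp())
out_dir = base / "outputs" / "run_01"

# parents=True creates intermediate directories; exist_ok=True tolerates reruns
out_dir.mkdir(parents=True, exist_ok=True)
print(out_dir.exists())  # True
```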
The parents=True flag is the equivalent of mkdir -p in the shell. Without it, Python raises an error if the parent directory does not exist. The exist_ok=True flag prevents errors if the directory is already there.
When you need to find files matching a pattern, pathlib has .glob() and .rglob() built in:
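A self-contained sketch with a small throwaway directory tree:

```python
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.pdf").write_text("x")
(root / "sub" / "b.pdf").write_text("y")

top_level = list(root.glob("*.pdf"))    # only a.pdf
everywhere = list(root.rglob("*.pdf"))  # a.pdf and sub/b.pdf
print(len(top_level), len(everywhere))  # 1 2
```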
The difference between .glob() and .rglob() is that .rglob() searches subdirectories recursively. This is the method you will reach for when building document loaders that need to ingest every PDF in a nested folder structure.
pathlib's .read_text() and .write_text() are great for small files, but when you need more control, like reading line by line, appending to a file, or working with binary data, you need open(). And whenever you use open(), you should use a context manager.
Here is the naive approach to reading a file:
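A sketch of that pattern (the file is a throwaway temp file so the example runs on its own):

```python
from pathlib import Path
import tempfile

path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_text("some text")

# Fragile: if read() raises, close() never runs and the file handle leaks
f = open(path)
text = f.read()
f.close()
print(text)  # some text
```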
This works until it does not. If f.read() raises an exception, f.close() never runs. The file handle leaks. Do that enough times in a long-running process, and you hit the operating system's limit on open file descriptors, which crashes your program with a cryptic error.
The with statement fixes this by guaranteeing cleanup:
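The same read, rewritten with a context manager (again using a throwaway temp file):

```python
from pathlib import Path
import tempfile

path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_text("some text")

# The file is closed when the block exits, even if an exception is raised
with open(path) as f:
    text = f.read()

print(f.closed)  # True
```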
No matter what happens inside the with block, whether the code succeeds, raises an exception, or hits a return statement, the file gets closed. This guarantee is what makes context managers essential for any resource that needs cleanup.
You will encounter situations where you want the same guarantee for your own resources, like a database connection, a temporary directory, or a timer. Python's contextlib module makes this straightforward:
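A minimal sketch of a custom timing context manager (the timer helper is hypothetical, built with contextlib.contextmanager):

```python
from contextlib import contextmanager
import time

@contextmanager
def timer(results: dict, label: str):
    # Code before yield runs when entering the with block
    start = time.perf_counter()
    try:
        yield
    finally:
        # Code after yield runs when exiting, even on an exception
        results[label] = time.perf_counter() - start

results = {}
with timer(results, "sleep"):
    time.sleep(0.05)

print(results)  # e.g. {'sleep': 0.05...}
```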
The yield statement is the key. Everything before yield runs when entering the with block, and everything after yield runs when exiting. You will see this pattern in AI frameworks too, for example, when managing GPU memory or tracking experiment runs.
JSON is the lingua franca of AI engineering. LLM API responses come back as JSON. Configuration files are JSON. Evaluation datasets are often JSON or JSONL (one JSON object per line). You will work with JSON constantly.
Python's json module has four core functions, and the naming convention may look confusing initially:
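All four in one sketch (the record is illustrative):

```python
import json
from pathlib import Path
import tempfile

record = {"model": "example-model", "score": 0.92}

# Strings: dumps / loads
s = json.dumps(record)
roundtrip = json.loads(s)

# Files: dump / load
path = Path(tempfile.mkdtemp()) / "record.json"
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded)  # {'model': 'example-model', 'score': 0.92}
```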
The mnemonic: loads and dumps work with strings. load and dump work with files.
Raw JSON dumps are unreadable for humans:
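Compare the default output with indent=2 (the config dict is illustrative):

```python
import json

config = {"model": "example-model", "temperature": 0.7, "tags": ["eval", "v2"]}

print(json.dumps(config))            # one long line, no whitespace to spare
print(json.dumps(config, indent=2))  # nested, human-readable layout
```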
Use indent=2 when writing config files or saving evaluation results that humans will read. Skip it for data that only machines consume, since the extra whitespace adds up at scale.
When working with multilingual data or special characters, you will occasionally hit encoding problems. Always specify UTF-8 explicitly:
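A sketch with a Japanese string (the record is illustrative):

```python
import json
from pathlib import Path
import tempfile

record = {"text": "日本語のテキスト"}
path = Path(tempfile.mkdtemp()) / "record.json"

# ensure_ascii=False keeps the characters readable; always pair it with UTF-8
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

print(path.read_text(encoding="utf-8"))  # {"text": "日本語のテキスト"}
```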
The ensure_ascii=False flag tells json.dumps to write Unicode characters directly instead of escaping them as \uXXXX. This makes the output file readable for humans who speak those languages.
In AI workflows, you will often encounter JSONL format, where each line is a separate JSON object. This is the standard format for fine-tuning datasets, evaluation sets, and log files:
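A self-contained sketch of writing and reading JSONL (the records are illustrative):

```python
import json
from pathlib import Path
import tempfile

records = [
    {"prompt": "What is RAG?", "completion": "Retrieval-augmented generation."},
    {"prompt": "Define embedding.", "completion": "A vector representation."},
]
path = Path(tempfile.mkdtemp()) / "data.jsonl"

# Write: one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read: parse one line at a time, never the whole file at once
loaded = []
with open(path, encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))

print(len(loaded))  # 2
```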
JSONL has a practical advantage over regular JSON arrays: you can append new records without loading the entire file into memory, and you can process records one at a time. That matters when your dataset has millions of rows.
CSV files show up less often than JSON in AI work, but they are still common for tabular datasets, evaluation metrics, and exported spreadsheet data.
The csv module's DictReader is far more pleasant to use than the basic reader, because it gives you dictionaries keyed by column names instead of positional lists:
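A self-contained sketch using an inline CSV string (real code would pass an open file instead of StringIO):

```python
import csv
from io import StringIO

raw = "model,accuracy\nbaseline,0.81\nfinetuned,0.93\n"

rows = list(csv.DictReader(StringIO(raw)))
print(rows[0])  # {'model': 'baseline', 'accuracy': '0.81'}

# Every value is a string, so cast before doing math
best = max(rows, key=lambda row: float(row["accuracy"]))
print(best["model"])  # finetuned
```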
One thing to watch: DictReader returns all values as strings, even numbers. You will need to cast them yourself (float(row['accuracy'])) if you need to do math. Libraries like pandas handle this automatically, but for simple use cases, the built-in csv module is lighter and has zero dependencies.
Almost every AI application needs to call external APIs like LLM providers, embedding services, vector databases, and data sources. The requests library is the standard way to do this in Python.
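A sketch of the basic call pattern; the URL, payload, and function name are placeholders, not a real provider's API:

```python
import requests

def query_api(url: str, payload: dict) -> dict:
    # json= serializes the dict to JSON and sets the Content-Type header
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
    return response.json()

# Usage (requires a live endpoint):
# result = query_api("https://api.example.com/v1/generate", {"prompt": "Hello"})
```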
Notice the json= parameter in the POST request. This is a convenience feature that automatically serializes your dictionary to JSON and sets the Content-Type header. You could also use data=json.dumps(payload), but json= is cleaner.
Every response object gives you several ways to access the data:
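As an offline illustration, here is a Response built by hand (setting the private `_content` attribute, a trick mocking libraries also use; in real code the object comes back from requests.get or requests.post):

```python
import requests

response = requests.models.Response()
response.status_code = 200
response.encoding = "utf-8"
response._content = b'{"status": "ok"}'
response.headers["Content-Type"] = "application/json"

print(response.status_code)   # 200
print(response.text)          # body decoded as a string
print(response.json())        # body parsed as JSON: {'status': 'ok'}
response.raise_for_status()   # no exception, since the status is 2xx
```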
The .raise_for_status() method is important for production code. Without it, a failed API call silently returns a response with an error status code, and you might not notice until much later when your code tries to use the missing data.
Every HTTP request should have a timeout. Without one, a misbehaving server can hang your program indefinitely:
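A sketch using a (connect, read) timeout tuple; the function name and endpoint are placeholders:

```python
import requests

def call_llm(url: str, payload: dict) -> dict:
    # timeout=(connect, read): fail fast if the server is unreachable,
    # but allow up to 60 seconds for a long generation to come back
    response = requests.post(url, json=payload, timeout=(5, 60))
    response.raise_for_status()
    return response.json()
```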
In AI applications, LLM API calls can legitimately take 30 to 60 seconds for long responses, so set your read timeout accordingly. But the connect timeout should be short, 5 to 10 seconds, because if the server is not reachable, waiting longer will not help.
requests is battle-tested and excellent for synchronous code. But if you are building an application that needs to make multiple API calls concurrently, like calling an LLM and a vector database at the same time, you will want async support.
That is where httpx comes in.
The synchronous API is nearly identical to requests, so switching is painless. The real payoff comes with async support:
We will dive deep into async Python in the next chapter. For now, just know that httpx exists and that it is the library you will graduate to when you need concurrency. Many AI SDKs (like Anthropic's Python client) are built on top of httpx internally.
PDFs are one of the most common document types you will encounter in RAG pipelines. Company reports, research papers, legal documents, technical manuals. Extracting clean text from them is a recurring task.
PyMuPDF (imported as fitz) is fast and handles most PDFs well:
The page-by-page extraction is deliberate. In RAG systems, you almost always want to know which page a chunk of text came from, so you can cite sources. Dumping the entire PDF into one string loses that information.
Not all PDFs contain selectable text. Scanned documents are essentially images wrapped in a PDF container, and get_text() returns an empty string for those. For scanned PDFs, you need OCR (Optical Character Recognition).
Libraries like pytesseract or cloud services like Google Cloud Vision can handle this, but that is beyond our scope here. The key thing to know is: always check if your extraction returned meaningful text before feeding it into a pipeline.
Web scraping is how you build training datasets, gather reference material, and enrich your RAG knowledge base with online content. BeautifulSoup is the go-to library for parsing HTML in Python.
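A self-contained sketch using an inline HTML snippet (a real pipeline would fetch the HTML with requests first):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")  # target the content element, skip the nav
text = article.get_text(separator="\n", strip=True)
print(text)
```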
The get_text(separator="\n", strip=True) call is particularly useful. It extracts all text from an element and its children, joining them with newlines and stripping whitespace. This gives you clean, readable text from messy HTML, which is exactly what you want before feeding content to an LLM or embedding model.
A common pattern in AI data collection is to start with a list of URLs, scrape each page for its text content, and then save the results as a JSONL file that your pipeline can ingest later. The combination of requests + BeautifulSoup + json + pathlib covers that entire workflow.
Everything we have covered so far loads entire files into memory. That is fine for a 50KB JSON config or a 2MB PDF, but AI datasets can be enormous. A JSONL file with millions of training examples might be 10GB. Reading that into memory all at once will crash your program.
The simplest approach for large text files is to read them line by line:
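A self-contained sketch; the small generated JSONL file stands in for a multi-gigabyte dataset:

```python
import json
import tempfile
from pathlib import Path

# Create a small JSONL file to stream over
path = Path(tempfile.mkdtemp()) / "big.jsonl"
with open(path, "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

count = 0
with open(path, encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)  # only one line is in memory at a time
        count += 1
        if line_number % 500 == 0:
            print(f"processed {line_number} records")

print(count)  # 1000
```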
The key insight here is that Python's file iterator reads one line at a time. The entire file is never in memory. The enumerate call lets you track progress, which is a sanity saver when processing millions of records.
For binary files or when you need to process data in fixed-size chunks:
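A sketch of the chunked-read generator, demonstrated on a small throwaway binary file:

```python
import tempfile
from pathlib import Path

def read_chunks(path, chunk_size=1024 * 1024):
    # Yield fixed-size byte chunks; only one chunk is in memory at a time
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Demo: 2500 bytes read in 1 KB chunks
path = Path(tempfile.mkdtemp()) / "blob.bin"
path.write_bytes(b"x" * 2500)

sizes = [len(chunk) for chunk in read_chunks(path, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```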
This generator pattern is memory-efficient because it yields one chunk at a time. You will see this pattern in model downloading, dataset streaming, and file upload utilities throughout the AI ecosystem.
Now that you have all the building blocks, let's see how they fit together in a realistic AI workflow. The following diagram shows a typical data ingestion pipeline, the kind you would build for a RAG system or a fine-tuning dataset.
Every source type has its own parser, but they all converge on the same validation and chunking step. The output is a uniform format (usually JSONL) that downstream components, like embedding models or LLM fine-tuning scripts, can consume without caring where the data originally came from.
Here is a simplified version of that pipeline in code:
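This is a minimal sketch using only the standard library; the parser functions are deliberately stubbed placeholders, and the demo files are throwaway:

```python
import json
import tempfile
from pathlib import Path

def parse_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

def parse_json(path: Path) -> str:
    # Flatten a JSON record back to text; real parsers are format-specific
    return json.dumps(json.loads(path.read_text(encoding="utf-8")))

PARSERS = {".txt": parse_text, ".json": parse_json}

def ingest(source_dir: Path, output_path: Path) -> int:
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in source_dir.rglob("*"):        # find files
            parser = PARSERS.get(path.suffix)
            if parser is None:
                continue                           # skip unsupported types
            text = parser(path)                    # parse
            if not text.strip():
                continue                           # validation: drop empty text
            record = {"source": str(path), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")  # JSONL out
            count += 1
    return count

# Demo on throwaway directories
src = Path(tempfile.mkdtemp())
(src / "a.txt").write_text("hello")
(src / "b.json").write_text('{"k": 1}')
out = Path(tempfile.mkdtemp()) / "corpus.jsonl"

print(ingest(src, out))  # 2
```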
This is a starting point. A production pipeline would add error handling, logging, deduplication, and text chunking. But the core pattern, find files, parse them, normalize the output, write to a standard format, stays the same.