{"title":"Python Essentials for AI","description":"","content":"Python is the working language for much of AI engineering. You will use it for data preparation, evaluation scripts, model-serving code, RAG pipelines, and the small tools that connect systems together.\n\nThis chapter covers the Python features that show up repeatedly in AI codebases: collections, comprehensions, unpacking, formatting, slicing, built-in iteration tools, text handling, and a few everyday idioms. The goal is not to memorize every Python feature. It is to build habits that keep your code clear when you are working with real data and external APIs.\n\n---\n\n# Choosing the Right Data Structure\n\nPython's core collection types are simple, but choosing the right one matters. A good choice means fewer conversions, fewer hidden assumptions about ordering, and fewer bugs around mutation.\n\nHere is how they compare:\n\n\n| Structure | Ordered? | Mutable? | Duplicates? | AI Use Case |\n|-----------|----------|----------|-------------|-------------|\n| `list` | Yes | Yes | Yes | Storing embeddings, token sequences, batch results |\n| `dict` | Yes (3.7+) | Yes | Keys: No | Model configs, API responses, token-to-id mappings |\n| `set` | No | Yes | No | Vocabulary deduplication, stopword filtering |\n| `tuple` | Yes | No | Yes | Immutable coordinates, function return values, dict keys |\n\n\nHere is how each one commonly appears in AI projects.\n\n### Lists: Ordered Sequence\n\nLists are ordered, mutable sequences. 
In AI code, lists hold everything from raw token sequences to batches of embeddings.\n\n\n**main.py**\n\n```python\n# A list of token IDs\ntoken_ids = [101, 7592, 1010, 2129, 2024, 2017, 102]\nprint(token_ids)\n\n# A batch of embedding vectors\nembeddings = [\n [0.12, -0.34, 0.56, 0.78],\n [0.91, -0.23, 0.45, 0.67],\n [0.33, -0.11, 0.89, 0.22],\n]\nprint(embeddings)\n\nresults = []\nresults.append({\"prompt\": \"What is RAG?\", \"response\": \"...\", \"score\": 0.85})\nresults.append({\"prompt\": \"Explain embeddings\", \"response\": \"...\", \"score\": 0.92})\nprint(results)\n```\n\n\n### Dicts: Configuration and Mapping\n\nDicts are key-value stores. They are everywhere in AI code: model settings, API payloads, token vocabularies, lookup tables, and parsed JSON responses.\n\n\n**main.py**\n\n```python\nmodel_config = {\n \"model\": \"large-model\",\n \"temperature\": 0.7,\n \"max_output_tokens\": 1024,\n \"top_p\": 0.9,\n}\nprint(model_config)\n\nvocab = {\"hello\": 7592, \"world\": 2088, \"[CLS]\": 101, \"[SEP]\": 102}\nprint(vocab)\n\ntoken_id = vocab.get(\"unknown_word\", 0)\nprint(token_id)\n```\n\n\n### Sets: Deduplication and Fast Lookup\n\nSets are unordered collections of unique elements. Average-case membership checks are constant time, which matters when you filter thousands of tokens or document IDs.\n\n\n**main.py**\n\n```python\nstopwords = {\"the\", \"a\", \"an\", \"is\", \"at\", \"in\", \"on\", \"and\", \"or\"}\n\ntokens = [\"the\", \"cat\", \"is\", \"on\", \"the\", \"mat\"]\nfiltered = [t for t in tokens if t not in stopwords]\nprint(filtered) # ['cat', 'mat']\n\nunique_words = set([\"apple\", \"banana\", \"apple\", \"cherry\", \"banana\"])\nprint(unique_words) # Print order is not guaranteed\n```\n\n\nIf you need both uniqueness and order, a common pattern is `list(dict.fromkeys(items))`. This preserves insertion order while removing duplicates.\n\n### Tuples: Immutable and Hashable\n\nTuples are like lists but immutable. 
You cannot change them after creation. This makes them useful as dict keys (lists cannot be dict keys because they are mutable) and as return values from functions.\n\n\n**main.py**\n\n```python\nimage_shape = (3, 224, 224) # channels, height, width\n\nembedding_cache = {}\nembedding_cache[(0.1, 0.2, 0.3)] = \"document_42\"\n\ndef evaluate_model(predictions, labels):\n precision = 0.91\n recall = 0.87\n f1 = 0.89\n return precision, recall, f1\n\npredictions = [\"positive\", \"negative\", \"positive\"]\nlabels = [\"positive\", \"negative\", \"negative\"]\np, r, f1 = evaluate_model(predictions, labels)\nprint(p, r, f1)\n```\n\n\nReturning multiple values and unpacking them at the call site is routine Python. Use it for small, tightly related results. Once the return value has more structure or survives beyond a few lines, prefer a dataclass or Pydantic model with named fields.\n\n---\n\n# Comprehensions: Data Processing in One Line\n\nComprehensions let you transform or filter a collection in a single expression. In AI work, they are common in preprocessing, feature extraction, evaluation summaries, and batch assembly.\n\n### List Comprehensions\n\nThe basic pattern is `[expression for item in iterable if condition]`.\n\n\n**main.py**\n\n```python\nraw_scores = [0.2, 0.8, 0.5, 0.1, 0.9]\nmin_val, max_val = min(raw_scores), max(raw_scores)\nnormalized = [(s - min_val) / (max_val - min_val) for s in raw_scores]\nprint(normalized)\n\npredictions = [\n {\"label\": \"positive\", \"score\": 0.95},\n {\"label\": \"negative\", \"score\": 0.42},\n {\"label\": \"positive\", \"score\": 0.88},\n {\"label\": \"neutral\", \"score\": 0.31},\n]\nconfident = [p for p in predictions if p[\"score\"] > 0.5]\nprint(confident)\n\nlabels = [p[\"label\"] for p in confident]\nprint(labels) # ['positive', 'positive']\n```\n\n\n### Dict Comprehensions\n\nSame idea, but the result is a dictionary. 
This is useful for building lookup tables and inverting mappings.\n\n\n**main.py**\n\n```python\nwords = [\"hello\", \"world\", \"embedding\", \"vector\", \"token\"]\nword_to_idx = {word: idx for idx, word in enumerate(words)}\nprint(word_to_idx)\n\nidx_to_word = {idx: word for word, idx in word_to_idx.items()}\nprint(idx_to_word)\n\nconfig = {\n \"model\": \"large-model\",\n \"temperature\": 0.7,\n \"max_output_tokens\": 1024,\n \"stream\": True,\n}\nnumeric_params = {\n k: v\n for k, v in config.items()\n if isinstance(v, (int, float)) and not isinstance(v, bool)\n}\nprint(numeric_params)\n```\n\n\n### Set Comprehensions\n\nSet comprehensions are useful when you want unique values from a collection.\n\n\n**main.py**\n\n```python\ndataset = [\n {\"text\": \"Great product!\", \"label\": \"positive\"},\n {\"text\": \"Terrible service\", \"label\": \"negative\"},\n {\"text\": \"It was okay\", \"label\": \"neutral\"},\n {\"text\": \"Love it!\", \"label\": \"positive\"},\n]\nunique_labels = {item[\"label\"] for item in dataset}\nprint(unique_labels) # Print order varies\n```\n\n\nDo not turn comprehensions into puzzles. A single filter or transform is usually clear. A double-nested comprehension can be fine when it reads naturally. If you need three levels, side effects, or exception handling, use a regular loop. 
Preprocessing code should be easy to review.\n\n---\n\n# Tuple Unpacking and Multiple Returns\n\nUnpacking appears everywhere in Python AI code: evaluation loops, dataset iteration, metric returns, and batch processing.\n\n### Basic Unpacking\n\n\n**main.py**\n\n```python\npoint = (3, 7)\nx, y = point\n\npairs = [(\"cat\", 0.95), (\"dog\", 0.82), (\"bird\", 0.71)]\nfor animal, score in pairs:\n print(f\"{animal}: {score}\")\n\nmodel_name, _, version = (\"model-family\", \"unused\", \"v2\")\nprint(model_name, version)\n```\n\n\n### Star Unpacking\n\nThe `*` operator captures \"the rest\" into a list.\n\n\n**main.py**\n\n```python\nscores = [0.95, 0.88, 0.76, 0.71, 0.65]\nbest, *middle, worst = scores\nprint(best, middle, worst)\n\nlines = [\"name,score,label\", \"doc1,0.95,positive\", \"doc2,0.42,negative\"]\nheader, *data_rows = lines\nprint(header)\nprint(data_rows)\n```\n\n\n### Functions Returning Multiple Values\n\nPython functions commonly return tuples, and callers unpack them directly. 
This pattern is common in training loops, evaluation code, and small utility functions.\n\n\n**main.py**\n\n```python\ndef train_epoch():\n loss = 0.342\n accuracy = 0.891\n num_samples = 1024\n return loss, accuracy, num_samples\n\nloss, acc, n = train_epoch()\nprint(f\"Loss: {loss:.3f}, Accuracy: {acc:.1%}, Samples: {n}\")\n```\n\n\n---\n\n# F-Strings: Clean Formatting\n\nF-strings, short for formatted string literals, are the standard way to interpolate values into strings in modern Python.\n\n\n**main.py**\n\n```python\nmodel = \"large-model\"\ntokens = 1523\ncost = 0.00457\n\nprint(f\"Model: {model}, Tokens: {tokens}, Cost: ${cost}\")\n\nprint(f\"Cost: ${cost:.4f}\") # 4 decimal places: $0.0046\nprint(f\"Accuracy: {0.8912:.1%}\") # Percentage: 89.1%\nprint(f\"Tokens: {tokens:,}\") # Thousands separator: 1,523\nprint(f\"{'Model':<20} {'Score':>10}\") # Alignment example\n\nscores = [0.9, 0.85, 0.78]\nprint(f\"Average: {sum(scores)/len(scores):.2f}\") # Average: 0.84\n```\n\n\nOlder approaches like `%` formatting and `.format()` still work and appear in older code. For new application code, f-strings are usually clearer and easier to scan.\n\n---\n\n# Slicing: Working with Sequences\n\nSlicing extracts portions of lists, strings, arrays, and tensors. The syntax is `sequence[start:stop:step]`, where `start` is inclusive and `stop` is exclusive.\n\n\n**main.py**\n\n```python\ntokens = [\"[CLS]\", \"how\", \"does\", \"RAG\", \"work\", \"?\", \"[SEP]\"]\n\ncontent = tokens[1:-1]\nprint(content)\n\nfirst_three = tokens[:3]\nprint(first_three)\n\nlast_two = tokens[-2:]\nprint(last_two)\n\nevery_other = tokens[::2]\nprint(every_other)\n\nreversed_tokens = tokens[::-1]\nprint(reversed_tokens)\n```\n\n\n### Why Slicing Matters for AI\n\nWhen you work with NumPy arrays and PyTorch tensors, slicing becomes essential. 
The same basic syntax carries over:\n\n\n**main.py**\n\n```python\nimport numpy as np\n\nembeddings = np.array([\n    [0.1, 0.2, 0.3, 0.4],\n    [0.5, 0.6, 0.7, 0.8],\n    [0.9, 1.0, 1.1, 1.2],\n])\n\n# First two rows (a batch of two vectors)\nbatch = embeddings[:2]\nprint(batch)\n\n# All rows, first three columns\nreduced = embeddings[:, :3]\nprint(reduced)\n```\n\n\nYou do not need to fully understand NumPy yet. The slicing syntax you learn here transfers directly to the numerical computing libraries you will use later.\n\n### String Slicing\n\nStrings support the same slicing syntax. This is useful for truncating prompts, extracting prefixes, or working with fixed-format text.\n\n\n**main.py**\n\n```python\nprompt = \"Explain how transformers work in simple terms\"\n\ntruncated = prompt[:20]\nprint(truncated)\n\nfilename = \"model_weights.safetensors\"\nextension = filename[filename.rfind(\".\"):]\nprint(extension)\n```\n\n\n---\n\n# The Walrus Operator `:=`\n\nThe walrus operator (`:=`), introduced in Python 3.8, assigns a value as part of an expression.\n\nUse it sparingly. It is helpful when it removes duplicated work without making the condition harder to read.\n\n\n**main.py**\n\n```python\nimport io\n\nchunks = iter([\"RAG \", \"adds \", \"retrieval.\\n\"])  # stand-in for a streaming response\nwhile (chunk := next(chunks, None)) is not None:\n    print(chunk, end=\"\", flush=True)\n\n# readline() returns \"\" at end of input, which ends the loop.\nfile = io.StringIO(\"how does rag work\\nwhat is rag\\n\")\nwhile (line := file.readline()):\n    print(line.split())\n\n# Toy scorer for illustration: share of query words found in the text.\ndef compute_similarity(text, query):\n    return len(set(text.split()) & set(query.split())) / len(query.split())\n\ndocuments = [\"intro to RAG\", \"RAG pipelines\", \"vector databases\"]\nquery = \"RAG pipelines\"\nresults = []\nfor text in documents:\n    if (score := compute_similarity(text, query)) > 0.8:\n        results.append({\"text\": text, \"score\": score})\nprint(results)\n```\n\n\nWithout `:=`, the last example would need to compute the score before the `if` and then use it again inside the block. That is fine too. Use the walrus operator only when it improves readability.\n\n---\n\n# Essential Python Idioms\n\nPython has built-in functions that replace manual indexing and repetitive loops. 
These idioms make data-processing code shorter without hiding the work.\n\n### enumerate: Loop with Index\n\nInstead of maintaining a separate counter variable, use `enumerate`:\n\n\n**main.py**\n\n```python\ndocuments = [\"intro to RAG\", \"embedding models\", \"vector databases\"]\n\nfor i in range(len(documents)):\n print(f\"Doc {i}: {documents[i]}\")\n\nfor i, doc in enumerate(documents):\n print(f\"Doc {i}: {doc}\")\n\nfor i, doc in enumerate(documents, start=1):\n print(f\"Doc {i}: {doc}\")\n```\n\n\n### zip: Iterate in Parallel\n\n`zip` pairs up elements from two or more iterables.\n\n\n**main.py**\n\n```python\nmodels = [\"large-model\", \"balanced-model\", \"small-local-model\"]\nscores = [0.92, 0.89, 0.85]\nlatencies = [1.2, 1.8, 0.6]\n\nfor model, score, latency in zip(models, scores, latencies):\n print(f\"{model}: score={score}, latency={latency}s\")\n\nmodel_scores = dict(zip(models, scores))\nprint(model_scores)\n```\n\n\n### any and all: Bulk Boolean Checks\n\nThese short-circuit through an iterable and return a single boolean. 
Think of `any` as \"does at least one item satisfy this?\" and `all` as \"do all items satisfy this?\"\n\n\n**main.py**\n\n```python\nscores = [0.95, 0.88, 0.42, 0.76]\n\nhas_high_score = any(s > 0.9 for s in scores) # True\nprint(has_high_score)\n\nall_passing = all(s > 0.5 for s in scores) # False (0.42 fails)\nprint(all_passing)\n\nrequired = [\"model\", \"temperature\", \"max_output_tokens\"]\nconfig = {\"model\": \"large-model\", \"temperature\": 0.7}\nmissing = [f for f in required if f not in config]\nhas_missing = any(f not in config for f in required) # True\nprint(missing)\nprint(has_missing)\n```\n\n\n### sorted with key: Custom Sorting\n\nThe `key` parameter lets you sort by derived values without changing the data itself.\n\n\n**main.py**\n\n```python\nresults = [\n {\"doc\": \"RAG tutorial\", \"score\": 0.72},\n {\"doc\": \"Vector DB guide\", \"score\": 0.95},\n {\"doc\": \"Embedding basics\", \"score\": 0.88},\n]\nranked = sorted(results, key=lambda r: r[\"score\"], reverse=True)\nprint(ranked)\n\nwords = [\"transformer\", \"RAG\", \"embedding\", \"LLM\"]\nby_length = sorted(words, key=len)\nprint(by_length)\n```\n\n\n### dict.get with Defaults\n\nUsing `.get()` with a default value is the standard pattern when a missing key has a legitimate fallback.\n\n\n**main.py**\n\n```python\nconfig = {\"model\": \"large-model\", \"temperature\": 0.7}\n\n# This would raise KeyError if the key is missing:\n# max_output_tokens = config[\"max_output_tokens\"]\n\nmax_output_tokens = config.get(\"max_output_tokens\", 1024)\nstream = config.get(\"stream\", False) # False\nprint(max_output_tokens, stream)\n```\n\n\n---\n\n# String Methods for NLP and Text Processing\n\nWhen you work with text data, you will lean on a small set of string methods. They are useful for preprocessing inputs, parsing outputs, and preparing prompts.\n\n### split and join: Tokenization Basics\n\n`split` breaks a string into a list. `join` does the reverse. 
They are not a replacement for a model tokenizer, but they are useful for basic text processing.\n\n\n**main.py**\n\n```python\ntext = \"How does retrieval augmented generation work?\"\ntokens = text.split()\nprint(tokens)\n\ncsv_line = \"positive,0.95,This product is great\"\nlabel, score, text = csv_line.split(\",\", maxsplit=2)\nprint(label, score, text)\n\ncleaned_tokens = [\"how\", \"does\", \"rag\", \"work\"]\nsentence = \" \".join(cleaned_tokens)\nprint(sentence)\n\nitems = [\"retrieval\", \"augmented\", \"generation\"]\nformatted = \", \".join(items)\nprint(formatted)\n```\n\n\n### strip, replace, lower: Cleaning Text\n\nMessy text is the norm in real data. These methods handle the most common cleaning tasks.\n\n\n**main.py**\n\n```python\nraw_response = \"\\n The answer is 42. \\n\\n\"\nclean = raw_response.strip()\nprint(clean)\n\nquery = \"What is RAG?\"\nnormalized = query.lower()\nprint(normalized)\n\ntext = \"Hello\\nWorld\\n\\nHow are you?\"\nsingle_line = text.replace(\"\\n\", \" \")\nprint(single_line)\n\ndirty = \" \\n HELLO, WORLD! \\t \"\nclean = dirty.strip().lower().replace(\"!\", \"\")\nprint(clean)\n```\n\n\n### startswith and endswith: Pattern Matching\n\nThese are cleaner than slicing for checking prefixes and suffixes, and they accept tuples for checking multiple patterns at once.\n\n\n**main.py**\n\n```python\nfilename = \"model_weights.safetensors\"\n\nif filename.endswith(\".safetensors\"):\n print(\"SafeTensors format\")\n\nif filename.endswith((\".pt\", \".pth\", \".safetensors\")):\n print(\"Model weights file\")\n\nmodel_files = [\"encoder.onnx\", \"ranker.pt\", \"tokenizer.json\", \"weights.safetensors\"]\nweight_files = [\n name\n for name in model_files\n if name.endswith((\".pt\", \".safetensors\"))\n]\nprint(weight_files)\n```\n\n\n---\n\n# Ternary Expressions\n\nPython's ternary expression is `value_if_true if condition else value_if_false`. 
It is useful for simple value selection.\n\n\n**main.py**\n\n```python\nscore = 0.85\n\n# Ternary expression\nlabel = \"high confidence\" if score > 0.8 else \"low confidence\"\nprint(label)\n\nuse_large_model = True\nmodel = \"large-model\" if use_large_model else \"small-model\"\nprint(model)\n\nuser_temperature = None\ntemperature = user_temperature if user_temperature is not None else 0.7\nprint(temperature)\n\nfor name, val in [(\"accuracy\", 0.92), (\"loss\", 0.45)]:\n status = \"good\" if val > 0.8 else \"needs work\"\n print(f\"{name}: {val} ({status})\")\n```\n\n\nKeep ternary expressions short. If the logic is more complex, use a regular `if`/`else` block. Nested ternaries are legal, but they are usually harder to read than they are worth.\n\n---\n\n# None Checks and Truthiness\n\nPython's `None` represents an explicit missing value. Python also has a broader concept of \"truthiness\", which you need to understand to avoid subtle bugs.\n\n### What is Falsy in Python?\n\nThese values all evaluate to `False` in a boolean context:\n\n\n| Value | Type | Note |\n|-------|------|------|\n| `None` | NoneType | The explicit \"nothing\" value |\n| `False` | bool | The boolean false value |\n| `0` | int | Zero is falsy |\n| `0.0` | float | Zero float is falsy |\n| `\"\"` | str | Empty string is falsy |\n| `[]` | list | Empty list is falsy |\n| `{}` | dict | Empty dict is falsy |\n| `set()` | set | Empty set is falsy |\n\n\nMost other values are truthy. 
This is convenient for empty-collection checks, but it can misfire when `0`, `\"\"`, or `[]` are valid inputs.\n\n### The Difference Between `is None` and Truthiness\n\n\n**main.py**\n\n```python\n# Placeholder for a real embedding call (hypothetical helper).\ndef compute_embedding(text, model):\n    return [float(len(text))]\n\ndef get_embedding(text, model=None):\n    # `is None` separates \"not provided\" from other falsy values.\n    if model is None:\n        model = \"text-embedding-3-small\"\n    return compute_embedding(text, model)\n\ndef process_scores(scores=None):\n    # Only None gets a default; an explicitly passed empty list is kept.\n    if scores is None:\n        scores = []\n    return scores\n\n# Truthiness is fine when \"empty\" and \"missing\" mean the same thing.\nresults = []\nif not results:\n    print(\"No results found\")\n\n# Placeholder for a real API call (hypothetical helper).\ndef api_call():\n    return {\"text\": \"ok\"}\n\nresponse = api_call()\nif response:\n    print(response[\"text\"])\n```\n\n\n### A Common Pitfall: Mutable Default Arguments\n\nThis is a common Python mistake. Do not use a mutable default argument:\n\n\n**main.py**\n\n```python\n# Buggy: the default list is created once and shared across calls.\ndef add_document(doc, collection=[]):\n    collection.append(doc)\n    return collection\n\nprint(add_document(\"a\"))  # ['a']\nprint(add_document(\"b\"))  # ['a', 'b'] (state leaked from the first call)\n\n# Fixed: default to None and create a new list inside the function.\ndef add_document(doc, collection=None):\n    if collection is None:\n        collection = []\n    collection.append(doc)\n    return collection\n\nprint(add_document(\"a\"))  # ['a']\nprint(add_document(\"b\"))  # ['b']\n```\n\n\nThe problem with the first version is that the default `[]` is created once when the function is defined, not each time it is called. Every call without an explicit `collection` argument shares and mutates the same list. This comes up often in AI code when you are accumulating documents, messages, examples, or results.\n\n---\n\n### Summary\n\nKey takeaways:\n\n- **Lists** are your default collection for ordered, mutable data. Use them for token sequences, embeddings, and batch results.\n- **Dicts** map keys to values. Use them for model configs, vocabularies, and API payloads. Use `.get()` when a missing key has a real fallback.\n- **Sets** provide average-case O(1) membership testing. Use them for stopword filtering and deduplication.\n- **Tuples** are immutable ordered values. Use them for small multi-value returns and dict keys.\n- **Comprehensions** (list, dict, set) are a concise way to transform and filter data when the logic stays simple.\n- **Tuple unpacking** lets you assign multiple values at once. 
Combined with `enumerate` and `zip`, it eliminates most index-based loops.\n- **F-strings** cover most everyday string formatting. Use format specifiers like `:.2f` for floats and `:.1%` for percentages.\n- **Slicing** with `[start:stop:step]` works on lists and strings, and the same syntax transfers to NumPy arrays and PyTorch tensors.\n- **Python idioms** like `enumerate`, `zip`, `any`/`all`, `sorted(key=...)`, and `dict.get()` reduce boilerplate in everyday data-processing code.\n- **None checks** should use `is None`, not truthiness, when `0`, `\"\"`, or `[]` are valid values. Never use mutable default arguments.\n\n---\n\n### References\n\n- [Python Data Structures (Official Docs)](https://docs.python.org/3/tutorial/datastructures.html)\n- [PEP 572: Assignment Expressions (Walrus Operator)](https://peps.python.org/pep-0572/)\n- [PEP 498: Literal String Interpolation (F-Strings)](https://peps.python.org/pep-0498/)\n- [Python Built-in Functions](https://docs.python.org/3/library/functions.html)\n- [Real Python: Python String Methods](https://realpython.com/python-strings/)","pageType":"ai-engineering"}
