
The next Open Mic session starts on Friday 23rd May 2025 at 9:30; you can join via this link. Speaker Evgenii Grigorev will unpack how Large Language Models (LLMs) such as GPT-4 and CodeLlama generate code, blending theory with real-world examples.

LLM-powered coding assistants
Abstract

Code-generating LLMs are not wizards — they’re sophisticated pattern matchers trained on terabytes of code. But how do they turn a prompt like “Sort this CSV by date and calculate weekly averages” into working Python? This session will demystify:

  • Core mechanics: Transformers, attention layers, and tokenization.
  • Training secrets: from GitHub scrapes to context-aware fine-tuning.
  • Why they fail: hallucinations, hidden biases, and the “copy-paste paradox”.
  • Worked examples: data-analysis snippets (Pandas, SQL) that illustrate the key concepts (see the sketch after this list).
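As a taste of those examples, here is a minimal sketch of the kind of Python an assistant might produce for the CSV prompt above. The file name sales.csv and the column names date and value are illustrative assumptions, not part of the talk:

```python
# Hypothetical LLM output for: "Sort this CSV by date and calculate
# weekly averages". File and column names are made up for this sketch.
import pandas as pd

# Parse dates on load so sorting and resampling operate on real timestamps
df = pd.read_csv("sales.csv", parse_dates=["date"])
df = df.sort_values("date")

# Bucket rows into calendar weeks and average the numeric column
weekly_avg = df.set_index("date")["value"].resample("W").mean()
print(weekly_avg)
```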

Outline

  1. Introduction to LLMs: Transformers, tokenization, and the “autocomplete on steroids” paradigm (a toy sketch follows this outline).

  2. Tools Deep Dive: GitHub Copilot, ChatGPT, CodeWhisperer, and open-source alternatives (StarCoder, Llama 3).

  3. Under the Hood: Training on GitHub data, context window limitations, and safety guardrails.

  4. Pros vs. Cons: 55% faster coding (GitHub study) vs. 40% of generated code containing vulnerabilities (Stanford research).
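To make item 1 concrete, here is a toy sketch of the “autocomplete on steroids” loop: generation is nothing more than repeated next-token prediction. The bigram table below stands in for a real transformer and is entirely made up for illustration:

```python
# Toy greedy decoding loop. A real LLM replaces the BIGRAMS lookup with
# a transformer that scores every vocabulary token given the full context.
BIGRAMS = {
    "def": "sort_csv",
    "sort_csv": "(",
    "(": "path",
    "path": ")",
    ")": ":",
}

def next_token(context):
    # Greedy choice: pick the single most likely continuation (real
    # assistants usually sample from a probability distribution instead)
    return BIGRAMS.get(context[-1])

tokens = ["def"]
while (tok := next_token(tokens)) is not None:
    tokens.append(tok)

print(" ".join(tokens))  # -> def sort_csv ( path ) :
```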
