
Tokens and Embeddings in LLMs

Author: Venkata Sudhakar

Tokens are the basic units that LLMs work with. Rather than processing text character-by-character or word-by-word, LLMs split text into tokens using a process called tokenisation. A token typically represents a word, a sub-word, or a punctuation mark. For example, the word "unhappiness" might be split into sub-word tokens such as "un", "happi", and "ness" - and the model learns patterns at the token level. The OpenAI GPT family uses a tokeniser based on Byte Pair Encoding (BPE), in which common words are single tokens and rare words are broken into sub-word tokens.
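To make the BPE idea concrete, here is a toy sketch of the merge loop that real tokenisers pre-train on large corpora. The three-word corpus and the number of merges are purely illustrative; production tokenisers such as tiktoken work on bytes and ship a pre-trained merge table.

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word starts as a tuple of characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

for _ in range(3):  # learn 3 merges
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", list(corpus))
```

After a few merges the frequent stem "low" becomes a single token, while the rarer suffixes "er" and "est" remain split - exactly the common-word/rare-word behaviour described above.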

Embeddings are numerical vector representations of tokens. Each token is mapped to a dense vector of floating-point numbers - for example, 1536 dimensions in OpenAI's text-embedding-3-small. These vectors capture semantic meaning: words with similar meanings have vectors that are numerically close together in high-dimensional space. This is why classic word embeddings famously support arithmetic like "king - man + woman = queen" in embedding space. Embeddings are the foundation for semantic search, clustering, classification, and Retrieval-Augmented Generation (RAG) pipelines.

Understanding token counts matters for both cost and context limits. GPT-4o has a 128,000-token context window, meaning your prompt and the model's response together must fit within that limit, and each API call is billed per token. The example below shows how to count tokens with the tiktoken library before making an API call.


It gives the following output:

Text: Large Language Models are trained on vast amounts of text data.
Token count: 12
Token IDs: [35353, 11688, 27972, 527, 16572, 389, 13057, 15055, 315, 1495, 828, 13]
Decoded tokens: ['Large', ' Language', ' Models', ' are', ' trained', ' on', ' vast', ' amounts', ' of', ' text', ' data', '.']

The example below shows how to generate embeddings with the OpenAI Embeddings API and measure the semantic similarity of two sentences using cosine similarity.
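A minimal sketch, assuming the openai Python package (v1+) and an OPENAI_API_KEY environment variable; the model name and the example sentences are illustrative:

```python
# Generate embeddings via the OpenAI API and compare them with cosine similarity.
import os

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def embed(texts, model="text-embedding-3-small"):
    """Return one embedding vector per input text."""
    from openai import OpenAI  # imported lazily so the helper above works without the package
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

if os.environ.get("OPENAI_API_KEY"):
    vectors = embed([
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "The stock market fell sharply today.",
    ])
    print(f"Embedding dimensions: {len(vectors[0])}")
    print(f"Similarity (cat/mat vs feline/rug): {cosine_similarity(vectors[0], vectors[1]):.4f}")
    print(f"Similarity (cat/mat vs stock market): {cosine_similarity(vectors[0], vectors[2]):.4f}")
```

The API call is guarded on the API key so the similarity helper can be reused on its own; exact similarity scores will vary slightly with the model version.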


It gives the following output:

Embedding dimensions: 1536
Similarity (cat/mat vs feline/rug): 0.8921
Similarity (cat/mat vs stock market): 0.1243

A cosine similarity score close to 1.0 means the sentences are semantically similar, while a score close to 0.0 means they are unrelated. This property is what makes embeddings powerful for building semantic search engines and RAG pipelines, where you retrieve the documents most relevant to a user's query before passing them to an LLM to generate an answer.
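That retrieval step can be sketched as ranking documents by cosine similarity to the query embedding. The three-dimensional vectors below are hand-made stand-ins for real embeddings, which would come from an embeddings API such as text-embedding-3-small:

```python
# Sketch of the retrieval step in a RAG pipeline: rank documents by
# cosine similarity to the query vector and keep the top k.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Toy "embeddings": (document text, vector); real vectors have ~1536 dimensions.
documents = [
    ("The cat sat on the mat.",        [0.9, 0.1, 0.0]),
    ("A feline rested on the rug.",    [0.8, 0.2, 0.1]),
    ("The stock market fell sharply.", [0.0, 0.1, 0.9]),
]
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of "Where is the cat?"

ranked = sorted(documents, key=lambda d: cosine_similarity(query_vector, d[1]), reverse=True)
top_k = [text for text, _ in ranked[:2]]
print(top_k)  # the two cat-related documents outrank the stock-market one
```

In a full pipeline, the top-k documents would then be inserted into the LLM prompt as context for answering the query.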
