Reference for ultralytics/nn/text_model.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/nn/text_model.py. If you spot a problem please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.nn.text_model.TextModel
TextModel()
Bases: Module
Abstract base class for text encoding models.
This class defines the interface for text encoding models used in vision-language tasks. Subclasses must implement the tokenize and encode_text methods.
Methods:
Name | Description |
---|---|
tokenize | Convert input texts to tokens. |
encode_text | Encode tokenized texts into feature vectors. |
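A rough sketch of what a concrete subclass looks like: the toy encoder below implements the two abstract methods with a hash-based whitespace tokenizer and a random embedding table. The class and all of its internals are illustrative only and are not part of Ultralytics.

```python
import torch
from ultralytics.nn.text_model import TextModel


class ToyTextModel(TextModel):
    """Toy TextModel: whitespace tokenizer plus a random embedding table (illustration only)."""

    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = torch.nn.Embedding(vocab_size, dim)

    def tokenize(self, texts):
        texts = [texts] if isinstance(texts, str) else texts
        ids = [[hash(w) % self.vocab_size for w in t.split()] for t in texts]
        width = max(len(i) for i in ids)  # pad every prompt to the longest one
        return torch.tensor([i + [0] * (width - len(i)) for i in ids])

    def encode_text(self, texts, dtype=torch.float32):
        feats = self.embed(texts).mean(dim=1).to(dtype)  # mean-pool token embeddings
        return feats / feats.norm(p=2, dim=-1, keepdim=True)  # unit-normalize, as the interface expects


model = ToyTextModel()
tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
print(model.encode_text(tokens).shape)  # torch.Size([2, 64])
```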
encode_text
abstractmethod
encode_text(texts, dtype)
Encode tokenized texts into normalized feature vectors.
tokenize
abstractmethod
tokenize(texts)
Convert input texts to tokens for model processing.
ultralytics.nn.text_model.CLIP
CLIP(size, device)
Bases: TextModel
Implements OpenAI's CLIP (Contrastive Language-Image Pre-training) text encoder.
This class provides a text encoder based on OpenAI's CLIP model, which can convert text into feature vectors that are aligned with corresponding image features in a shared embedding space.
Attributes:
Name | Type | Description |
---|---|---|
model | CLIP | The loaded CLIP model. |
device | device | Device where the model is loaded. |
Methods:
Name | Description |
---|---|
tokenize | Convert input texts to CLIP tokens. |
encode_text | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> from ultralytics.nn.text_model import CLIP
>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> clip_model = CLIP(size="ViT-B/32", device=device)
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> text_features = clip_model.encode_text(tokens)
>>> print(text_features.shape)
This class implements the TextModel interface using OpenAI's CLIP model for text encoding. It loads a pre-trained CLIP model of the specified size and prepares it for text encoding tasks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
size | str | Model size identifier (e.g., 'ViT-B/32'). | required |
device | device | Device to load the model on. | required |
Examples:
>>> import torch
>>> from ultralytics.nn.text_model import CLIP
>>> clip_model = CLIP("ViT-B/32", device=torch.device("cuda:0"))
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> text_features = clip_model.encode_text(tokens)
encode_text
encode_text(texts, dtype=torch.float32)
Encode tokenized texts into normalized feature vectors.
This method processes tokenized text inputs through the CLIP model to generate feature vectors, which are then normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | Tensor | Tokenized text inputs, typically created using the tokenize() method. | required |
dtype | dtype | Data type for output features. | torch.float32 |
Returns:
Type | Description |
---|---|
Tensor | Normalized text feature vectors with unit length (L2 norm = 1). |
Examples:
>>> clip_model = CLIP("ViT-B/32", device="cuda")
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = clip_model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512])
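Because the returned features are unit length, prompt-to-prompt cosine similarity reduces to a plain matrix product. A minimal sketch, assuming the CLIP dependency is installed and the ViT-B/32 weights can be downloaded:

```python
import torch
from ultralytics.nn.text_model import CLIP

clip_model = CLIP("ViT-B/32", device=torch.device("cpu"))
tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog", "a diagram of an engine"])
features = clip_model.encode_text(tokens)  # (3, 512); each row has unit L2 norm
similarity = features @ features.T  # cosine similarities, since the rows are normalized
print(similarity)  # the two photo prompts should score closer to each other than to the diagram
```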
tokenize
tokenize(texts)
Convert input texts to CLIP tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | str or List[str] | Input text or list of texts to tokenize. | required |
Returns:
Type | Description |
---|---|
Tensor | Tokenized text tensor with shape (batch_size, context_length) ready for model processing. |
Examples:
>>> model = CLIP("ViT-B/32", device="cpu")
>>> tokens = model.tokenize("a photo of a cat")
>>> print(tokens.shape) # torch.Size([1, 77])
ultralytics.nn.text_model.MobileCLIP
MobileCLIP(size, device)
Bases: TextModel
Implements Apple's MobileCLIP text encoder for efficient text encoding.
This class implements the TextModel interface using Apple's MobileCLIP model, providing efficient text encoding capabilities for vision-language tasks.
Attributes:
Name | Type | Description |
---|---|---|
model | MobileCLIP | The loaded MobileCLIP model. |
tokenizer | callable | Tokenizer function for processing text inputs. |
device | device | Device where the model is loaded. |
config_size_map | dict | Mapping from size identifiers to model configuration names. |
Methods:
Name | Description |
---|---|
tokenize | Convert input texts to MobileCLIP tokens. |
encode_text | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIP(size="s0", device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)
This class implements the TextModel interface using Apple's MobileCLIP model for efficient text encoding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
size | str | Model size identifier (e.g., 's0', 's1', 's2', 'b', 'blt'). | required |
device | device | Device to load the model on. | required |
Examples:
>>> from ultralytics.nn.text_model import MobileCLIP
>>> import torch
>>> model = MobileCLIP("s0", device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
encode_text
encode_text(texts, dtype=torch.float32)
Encode tokenized texts into normalized feature vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | Tensor | Tokenized text inputs. | required |
dtype | dtype | Data type for output features. | torch.float32 |
Returns:
Type | Description |
---|---|
Tensor | Normalized text feature vectors with L2 normalization applied. |
Examples:
>>> model = MobileCLIP("s0", device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512]) # Actual dimension depends on model size
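The dtype argument controls the precision of the returned features, which can roughly halve memory use on GPU. A minimal sketch, assuming the MobileCLIP "s0" checkpoint can be downloaded on first use:

```python
import torch
from ultralytics.nn.text_model import MobileCLIP

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MobileCLIP("s0", device=device)
tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
# Half precision is generally only worthwhile on GPU; keep float32 on CPU.
dtype = torch.float16 if device.type == "cuda" else torch.float32
features = model.encode_text(tokens, dtype=dtype)
print(features.dtype, features.shape)
```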
tokenize
tokenize(texts)
Convert input texts to MobileCLIP tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | list[str] | List of text strings to tokenize. | required |
Returns:
Type | Description |
---|---|
Tensor | Tokenized text inputs with shape (batch_size, sequence_length). |
Examples:
>>> model = MobileCLIP("s0", "cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
ultralytics.nn.text_model.MobileCLIPTS
MobileCLIPTS(device)
Bases: TextModel
Loads a TorchScript-traced version of the MobileCLIP text encoder.
This class implements the TextModel interface using Apple's MobileCLIP model, providing efficient text encoding capabilities for vision-language tasks.
Attributes:
Name | Type | Description |
---|---|---|
encoder | MobileCLIP | The loaded MobileCLIP text encoder. |
tokenizer | callable | Tokenizer function for processing text inputs. |
device | device | Device where the model is loaded. |
Methods:
Name | Description |
---|---|
tokenize | Convert input texts to MobileCLIP tokens. |
encode_text | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIPTS(device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)
This class implements the TextModel interface using Apple's MobileCLIP model for efficient text encoding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device | Device to load the model on. | required |
Examples:
>>> from ultralytics.nn.text_model import MobileCLIPTS
>>> import torch
>>> model = MobileCLIPTS(device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
encode_text
encode_text(texts, dtype=torch.float32)
Encode tokenized texts into normalized feature vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | Tensor | Tokenized text inputs. | required |
dtype | dtype | Data type for output features. | torch.float32 |
Returns:
Type | Description |
---|---|
Tensor | Normalized text feature vectors with L2 normalization applied. |
Examples:
>>> model = MobileCLIPTS(device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512]) # Actual dimension depends on model size
tokenize
tokenize(texts)
Convert input texts to MobileCLIP tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | list[str] | List of text strings to tokenize. | required |
Returns:
Type | Description |
---|---|
Tensor | Tokenized text inputs with shape (batch_size, sequence_length). |
Examples:
>>> model = MobileCLIP("cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
ultralytics.nn.text_model.build_text_model
build_text_model(variant, device=None)
Build a text encoding model based on the specified variant.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
variant | str | Model variant in format "base:size" (e.g., "clip:ViT-B/32" or "mobileclip:s0"). | required |
device | device | Device to load the model on. | None |
Returns:
Type | Description |
---|---|
TextModel | Instantiated text encoding model. |
Examples:
>>> model = build_text_model("clip:ViT-B/32", device=torch.device("cuda"))
>>> model = build_text_model("mobileclip:s0", device=torch.device("cpu"))