Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. It was developed by researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway, with a compute donation from Stability AI and training data from non-profit organizations.

Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.

The development of Stable Diffusion was funded and shaped by the start-up company Stability AI. The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich. Development was led by Patrick Esser of Runway and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion. Stability AI also credited EleutherAI and LAION (a German non-profit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project. In October 2022, Stability AI raised US$101 million in a round led by Lightspeed Venture Partners and Coatue Management.

Technology

[Image: Diagram of the latent diffusion architecture used by Stable Diffusion]
[Image: The denoising process used by Stable Diffusion]

Stable Diffusion uses a kind of diffusion model (DM) called a latent diffusion model (LDM), developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise from training images, a process that can be thought of as a sequence of denoising autoencoders. The model generates images by iteratively denoising random noise until a configured number of steps has been reached, guided by the CLIP text encoder and the attention mechanism, resulting in an image depicting the requested concept.

Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space into a lower-dimensional latent space that captures a more fundamental semantic meaning of the image. During forward diffusion, Gaussian noise is iteratively applied to this compressed latent representation. The U-Net block, built on a ResNet backbone, denoises the output of forward diffusion in reverse to recover a clean latent representation. Finally, the VAE decoder generates the output image by converting that representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality; the encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.
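To make the compression step concrete, the sketch below loads the VAE from a public Stable Diffusion checkpoint using the Hugging Face diffusers library and shows the shapes involved: a 512×512 RGB image becomes a 4-channel 64×64 latent, roughly 48 times fewer values than pixel space. The model id and the diffusers package are assumptions about the reader's environment, not part of the model itself.

```python
import torch
from diffusers import AutoencoderKL  # Hugging Face diffusers package (assumed installed)

# Load the VAE from a public SD 1.x checkpoint (the model id is an assumption;
# any Stable Diffusion 1.x repository exposes the same "vae" subfolder).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # dummy RGB image in pixel space, values in [-1, 1]
with torch.no_grad():
    # Encode: 512x512x3 pixels -> 64x64x4 latent (8x spatial downsampling).
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # SD's latent scaling factor
    print(latents.shape)  # torch.Size([1, 4, 64, 64])
    # Decode: convert the latent back to a (1, 3, 512, 512) pixel-space image.
    decoded = vae.decode(latents / 0.18215).sample
```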
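The forward-diffusion step has a convenient closed form: rather than adding noise one timestep at a time, a clean latent can be noised directly to any timestep t. A minimal sketch follows, assuming a linear variance ("beta") schedule; the exact schedule used to train Stable Diffusion is a hyperparameter, so the constants here are illustrative.

```python
import torch

# Illustrative linear beta schedule over T timesteps (the constants are assumptions).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(latents: torch.Tensor, t: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample q(x_t | x_0): jump a clean latent x_0 straight to timestep t."""
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].sqrt()          # how much of the signal survives at step t
    b = (1.0 - alphas_cumprod[t]).sqrt()  # how much noise has accumulated by step t
    return a * latents + b * noise, noise  # noisy latent, plus the noise the U-Net must predict

# At t=0 the latent is barely perturbed; at t=T-1 it is close to pure Gaussian noise.
noisy, target = add_noise(torch.randn(1, 4, 64, 64), t=500)
```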
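The cross-attention conditioning amounts to a few matrix products: the U-Net's flattened feature map supplies the queries, while the CLIP text embeddings supply the keys and values, so every spatial location can attend to every prompt token. A single-head sketch follows; the projection matrices and dimensions are illustrative assumptions (in the real model, text embeddings are projected from their own width into the U-Net's inner dimension).

```python
import torch
import torch.nn.functional as F

def cross_attention(image_features, text_embeddings, W_q, W_k, W_v):
    """Single-head cross-attention: spatial locations attend to prompt tokens.

    image_features:  (batch, n_pixels, d) -- flattened U-Net feature map
    text_embeddings: (batch, n_tokens, d) -- output of the CLIP text encoder
    """
    q = image_features @ W_q             # queries come from the image side
    k = text_embeddings @ W_k            # keys and values come from the text side
    v = text_embeddings @ W_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # scaled dot-product
    return F.softmax(scores, dim=-1) @ v  # text-conditioned image features

# Illustrative dimensions: a 64x64 latent flattened to 4096 positions, 77 prompt tokens.
d = 320
x = torch.randn(1, 64 * 64, d)
ctx = torch.randn(1, 77, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = cross_attention(x, ctx, W_q, W_k, W_v)  # shape (1, 4096, 320)
```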
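Putting the pieces together, the publicly released weights can be run end to end in a few lines. The snippet below uses the Hugging Face diffusers pipeline; the package, model id, and step count are assumptions about the setup rather than part of the model itself.

```python
import torch
from diffusers import StableDiffusionPipeline  # Hugging Face diffusers (assumed installed)

# Model id is an assumption; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # half precision fits within the ~8 GB of VRAM mentioned above

# Starts from random latent noise and iteratively denoises it for 50 steps,
# with the prompt injected via CLIP embeddings and cross-attention at each step.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
).images[0]
image.save("astronaut.png")
```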