I think the AI community has underestimated the value of large context windows and fully multimodal AI (it can see video as well as documents and text) as a solution to many real-world AI problems. I find that, when working inside the context window, even a million tokens' worth, the AI both reasons very well and has very low rates of hallucinations. And an AI that can see enables entirely different modalities for using AI systems. Here, I give Gemini 1.5 a video of my screen (it would be trivial for it to watch live, of course), and it accurately understands what I am doing and what I could do better. Gemini Pro 1.5 feels like working with GPT-4 after using GPT-3.5. The underlying model still isn't "smart" enough to do everything you want, but the added context window and the ability to hold entire videos or folders of documents just makes the experience feel superhuman. AI, for better or worse, as manager and advisor. Superclippy for real.
Gemini API Features
-
Ok - here is a full technical breakdown of what we know about Gemini:
* There are 3 model sizes: Ultra, Pro, Nano. The only disclosed sizes are for Nano: 1.8B & 3.25B. The info is not particularly useful because we could have bounded the size anyway, given it runs on Pixel.
* Ultra follows Chinchilla scaling laws - the idea is to get the best possible performance for a given compute budget. Inference cost was not a concern, PR was; you want bold numbers. The smaller models are all heavily in the data-saturation regime.
* Gemini is natively multimodal - i.e. trained from scratch on different modalities. Compare that with Flamingo: step 1) train an LLM (Chinchilla), step 2) train a vision encoder using contrastive pre-training, step 3) freeze the backbones and train the system end-to-end. Input: text, audio, image, video. Output: text, image (a big advantage compared to GPT-4V).
Multimodal demos:
* https://lnkd.in/dJZTkB79
* https://lnkd.in/dEBq-aMh
* Needless to say, the model is massively multilingual as well.
---COMPUTE---
I'll just leave you with this excerpt from the paper: "but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods."
Parallelism across TPUs - sweet - how about parallelism across data centers :)
---EVALS---
* Gemini Ultra’s performance exceeds current SOTA results on 30 of the 32 academic benchmarks. BUT, IMPORTANT NOTE: it's not clear how this translates into actual performance given data contamination and, in general, the eval issues with LLMs.
* They report 90% on MMLU, which is better than human experts and better than GPT-4, but again it seems the eval methodology was not the same (the number of shots + CoT look to be different).
* They report better results than GPT-4V on the new MMMU (multimodal) benchmark.
In general I wouldn't give the eval numbers too much attention, because it's not clear whether the comparison between GPT-4 & Gemini is fair - I just want to play with the model. :)
---MISC UPDATES---
* They also share AlphaCode 2, which is estimated to perform better than 85% of Codeforces competitive-programming participants (compared to 50% for the original AlphaCode). It leveraged Gemini Pro to get to these results.
* They introduce TPU v5p with 2x more FLOPS and 3x more HBM. A single pod is composed of 8,960 chips!
* Gemini is already powering many Google products (a fine-tuned version of Pro is already in Bard, Nano is running on Pixel).
* On December 13th, Pro will be accessible through an API!
Each one of these could be an update in its own right - I simply dislike Google's shipping strategy. It's a firehose method, and if we're talking about safe deployment, it's much better to gradually give people access to these systems - à la OpenAI. Just my 2 cents.
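To make the Chinchilla point above concrete: Hoffmann et al. (2022) put the compute-optimal ratio at roughly 20 training tokens per parameter, and training smaller models well past that point is what "data saturation" refers to. The sketch below just applies that rule of thumb to the disclosed Nano sizes; the helper name and the 20:1 constant are illustrative assumptions, not anything taken from the Gemini report.

```python
# Rough illustration of the Chinchilla rule of thumb (~20 training tokens per
# parameter, Hoffmann et al. 2022). Only the Nano sizes are disclosed; the
# helper name and constant here are illustrative, not from the Gemini report.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count for a dense model."""
    return n_params * tokens_per_param

for name, n_params in [("Nano-1", 1.8e9), ("Nano-2", 3.25e9)]:
    optimal = chinchilla_optimal_tokens(n_params)
    print(f"{name}: ~{optimal / 1e9:.0f}B tokens at the compute-optimal point; "
          "training on more than this is the data-saturation regime the post mentions.")
```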
-
The release of Google's Gemini Pro 1.5 is, IMO, the biggest piece of A.I. news yet this year. The LLM has a gigantic million-token context window, multimodal inputs (text, code, image, audio, video) and GPT-4-like capabilities despite being much smaller and faster.
Key Features
1. Despite being a mid-size model (so much faster and cheaper), its capabilities rival the full-size models Gemini Ultra 1.0 and GPT-4, which are the two most capable LLMs available today.
2. At a million tokens, its context window demolishes Claude 2, the foundation LLM with the next-longest context window (Claude 2's is only a fifth of the size at 200k). A million tokens corresponds to 700,000 words (seven lengthy novels) and Gemini Pro 1.5 accurately retrieves needles from this vast haystack 99% of the time!
3. Accepts text, code, images, audio (a million tokens corresponds to 11 hours of audio), and video (1M tokens = an hour of video). Today's episode contains an example of Gemini Pro 1.5 answering my questions about a 54-minute-long video with astounding accuracy and grace.
How did Google pull this off?
• Gemini Pro 1.5 is a Mixture-of-Experts (MoE) architecture, routing your input to specialized submodels (e.g., one for math, one for code, etc.), depending on the broad topic of your input. This allows for focused processing and explains both the speed gains and the high capability level despite being a mid-size model.
• While OpenAI also uses the MoE approach in GPT-4, Google seems to have achieved greater efficiency with the approach. This edge may stem from Google's pioneering work on MoE (Google were the first to publish on it, way back in 2017) and their resultant deep in-house expertise on the topic.
• Training-data quality is also a likely factor in Google's success.
What's next?
• Google has 10-million-token context windows in testing. That order-of-magnitude jump would correspond to future Gemini releases being able to handle ~70 novels, 100 hours of audio or 10 hours of video.
• If Gemini Pro 1.5 can achieve GPT-4-like capabilities, the Gemini Ultra 1.5 release I imagine is in the works may allow Google to leapfrog OpenAI and reclaim their crown as the world's undisputed A.I. champions (unless OpenAI gets GPT-5 out first)!
Want access?
• Gemini Pro 1.5 is available with a 128k context window through Google AI Studio and (for enterprise customers) through Google Cloud's Vertex AI.
• There's a waitlist for access to the million-token version (I had access through the early-tester program).
Check out today's episode (#762) for more detail on all of the above (including Gemini 1.5 Pro access/waitlist links). The Super Data Science Podcast is available on all major podcasting platforms and a video version is on YouTube. #superdatascience #machinelearning #ai #llms #geminipro #geminiultra
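Since the MoE description above is high-level (and Google has not published Gemini 1.5 Pro's internals), here is a minimal sketch of a generic top-k gated mixture-of-experts layer in numpy. The expert count, dimensions, and top_k value are made up; the point is simply that only the selected experts' parameters run for each token, which is where the speed/capability trade-off comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2          # hypothetical sizes, for illustration only

# One weight matrix per expert plus a router; in a real MoE these are learned.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                               # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                      # softmax over the chosen experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])         # only k of n experts run per token
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)                        # (4, 64)
```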
-
I was fortunate to receive an invitation for early access to Google's new Gemini 1.5 Pro model, which boasts a 1 million token context window. If you want to experiment with it, here are a few things you need to know to get started. It was released yesterday to the public in a low-key announcement primarily aimed at developers.
1. You can access it in AI Studio. (Link in comments)
2. AI Studio is free.
3. In AI Studio, the interface doesn't natively save your chat history. (It is designed for developers to test prompts in different ways with models.) However, you can save your prompts to a library. (Note: Officially, it doesn't save chat history... but I have noticed my last few saved prompts include the chat history, so I hope that is a newly upgraded feature, since they are improving it continuously.)
4. You can test prompts with different models in three ways: a chat interface, freeform prompts, and structured prompts. You can learn how each type works using their tutorials.
5. With the Gemini 1.5 Pro model, you can, for the first time, upload video to an LLM as an input 🤯
6. The video, however, does not have an audio modality - for now. Technically, the AI is ingesting the video frame by frame as stills, but it can read timestamps in the video.
7. For any response, you can use the "get code" button to get the equivalent API code instead of text, which you can copy and paste (see the sketch below).
8. Expect responses (especially with video inputs) to take a bit longer than you are used to with smaller-context text-only or text-plus-image inputs.
This early peek at Gemini 1.5 Pro is mind-blowing, especially considering it is still in its most primitive state. Iterative releases will only improve it from here. Using it over these last few weeks has already changed my perspective on much of the progress made in AI in the past several years. I will share more of my thoughts about that soon, but for now, I wanted to share the tips on access and how to use it so that you can also get a peek into it and try it out over the weekend. #ai #google #gemini
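As a rough idea of what the "get code" button in point 7 hands you, here is a minimal sketch assuming the google-generativeai Python SDK; the model identifier and prompt are placeholders, so copy the exact values AI Studio generates for you.

```python
# Minimal sketch of calling the model outside AI Studio, assuming the
# google-generativeai Python SDK (pip install google-generativeai).
# The model identifier and prompt below are placeholders.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content("Summarize the key points of this transcript: ...")
print(response.text)
```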
-
Really excited about Gemini 1.5 Pro for... you guessed it! #accessibility! Here's me checking whether a code snippet meets a specific success criterion, and the response nails it! It's helpful, accurate, and to the point. It's also easy to read (succinct, with a clear structure). Overall great, and a big improvement over the other models I've used so far. Bonus: with a 1 million token context window, I'm able to add the entire set of Web Content Accessibility Guidelines with failures, techniques, and understanding documents, and I'm getting improved responses as a result. This will likely help with more open-ended prompts, like "Which success criteria does this fail?" Lots more to discover there. Super excited for this to roll out to the community. In the meantime, let me know if you have ideas you want me to try! (HT Leonie Watson's post on the TetraLogical blog for prompt inspiration)
---
Prompt (including all of W3C WCAG):
Does this meet WCAG 2.2 Level AA SC 2.5.3, which states: For user interface components with labels that include text or images of text, the name contains the text that is presented visually.
<p>It was reported today that a dog ate my homework.
<a href="news.html" aria-label="News">Read more...</a>
</p>
---
Output:
No, the provided code snippet does not meet WCAG 2.2 Level AA SC 2.5.3. Here's why:
1. The link contains text ("Read more...") that is presented visually.
2. However, the `aria-label` attribute overrides the visible text and sets the accessible name to "News".
This mismatch between the visual label and the accessible name violates SC 2.5.3.
To fix this, you have two options:
Option 1: Remove the `aria-label` attribute. This will allow the accessible name to default to the visible text "Read more...".
Option 2: Modify the `aria-label` attribute to include the visible text. For example, you could use `aria-label="Read more about news"`. This ensures that the accessible name contains the text presented visually.
Remember, the goal is to ensure consistency between what sighted users see and what users of assistive technologies like screen readers hear or read. This consistency is crucial for efficient navigation and understanding of the content.
---
#WCAG #Gemini #A11y #AI
Image shows a screenshot of Google AI Studio showing all of WCAG loaded into the prompt!
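For the same SC 2.5.3 pattern, a deterministic check is also easy to script. The sketch below is a toy illustration assuming BeautifulSoup; it only handles the simple case where the accessible name comes from an aria-label, unlike the full accessible-name computation (or the LLM-based check above).

```python
# Toy SC 2.5.3 check (visible label text should appear in the accessible name),
# assuming BeautifulSoup and covering only the simple aria-label case.
from bs4 import BeautifulSoup

html = '''<p>It was reported today that a dog ate my homework.
<a href="news.html" aria-label="News">Read more...</a></p>'''

for link in BeautifulSoup(html, "html.parser").find_all("a"):
    visible = link.get_text(strip=True).rstrip(".").lower()
    name = (link.get("aria-label") or link.get_text(strip=True)).lower()
    if visible and visible not in name:
        print(f'Possible SC 2.5.3 failure: visible "{visible}" not in accessible name "{name}"')
```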
-
For those of us who have been building production applications with LLMs, context window capacity (and accuracy!) has been a constant limiting factor. Gemini 1.5 introduces a state-of-the-art 1M+ token context window AND achieves near-perfect "needle" recall (>99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. So absolutely excited for all the potential this unlocks. Technical report: https://lnkd.in/eViEvdzF
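For anyone curious what a "needle in a haystack" probe looks like mechanically, here is a minimal text-only sketch assuming the google-generativeai Python SDK; the filler text, needle sentence, and pass/fail check are invented for illustration, and the real evaluation sweeps needle position and context length across modalities.

```python
# Minimal text-only needle-in-a-haystack probe (illustrative only; the real
# evaluation varies needle depth and context length across modalities).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro-latest")

filler = "The grass is green. The sky is blue. " * 20000   # long distractor haystack
needle = "The magic number Arthur mentioned is 7421."       # hypothetical needle
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

response = model.generate_content(
    haystack + "\n\nWhat is the magic number Arthur mentioned? Answer with the number only."
)
print("retrieved" if "7421" in response.text else "missed")
```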
-
So, Gemini 1.5 has recently been released, but what's new and different about it? Gemini 1.5 Pro is an advanced Transformer-based model using a sparse mixture-of-experts (MoE) approach, building on the multimodal capabilities of its predecessor, Gemini 1.0. It incorporates extensive MoE and language-model research, allowing it to efficiently handle inputs by activating only relevant parameters. Gemini 1.5 Pro demonstrates significant advancements in multimodal understanding and computational efficiency. Below are the key features that you need to know about:
• 𝗘𝘅𝘁𝗲𝗻𝗱𝗲𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗟𝗲𝗻𝗴𝘁𝗵: Can understand inputs of up to 10 million tokens, significantly more than its predecessors, enabling processing of almost a day of audio, large codebases, or extended video content.
• 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗖𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀: Natively supports and interleaves data from different modalities (audio, visual, text, code) in the same input sequence.
• 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝗮𝗻𝗱 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Achieves comparable or superior quality to previous models like Gemini 1.0 Ultra, with significantly less training compute and enhanced serving efficiency.
So when should you use it? Gemini 1.5 Pro excels at processing and understanding complex multimodal data sets over extended contexts. This makes it ideal for applications requiring deep contextual analysis and the integration of diverse data types, such as advanced natural language understanding, multimodal content creation and analysis, real-time translation and transcription, large-scale data analysis, and interactive AI systems. Its efficiency and performance in these areas stem from significant improvements in architecture, data handling, and computational efficiency. Paper: https://lnkd.in/eQbbBQdB
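One generic way to write down the sparse-MoE idea of "activating only relevant parameters" (not necessarily Gemini's exact formulation): a router g scores the experts E_i for each token x, and only the top-k experts are evaluated and mixed.

```latex
% Generic top-k sparse mixture-of-experts layer; E_i are expert networks,
% g(x) are router logits, and only the k selected experts are evaluated.
y(x) = \sum_{i \in \mathcal{K}(x)} \frac{e^{g_i(x)}}{\sum_{j \in \mathcal{K}(x)} e^{g_j(x)}}\, E_i(x),
\qquad \mathcal{K}(x) = \operatorname{TopK}\bigl(g(x),\, k\bigr)
```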
-
Re: Gemini 1.5’s 1M token window, I saw Mike Kaput test it on the Marketing AI Podcast with Paul Roetzer using a 500-page (possibly boring) government document. It worked so well I needed to try it. This week Gemini expanded its capabilities to ingest video, so I started thinking of ways to experiment. My test: summarizing a movie from the public domain - Night of the Living Dead. First, I could only get about 40 minutes of the movie in - a little over 715,000 tokens. Gemini 1.5 Pro seemed to balk at higher amounts even though the token window should have been able to handle it. Still, it did an excellent job of summarizing the video. Really excellent.
The use cases:
1. If you have a large repository of videos from webinars, you could summarize them for better engagement.
2. If you are producing video content today, this gives you another and better path for doing summaries. Sure, recorders can get a transcript and build a summary from the transcript, but so far, I’ve found these summaries to be just okay.
The irony is that Gemini will let you do this with a video you upload, but not by pointing it at a YouTube video. Once this happens, think of the applications. Also, I think Gemini 1.5 Pro seems to handle transcript summaries better than the other LLMs, even though the token count is low.
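For anyone who wants to reproduce the upload-and-summarize experiment above, this is roughly the flow I would expect, assuming the google-generativeai File API (upload_file / get_file); the file path, model identifier, and prompt are placeholders.

```python
# Sketch of the video-summarization flow described above, assuming the
# google-generativeai File API; file path, model name, and prompt are placeholders.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

video = genai.upload_file(path="night_of_the_living_dead_part1.mp4")
while video.state.name == "PROCESSING":        # wait for server-side processing
    time.sleep(10)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content([video, "Summarize this film segment scene by scene."])
print(response.text)
```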
-
Finally read the tech report on Gemini, #Google’s most capable LLM. Here are some of the interesting details that are often overlooked:
- Gemini models are multimodal by design and really shine in various multimodal capabilities. Not only can they directly consume text, audio, images, and video, but they can also directly generate images. The provided examples do look good. I wonder whether at some point such multimodal models might become the backbone or starting point for more narrow image-generation systems.
- The on-device model is trained with all the best practices, like distillation and 4-bit quantization. It is also trained on significantly more tokens, since it requires much less compute. If done right, this model should be very capable for its inference cost.
- Gemini’s speech recognition is already as good as the Whisper models (large-v3), which is a big deal considering it's not specifically trained for it. Again, this highlights the multimodal capabilities of the model.
- Training at data-center scale takes a lot of systems-engineering wizardry beyond just machine-learning knowledge. Rare errors that we usually ignore in normal training become frequent and must be handled gracefully.
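The report does not describe Nano's actual quantization scheme, so purely to illustrate what "4-bit quantization" means, here is a generic symmetric int4 weight round-trip in numpy; the matrix size and scaling choice are arbitrary.

```python
import numpy as np

# Generic symmetric 4-bit weight quantization round-trip (illustrative only;
# the Gemini report does not describe Nano's actual quantization scheme).
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)       # a toy weight matrix

scale = np.abs(w).max() / 7.0                             # int4 range is [-8, 7]
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit values stored in int8
w_hat = q.astype(np.float32) * scale                      # dequantized weights

print("max abs error:", np.abs(w - w_hat).max())
```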
-
Google Unveils Gemini: A Multimodal AI Model with Human-like Performance
Google Research has unveiled Gemini, a family of multimodal AI models that demonstrate human-level performance across diverse tasks. Boasting capabilities in image, audio, video, and text domains, Gemini represents a significant advancement in the field of artificial intelligence.
Key Highlights:
Human-Expert Performance: Gemini Ultra, the most advanced model, surpasses human experts on the MMLU benchmark (which spans 57 subjects), achieving a score above 90%.
Multimodal Reasoning: Gemini excels at tasks requiring both understanding and reasoning across different modalities. It can solve math problems from handwritten notes, analyze charts and generate tables, and even answer questions about video content.
State-of-the-Art Benchmarks: Gemini sets new state-of-the-art results on 30 out of 32 benchmarks, including text, image, video, and speech understanding tasks.
Democratizing Access: Available in various sizes, Gemini caters to different needs. Nano models are designed for on-device usage, Pro models are ideal for data centers, and the Ultra model tackles highly complex tasks.
Responsible Development: Google emphasizes responsible deployment, addressing potential bias and harmful outputs through careful fine-tuning and instruction tuning.
Applications:
Education: Gemini's capabilities offer immense potential in education, providing personalized learning experiences and assisting students with complex concepts.
Science & Research: Gemini can accelerate scientific discovery by analyzing vast data sets and generating insights across disciplines.
Productivity & Creativity: Gemini can empower users through intelligent assistance in tasks like writing, coding, and problem-solving.
Accessibility: Gemini's ability to process diverse modalities makes it a valuable tool for individuals with disabilities.
Availability:
As of today, Gemini Pro powers Bard, Google's AI-powered chatbot. On December 13th, developers can access Gemini Pro through APIs. Android users will have access to the Nano model on Pixel 8 Pro devices. Bard Advanced, powered by Gemini Ultra, will launch early next year.
https://lnkd.in/gptk-K88
This groundbreaking technology marks a significant leap forward in AI, paving the way for a future where machines can collaborate with humans and solve problems in ways that were once unimaginable.