In-Depth

See Prompts Microsoft Engineers Use for Bleeding-Edge Multimodal RAG AI Research

Everybody scrambling to get good at prompt engineering might want to take a look at a couple of examples used by Microsoft engineers doing bleeding-edge research into the hot new field of multimodal retrieval-augmented generation (RAG).

Multimodal RAG is an AI technique that retrieves and integrates information from multiple data types, like text, audio and images, to generate more comprehensive and context-aware responses from systems like large language model (LLM) constructs.

In this case, Microsoft's Industry Solutions Engineering (ISE) team is eyeing what can be done in the field of vision with multimodal RAG, so the research is for further learning, not for anything developers can get their hands on right now -- or that is likely to be baked into any products anytime soon.

"As part of a recent project, our team addressed this scenario [enterprise multimodal RAG] by following a pattern of multimodal RAG that utilizes a multimodal LLM such as GPT-4V or GPT-4o to effectively transform image content to a text format by generating detailed descriptions of each image," explained members of the team in an Oct. 11 blog post titled "Multimodal RAG with Vision: From Experimentation to Implementation."

"This conversion enables the text content and textual image descriptions to be stored in the same vector database space, which can then be retrieved and used as context for the LLM via the standard RAG pipeline flow. Our goal was to improve search precision and relevance while ensuring meaningful LLM-generated responses to user queries through detailed experimentation."

Inference Flow Architecture (source: Microsoft).

The team shared its experimentation journey of fine-tuning a multimodal RAG pipeline to best answer user queries that require both textual and image context.

The detailed post delves deep into the nitty-gritty research minutiae, but one immediately practical takeaway comes in the prompts the engineers used. Prompt engineering took off with the dawn of the Gen AI era, resulting in job ads teasing $335,000 annual salaries. While the field has cooled off, prompt engineering guidance is still springing up all over the web, some free and some for pay. It's hard to gauge the level of expertise being offered, though, so looking at real-world, cutting-edge work gives you some clues.

That guidance usually looks something like this:

  • Be Clear and Specific: A well-structured prompt should leave little room for ambiguity. Define your request clearly and explicitly, including all necessary details to guide the AI in producing relevant output.
  • Iterate and Refine: Prompt engineering often involves trial and error. Start with a basic prompt, analyze the results, and make adjustments based on the output until you get closer to what you want.
  • Use Constraints for Focused Results: Sometimes, broad prompts can lead to irrelevant or overly generalized responses. Use constraints, such as specific examples or directives, to guide the model's response and keep it on track.
  • Understand the Model's Capabilities: Know what the AI model is capable of and what it struggles with. This helps you design prompts that avoid areas of weakness while taking advantage of the model's strengths.
  • Provide Context When Necessary: For complex tasks or requests, offering relevant context improves the model's understanding. You can offer background information, specify the target audience, or mention the desired tone and format.

Let's look at the ISE team's prompts to see how they stack up against those tips.

"Our journey began with the creation of two specialized prompts: one for ingestion and another for inference," the team said. "This approach was designed to accurately extract image descriptions and enhance query responses."

Prompt for Image Enrichment
The team conducted a thorough analysis of the source images to understand their content and the typical categories of images seen in its document store. Based on this categorization, it tailored the prompt to ensure the LLM focused on the relevant information that each image type contained. "This method ensured that our system could handle a diverse range of images and deliver the precise responses required."

You are an assistant whose job is to provide the explanation of images which is going to be used to retrieve the images. Follow the below instructions:
  • If the image contains any bubble tip then explain ONLY the bubble tip.
  • If the image is an equipment diagram then explain all of the equipment and their connections in detail.
  • If the image contains a table, try to extract the information in a structured way.
  • If the image device or product, try to describe that device with all the details related to shape and text on the device.
  • If the image contains screenshot, try to explain the highlighted steps. Ignore the exact text in the image if it is just an example for the steps and focus on the steps.
  • Otherwise, explain comprehensively the most important items with all the details on the image.

"While this prompt was specifically tailored to our use case, it offers an idea of how to address different image types," the team said. "Depending on the images in the source data, you may need to adjust the prompt to ensure its effectiveness."

Prompt for Inference
This prompt for generating responses to user queries is tailored to meet customer requirements and ensure relevant and accurate answers. It instructs the LLM to return cited images, allowing the team to gather citation metrics and assess the quality of the responses.

You are a helpful AI assistant whose primary goal is to help technicians who maintain and improve the company's communication infrastructure. According to:\n Context: {context}, \n what is the answer to the \n Question:{question}. Provide step by step instructions if it is a procedural question. Do not attempt to answer if the Context provided is empty. Ask them to elaborate the question instead. The output MUST be in a json format where one attribute will be the answer and the other one will be the image_citation which are the image urls that might be useful as a reference to the user. In the context, all image urls will be in this format: (image url). Example of the output format: 'answer': 'Yes, you can replace the cable', 'image_citation': image url

"As with the ingestion prompt, this sample may need to be adjusted based on the expected responses to ensure its effectiveness," the team said.

But How Good Are Those Prompts?
We don't know, so we asked ChatGPT (GPT-4o), which gave the first prompt a grade of B-.

Strengths:
  • The prompt attempts to use conditions and constraints, which shows an understanding of how to guide the AI's behavior.
  • It provides different instructions for different scenarios, which is a positive step in managing diverse content.
Areas for Improvement:
  • Some parts are vague and open to interpretation, such as "explain comprehensively the most important items," which could lead to inconsistent output.
  • The prompt lacks precision and clarity in certain areas, which may confuse the AI.
  • There is no fallback if the model struggles to understand the image type or misclassifies it, which shows a partial understanding of the model's limitations.
Overall, while the student shows a good effort in applying some key principles of prompt engineering, further refinement is needed to make the prompt clearer, more focused, and more aligned with the model's capabilities.

It gave the second prompt a B+.

Strengths:
  • The prompt clearly defines the expected behavior for the AI and includes well-structured constraints, particularly with the required JSON output format.
  • It acknowledges key scenarios such as handling procedural questions and empty contexts, which is a good step toward managing edge cases.
Areas for Improvement:
  • Some elements, like which image URLs to include and how to handle questions with ambiguous procedural details, could be specified more clearly.
  • The prompt could provide more precise guidance for different question types or when no image URLs are available in the context.
Overall, this prompt shows a strong understanding of prompt engineering but still has room for minor improvements in clarity and how edge cases are handled.

Of course, in this education-inspired contrivance, the students could just as well be the teachers, and vice versa.

About the Author

David Ramel is an editor and writer at Converge 360.
