This blog post summarizes the key insights from MotherDuck’s recent webinar, “4 Lightning Talks on Practical AI Workflows from Notion, 1Password, MotherDuck & Evidence”. The webinar features lightning talks and a panel discussion on the evolution of data warehousing and on practical applications of AI – especially GenAI – in the data engineering landscape.
Introducing MotherDuck: Data Warehousing for the Post-Big-Data Era
MotherDuck positions itself as a collaborative cloud data warehouse for the modern era, moving beyond the traditional “big data” paradigm. It focuses on fast performance and efficient pricing, enabling users to build interactive data applications. MotherDuck also supports hybrid queries that span local and remote datasets, so queries can run even from a laptop (a sketch of such a query follows the list below).
Leveraging Modern Hardware: Unlike systems designed in the 2000s, when computing power was far more limited, MotherDuck takes advantage of the abundant processing power and memory available in today’s machines and cloud instances. For instance, a top laptop in 2006 had a single core and 2 GB of RAM, whereas modern laptops boast 40 cores and 36 GB of RAM, and cloud instances offer even more compute at relatively affordable prices.
Built on DuckDB: MotherDuck is built on top of DuckDB, an open-source, in-process analytical database known for its research-driven design and high performance.
Focus on Accessibility: MotherDuck aims to make data warehousing and analytics accessible to businesses, emphasizing a serverless architecture and a fast path to value.
Smaller Analytical Datasets: The webinar highlighted that for many analytical use cases, the actual data volume is smaller than what was initially projected in the “big data” era.
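To make the hybrid-query idea concrete, here is a minimal sketch, assuming a MotherDuck account, a hypothetical remote table my_db.main.orders, and a hypothetical local Parquet file of recent events; none of these names come from the webinar.

```sql
-- Run from the DuckDB CLI on a laptop.
ATTACH 'md:';  -- connect to MotherDuck (reads the motherduck_token env var)

SELECT o.order_id, o.customer_id, e.event_type
FROM my_db.main.orders AS o                      -- remote table in MotherDuck
JOIN read_parquet('local_events.parquet') AS e   -- local file on the laptop
  ON o.order_id = e.order_id
WHERE e.event_time > now() - INTERVAL 7 DAY;
```

MotherDuck’s dual execution splits the work between the laptop and the cloud, which is what makes this laptop-plus-warehouse setup practical.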
#1: Boosting App Development Productivity with Cursor
Archie from Evidence shared his experience using Cursor, an AI-powered code editor, to enhance productivity in data app development.
Enhanced Context for LLMs: Cursor, built on VS Code, provides Large Language Models (LLMs) with more context from the codebase compared to chat-based interfaces.
Instead of requiring the user to repeatedly supply context about the code, Cursor has inherent knowledge of the entire project.
Seamless Integration with BI Tools: Archie demonstrated using Cursor with Evidence, a BI tool that allows version control for reports written in markdown with special components.
He showcased how Cursor could generate a new page about orders within the Evidence project with a single natural language prompt.
The tool also uses a diff-style visualization to highlight the changes made, allowing for easy review.
Natural Language Code Manipulation: Cursor enables users to modify code using natural language commands.
Archie illustrated this by asking Cursor to add a specific column (“iso_code”) to a chart, which the tool successfully implemented.
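As a rough illustration of the kind of edit described above, here is a hypothetical SQL query of the sort that lives inside an Evidence markdown page; the orders table, its columns, and the iso_code field are assumptions for illustration, not taken from the demo.

```sql
-- Hypothetical query block from an Evidence page; Evidence embeds named SQL
-- blocks in markdown and renders their results in charts and tables.
SELECT
    country,
    iso_code,             -- the column Cursor was asked to add to the chart
    count(*)    AS order_count,
    sum(amount) AS total_revenue
FROM orders
GROUP BY country, iso_code
ORDER BY total_revenue DESC;
```

Because Evidence reports are plain text, an AI editor like Cursor can propose such a change as a reviewable diff rather than mutating a dashboard through a UI.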
#2: Smarter CRM Data Cleaning with LLMs and Iterative Prompting
Nate from 1Password discussed how Large Language Models (LLMs) can be used within SQL to improve CRM data cleaning and enrichment.
Addressing CRM Hygiene Challenges: Go-to-market teams often struggle with CRM data quality, particularly with fields like industry information. Traditionally, this has been a manual and time-consuming process.
Nate recalled a previous experience where the team manually updated account information in the CRM.
Leveraging LLMs in Snowflake: Inspired by the use of LLMs in SQL with MotherDuck, Nate experimented with this capability within Snowflake.
Defining Industry Boundaries: To ensure relevant and usable industry classifications, it’s crucial to provide the LLM with a predefined set of industries.
Prompt Engineering for Accuracy: Initial attempts to extract industry information resulted in verbose responses that were difficult to parse. Prompt engineering was necessary to get the LLM to return concise, one-word or phrase answers.
Incorporating Industry Descriptions: The key to achieving high accuracy was to include industry descriptions in the prompt, providing the LLM with clear definitions.
For example, defining “hospitality” and “construction materials” helped the LLM correctly classify accounts.
Iterative Improvement: The ability to quickly iterate and refine prompts allowed for significant improvements in the accuracy of industry classification.
Practical Application and Results: Nate shared an example query and results from a test dataset, demonstrating the potential for LLMs to accurately identify industries based on company names and other CRM data. While not perfect, the accuracy was high, especially for companies with more online presence.
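A minimal sketch in the spirit of this approach, using Snowflake Cortex’s COMPLETE function, is shown below; the crm_accounts table, its columns, the model name, and the industry list are illustrative assumptions, not the exact query from the talk.

```sql
SELECT
    account_name,
    SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'Classify the company below into exactly one of these industries. '
        || 'Reply with the industry name only, no explanation. '
        || 'Industries: '
        || 'Hospitality (hotels, restaurants, travel and leisure); '
        || 'Construction Materials (cement, lumber, building supplies); '
        || 'Software (SaaS and packaged software vendors). '
        || 'Company: ' || account_name
    ) AS predicted_industry
FROM crm_accounts
LIMIT 100;  -- iterate on a sample before classifying the full table
```

The inline industry descriptions mirror the lesson from the talk: giving the model explicit definitions, and constraining it to a closed list, is what turns verbose free-text output into values a CRM field can actually store.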
#3: Auto-generating Data Catalog Descriptions with GenAI Tools
Evelyn from Notion presented their strategy for using LLMs to automatically generate table and column descriptions for their data catalog.
The Problem of Incomplete Metadata: A common challenge with data catalogs is the lack of comprehensive metadata, making them less valuable for users. Empty table and column descriptions hinder self-service data exploration.
LLMs for Metadata Automation: LLMs can automate the tedious process of filling out metadata, making data catalogs more useful.
Providing Context to the LLM: The quality of generated descriptions depends on the context provided to the LLM. Notion provides the LLM with SQL definitions, upstream definitions (JSON schema), internal documentation, and data types.
Leveraging Upstream Descriptions: Notion ingests descriptions generated for upstream tables so that column descriptions stay consistent across related tables.
Importance of Review and Feedback: To ensure accuracy, all generated descriptions undergo human review. An automated process tags table owners for review, allowing them to suggest changes or provide feedback to the LLM for regeneration.
Real-World Examples: Evelyn shared examples of how the LLM at Notion successfully generated clear and accurate descriptions for cryptic column names and complex tables.
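Notion’s pipeline runs through its own tooling, but the core idea can be sketched in SQL using MotherDuck’s prompt() function; the catalog_columns table and its columns here are purely illustrative assumptions.

```sql
-- Generate draft descriptions only for columns whose metadata is missing.
SELECT
    table_name,
    column_name,
    prompt(
        'Write a one-sentence data catalog description for this column. '
        || 'Table: ' || table_name
        || '. Column: ' || column_name
        || '. Data type: ' || data_type
        || '. Defining SQL: ' || sql_definition
    ) AS draft_description
FROM catalog_columns
WHERE description IS NULL;
```

As in Notion’s setup, drafts like these would then go to the table owners for review rather than straight into the catalog.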
#4: Building a Recommendation Engine with DuckDB and MotherDuck
Maddie from MotherDuck discussed the process of building a real-time recommendation engine using DuckDB and MotherDuck.
Traditional Recommendation Engine Challenges: Building recommendation engines often involves complex infrastructure and batch processing.
Leveraging DuckDB for Real-Time Analysis: DuckDB’s speed and efficiency enable real-time analytical processing directly on data.
Simplified Architecture with MotherDuck: MotherDuck provides a serverless platform that simplifies the deployment and management of such applications.
Example - Movie Recommendation Engine: Maddie presented a conceptual example of a movie recommendation engine.
User interactions (likes, watches, and so on) are streamed into MotherDuck.
DuckDB is used to perform real-time analysis of user behavior and movie features.
Recommendations are generated from this analysis and served in real time.
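The webinar example was conceptual, but a simple co-occurrence recommender is easy to sketch in DuckDB SQL; the interactions table and its schema (user_id, movie_id, event_type) are assumptions for illustration.

```sql
-- For each movie, find the movies most often liked/watched by the same users.
WITH liked AS (
    SELECT user_id, movie_id
    FROM interactions
    WHERE event_type IN ('like', 'watch')
),
co AS (
    SELECT
        a.movie_id AS seed_movie,
        b.movie_id AS recommended_movie,
        count(*)   AS co_occurrences
    FROM liked AS a
    JOIN liked AS b
      ON a.user_id = b.user_id
     AND a.movie_id <> b.movie_id
    GROUP BY 1, 2
)
SELECT seed_movie, recommended_movie, co_occurrences
FROM co
QUALIFY row_number() OVER (
    PARTITION BY seed_movie ORDER BY co_occurrences DESC
) <= 10;  -- keep the top 10 recommendations per movie
```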
Playground for End Users: DuckDB offers an excellent sandbox environment for end users to explore and experiment with data.
This webinar provided valuable insights into the evolving landscape of data warehousing with MotherDuck and showcased practical applications of AI, particularly LLMs, in enhancing developer productivity, improving data quality, and building real-time applications. The lightning talks highlighted tangible benefits and strategies for leveraging these technologies in modern data workflows.
Watch the Webinar by MotherDuck
References
MotherDuck on YouTube: 4 Lightning Talks on Practical AI Workflows from Notion, 1Password, MotherDuck & Evidence (original) (archived)