LangStruct includes query parsing capabilities to decompose natural language queries into semantic terms and structured filters. This allows RAG systems to combine embedding-based search with metadata filtering.
- Imprecise Filtering: Users must manually construct complex filter syntax
- Semantic-Only Search: Can't combine natural language with structured constraints
- Technical Barriers: Non-technical users struggle with query DSLs
- Manual Translation: Developers manually parse queries into filters
User queries aren't monolithic: they contain fundamentally different types of information that require different handling. Understanding this is key to building effective RAG systems.
Consider this real query:
“Show me Q3 2024 financial reports from tech companies with revenue over $100B that mention AI investments”
This single query contains three distinct types of information:
Structural Filters

Exact constraints that should filter results:
- quarter = "Q3 2024"
- sector = "Technology"
- revenue > $100B

These need database-style filtering, not semantic search.

Semantic Content

Conceptual topics for similarity search:
- "financial reports"
- "mention AI investments"

These need embedding-based semantic search.

Implicit Context

Assumed context from natural language:
- "Show me" implies the user wants a ranked list of documents
- "that mention" implies full-text relevance rather than an exact field match

These provide query understanding context.
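Represented as data, the decomposition of the example query might look like the following. The field names and structure here are illustrative only, not actual LangStruct output:

```python
# Hypothetical decomposition of the example query into its three parts
decomposed = {
    "structured_filters": {          # exact, database-style constraints
        "quarter": "Q3 2024",
        "sector": "Technology",
        "revenue": {"$gte": 100.0},  # billions
    },
    "semantic_terms": [              # fuzzy concepts for embedding search
        "financial reports",
        "AI investments",
    ],
    "implicit_context": "user wants a list of matching documents",
}

# Filters narrow the candidate set; semantic terms rank what remains
assert decomposed["structured_filters"]["revenue"]["$gte"] == 100.0
assert "AI investments" in decomposed["semantic_terms"]
```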
Traditional Approach: Embed the entire query and do similarity search
```python
# Everything becomes one embedding vector
query_embedding = embed("Show me Q3 2024 financial reports from tech companies...")
results = vector_db.similarity_search(query_embedding)
# Returns documents mentioning ANY of these concepts, not ALL
```
Problems:
- "Q3 2024" becomes a weak semantic signal instead of an exact filter
- "revenue over $100B" is a numeric constraint embeddings cannot enforce
- Results match any of the concepts rather than all of the constraints
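A toy illustration of why this matters, in plain Python with no vector database: pure similarity ranking rewards overlap with any query concept, while filter-then-search enforces the exact constraints first.

```python
# Two documents: doc_a mentions more query concepts, but only doc_b
# actually satisfies the exact quarter and revenue constraints.
docs = {
    "doc_a": {"text_topics": {"AI investments", "financial reports"},
              "quarter": "Q1 2023", "revenue": 5.0},
    "doc_b": {"text_topics": {"AI investments"},
              "quarter": "Q3 2024", "revenue": 125.3},
}
query_topics = {"AI investments", "financial reports"}

# Semantic-only: rank by topic overlap -> doc_a wins despite wrong quarter/revenue
semantic_rank = sorted(docs, key=lambda d: -len(docs[d]["text_topics"] & query_topics))
assert semantic_rank[0] == "doc_a"

# Filter-then-search: apply exact constraints first, then rank the survivors
survivors = [d for d, m in docs.items()
             if m["quarter"] == "Q3 2024" and m["revenue"] >= 100.0]
assert survivors == ["doc_b"]
```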
Semantic Terms

- What they are: Conceptual topics that benefit from semantic understanding
- Examples: "expansion plans", "AI investments", "showing improvement"
- How they work: Converted to embeddings for similarity matching
- Best for: themes, topics, and concepts that can be phrased many ways

Structured Filters

- What they are: Exact constraints that must be precisely matched
- Examples: dates and quarters, numeric thresholds, categories, boolean flags
- How they work: Converted to database-style filter operations
- Best for: dates, numbers, categories, and exact values
Let’s see how different queries naturally decompose:
"Find profitable tech companies from Q3 2024 discussing expansion plans"
- Semantic terms: `["expansion plans", "discussing growth"]`
- Structured filters: `{"quarter": "Q3 2024", "sector": "Technology", "profitable": true}`

"Patient records over 65 years old with diabetes showing improvement"
- Semantic terms: `["showing improvement", "better outcomes"]`
- Structured filters: `{"patient_age": {"$gte": 65}, "diagnosis": "diabetes"}`

"Best-selling electronics under $500 with 4+ star reviews available for shipping"
- Semantic terms: `["best-selling", "popular items"]`
- Structured filters: `{"category": "Electronics", "price": {"$lt": 500}, "rating": {"$gte": 4.0}, "in_stock": true}`
By separating queries into components, you can:
- Enforce exact constraints (dates, numbers, categories) with database-style filters
- Reserve embedding search for the genuinely fuzzy concepts
- Explain to users exactly how their query was interpreted
LangStruct's query() method uses LLM intelligence to automatically convert natural language into semantic terms for embedding search and structured filters for metadata matching.
Important: LangStruct uses pure LLM-based parsing - no regex patterns or hardcoded rules. The LLM naturally understands comparisons like “over $100B”, temporal references like “Q3 2024”, and entity mentions, intelligently mapping them to your schema fields.
```python
from langstruct import LangStruct

# Define your schema (same as document extraction!)
schema_example = {
    "company": "Apple Inc.",
    "revenue": 125.3,        # billions
    "quarter": "Q3 2024",
    "profit_margin": 23.1,   # percentage
    "growth_rate": 15.2,
    "sector": "Technology"
}

# Create LangStruct instance
ls = LangStruct(example=schema_example)

# Parse natural language query
query = "Show me Q3 2024 tech companies with revenue over $100B"
result = ls.query(query)

print("📝 Original query:", query)
print("🔍 Semantic terms:", result.semantic_terms)
print("🎯 Structured filters:", result.structured_filters)
print("💯 Confidence:", f"{result.confidence:.1%}")
print("📖 Explanation:", result.explanation)
```
Output:
```
📝 Original query: Show me Q3 2024 tech companies with revenue over $100B
🔍 Semantic terms: ['tech companies']
🎯 Structured filters: {
    'quarter': 'Q3 2024',
    'sector': 'Technology',
    'revenue': {'$gte': 100.0}
}
💯 Confidence: 91.5%
📖 Explanation:
Searching for: tech companies
With filters:
  • quarter = Q3 2024
  • sector = Technology
  • revenue ≥ 100.0
```
LangStruct handles sophisticated query patterns automatically:
```python
# Comparative queries
result = ls.query("Companies with margins above 20% and declining growth")
# Filters: {"profit_margin": {"$gte": 20.0}, "growth_rate": {"$lt": 0}}

# Multiple entities
result = ls.query("Apple or Microsoft Q3 2024 financial results")
# Filters: {"company": {"$in": ["Apple Inc.", "Microsoft"]}, "quarter": "Q3 2024"}

# Range queries
result = ls.query("Mid-size companies between $10B and $50B revenue")
# Filters: {"revenue": {"$gte": 10.0, "$lte": 50.0}}

# Temporal references
result = ls.query("Recent quarterly reports from profitable companies")
# Filters: {"quarter": "Q3 2024", "profit_margin": {"$gt": 0}}
```
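The MongoDB-style operators in these filters are straightforward to evaluate against plain metadata dicts. Here is a minimal matcher covering the operators shown above; it is an illustration, not part of LangStruct:

```python
# Comparison operators used in the generated filters
OPS = {
    "$gt":  lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt":  lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
    "$in":  lambda a, b: a in b,
}

def matches(metadata: dict, filters: dict) -> bool:
    """True if metadata satisfies every filter clause."""
    for field, cond in filters.items():
        value = metadata.get(field)
        if isinstance(cond, dict):   # operator clause, e.g. {"$gte": 10.0}
            if not all(OPS[op](value, target) for op, target in cond.items()):
                return False
        elif value != cond:          # plain equality
            return False
    return True

doc = {"revenue": 25.0, "quarter": "Q3 2024", "company": "Apple Inc."}
assert matches(doc, {"revenue": {"$gte": 10.0, "$lte": 50.0}})
assert not matches(doc, {"quarter": "Q2 2024"})
assert matches(doc, {"company": {"$in": ["Apple Inc.", "Microsoft"]}})
```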
Every `ParsedQuery` includes an `explanation` string that summarizes how the query was parsed (semantic terms and filters). Because it is human-readable, you can render it directly in UIs for transparency, and it is equally useful for debugging.
```python
from langstruct import LangStruct
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

class EnhancedRAGSystem:
    def __init__(self, schema_example):
        # Same schema for both extraction and parsing!
        self.langstruct = LangStruct(example=schema_example)
        self.vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

    def index_document(self, text: str):
        """Extract metadata and index document"""
        # Extract structured metadata
        extraction = self.langstruct.extract(text)

        # Index with both text and metadata
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[extraction.entities]
        )

    def natural_query(self, query: str, k: int = 5):
        """Query using natural language"""
        # Parse query into components
        parsed = self.langstruct.query(query)

        # Perform hybrid search
        results = self.vectorstore.similarity_search(
            query=' '.join(parsed.semantic_terms),
            k=k,
            filter=parsed.structured_filters
        )

        return results, parsed.explanation
```
```python
# Usage
rag = EnhancedRAGSystem(schema_example={
    "company": "Example Corp",
    "revenue": 50.0,
    "quarter": "Q3 2024",
    "sector": "Technology"
})

# Index documents (structured extraction)
rag.index_document("Apple reported Q3 2024 revenue of $125.3B...")
rag.index_document("Microsoft Q3 2024 earnings showed $62.9B revenue...")

# Query naturally (structured parsing)
results, explanation = rag.natural_query(
    "Q3 2024 tech companies with revenue over $60B"
)

print(f"Found {len(results)} matching documents")
print(f"Query interpretation: {explanation}")
```
```python
financial_ls = LangStruct(example={
    "company": "Apple Inc.",
    "quarter": "Q3 2024",
    "revenue": 125.3,
    "profit_margin": 23.1,
    "eps": 1.53,
    "guidance": "positive"
})

queries = [
    "Q3 earnings beats with positive guidance",
    "Companies missing revenue estimates",
    "Tech giants with EPS above $1.50",
    "Declining margins in Q3 2024"
]

for q in queries:
    result = financial_ls.query(q)
    print(f"Query: {q}")
    print(f"Filters: {result.structured_filters}\n")
```
```python
medical_ls = LangStruct(example={
    "patient_age": 65,
    "diagnosis": "diabetes",
    "medication": "metformin",
    "severity": "moderate",
    "outcome": "improved"
})

# Parse medical queries
result = medical_ls.query(
    "Elderly diabetes patients on metformin with improved outcomes"
)
# Filters: {
#     "patient_age": {"$gte": 65},
#     "diagnosis": "diabetes",
#     "medication": "metformin",
#     "outcome": "improved"
# }
```
```python
product_ls = LangStruct(example={
    "category": "Electronics",
    "price": 999.99,
    "rating": 4.5,
    "brand": "Apple",
    "in_stock": True
})

# Parse shopping queries
result = product_ls.query(
    "Apple electronics under $500 with 4+ star ratings in stock"
)
# Filters: {
#     "brand": "Apple",
#     "category": "Electronics",
#     "price": {"$lt": 500.0},
#     "rating": {"$gte": 4.0},
#     "in_stock": True
# }
```
```python
from chromadb import Client
from langstruct import LangStruct

# Setup
client = Client()
collection = client.create_collection("documents")
ls = LangStruct(example=your_schema)

# Query with natural language
def smart_search(query: str):
    parsed = ls.query(query)

    results = collection.query(
        query_texts=parsed.semantic_terms,
        where=parsed.structured_filters,
        n_results=10
    )

    return results
```
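One practical wrinkle: recent Chroma versions require multiple `where` conditions to be combined explicitly with `$and`. A small helper (not part of LangStruct) can adapt a flat filter dict before passing it as `where`:

```python
def to_chroma_where(filters: dict) -> dict:
    """Wrap multi-field filters in $and, as Chroma's `where` expects."""
    clauses = [{field: cond} for field, cond in filters.items()]
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]          # single condition needs no wrapper
    return {"$and": clauses}

assert to_chroma_where({"quarter": "Q3 2024"}) == {"quarter": "Q3 2024"}
assert to_chroma_where({"quarter": "Q3 2024", "revenue": {"$gte": 100.0}}) == {
    "$and": [{"quarter": "Q3 2024"}, {"revenue": {"$gte": 100.0}}]
}
```

With this in place, the call becomes `where=to_chroma_where(parsed.structured_filters)`.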
```python
import pinecone
from langstruct import LangStruct

# Setup
pinecone.init(api_key="your-api-key")
index = pinecone.Index("your-index")
ls = LangStruct(example=your_schema)

# Natural language query
def pinecone_search(query: str):
    parsed = ls.query(query)

    # Convert to Pinecone filter format
    pinecone_filter = {
        f"metadata.{k}": v
        for k, v in parsed.structured_filters.items()
    }

    results = index.query(
        vector=embed(parsed.semantic_terms),
        filter=pinecone_filter,
        top_k=10
    )

    return results
```
Always use the same schema for document extraction and query parsing:
```python
# ✅ Good: Single instance for both operations
schema = {"company": "Apple", "revenue": 100.0, "quarter": "Q3"}
ls = LangStruct(example=schema)
# Use ls.extract() for documents and ls.query() for queries

# ❌ Bad: Different schemas for extraction and queries
extractor = LangStruct(example={"company": "Apple"})
query_ls = LangStruct(example={"firm": "Apple"})  # Mismatch!
```
```python
def safe_query(ls, query):
    try:
        result = ls.query(query)
        if result.confidence < 0.5:
            # Fall back to pure semantic search
            return {"semantic_only": True, "query": query}
        return result.structured_filters
    except Exception as e:
        logger.warning(f"Parse failed: {e}")
        return {"semantic_only": True, "query": query}
```
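A quick way to exercise both paths of this fallback logic is with a stub in place of a real LangStruct instance. The stub classes and the confidence values below are assumptions for illustration only:

```python
import logging

logger = logging.getLogger(__name__)

class StubParsed:
    """Hypothetical stand-in for a ParsedQuery result."""
    def __init__(self, confidence, structured_filters):
        self.confidence = confidence
        self.structured_filters = structured_filters

class StubLS:
    """Stand-in for LangStruct; returns a canned parse."""
    def __init__(self, parsed):
        self._parsed = parsed
    def query(self, q):
        return self._parsed

def safe_query(ls, query):
    # Same fallback logic as above
    try:
        result = ls.query(query)
        if result.confidence < 0.5:
            return {"semantic_only": True, "query": query}
        return result.structured_filters
    except Exception as e:
        logger.warning(f"Parse failed: {e}")
        return {"semantic_only": True, "query": query}

# Confident parse -> structured filters pass through
good = StubLS(StubParsed(0.9, {"quarter": "Q3 2024"}))
assert safe_query(good, "Q3 2024 reports") == {"quarter": "Q3 2024"}

# Low confidence -> semantic-only fallback
shaky = StubLS(StubParsed(0.3, {}))
assert safe_query(shaky, "something vague")["semantic_only"] is True
```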
```python
# Domain-specific instance with rich schema
domain_ls = LangStruct(
    example={
        # Include all filterable fields
        "company": "Example Corp",
        "revenue": 50.0,
        "revenue_growth": 15.2,
        "profit_margin": 20.1,
        "quarter": "Q3 2024",
        "fiscal_year": 2024,
        "sector": "Technology",
        "market_cap": "Large Cap",
        # Include synonyms in descriptions
        "earnings": 10.5,  # Also covers "profits", "income"
    },
)
# Call domain_ls.optimize(...) with training examples when ready
```
```python
from functools import lru_cache

class CachedLangStruct:
    def __init__(self, schema):
        self.ls = LangStruct(example=schema)

    @lru_cache(maxsize=1000)
    def query_cached(self, query: str):
        """Cache frequently used queries"""
        return self.ls.query(query)
```
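Cache hit rates improve if you normalize queries before lookup, so that trivially different phrasings share one entry. A minimal sketch follows; the `normalize` and `parse_cached` helpers are illustrative, not LangStruct APIs:

```python
from functools import lru_cache

def normalize(q: str) -> str:
    """Collapse case and whitespace so near-duplicate queries share one cache entry."""
    return " ".join(q.lower().split())

@lru_cache(maxsize=1000)
def parse_cached(normalized: str):
    # In real use this would call ls.query(normalized); a dict stands in here
    return {"parsed": normalized}

def query_cached(q: str):
    return parse_cached(normalize(q))

# Different surface forms, one cached parse
assert query_cached("  Q3 2024  Tech ") == query_cached("q3 2024 tech")
assert parse_cached.cache_info().hits >= 1
```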
```python
# Process multiple queries efficiently
queries = [
    "Q3 2024 tech companies over $100B",
    "Healthcare companies with positive growth",
    "Financial services declining margins"
]

# Parse all at once
results = [ls.query(q) for q in queries]

# Or with parallel processing
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(ls.query, queries))
```
```python
# Users must write complex filter syntax
results = vectorstore.search(
    query="technology financial performance",
    filter={
        "$and": [
            {"quarter": {"$eq": "Q3 2024"}},
            {"revenue": {"$gte": 100000000000}},
            {"sector": {"$eq": "Technology"}}
        ]
    }
)
```
```python
# Users write natural language
results = enhanced_rag.search(
    "Q3 2024 tech companies with revenue over $100B"
)
# Filters automatically generated!
```
- Complete Example: See the full bidirectional RAG example with query parsing
- RAG Integration: Learn about complete RAG enhancement
- API Reference: Explore the LangStruct API
- Optimization: Optimize query parsing for your domain
LangStruct's query() method completes the bidirectional RAG enhancement: the same schema drives both extract() for indexing documents and query() for parsing user queries.
Transform your RAG system from fuzzy search to precision retrieval with LangStruct!