LangStruct uses DSPy 3.0 as its foundation for optimized text extraction. DSPy provides programmatic interfaces for language models that can be automatically optimized instead of manually tuned.
DSPy signatures define structured interfaces for language model tasks:
```python
import dspy

class ExtractPerson(dspy.Signature):
    """Extract person information from text."""

    text: str = dspy.InputField()
    name: str = dspy.OutputField()
    age: int = dspy.OutputField()
```
Modules compose signatures into reusable components:
```python
extraction_module = dspy.ChainOfThought(ExtractPerson)
result = extraction_module(text="John Smith is 25 years old")
```
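A language model must be configured before a module will run. The sketch below is an illustration only: the `openai/gpt-4o-mini` model name is an assumption, and you would swap in whatever provider you use. It also shows how the typed output fields come back on the prediction object.

```python
import dspy

# Point DSPy at a language model first (the model name here is only an example)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

result = extraction_module(text="John Smith is 25 years old")
print(result.name)  # e.g. "John Smith"
print(result.age)   # e.g. 25
```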
MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2) is DSPy's advanced optimization algorithm that automatically improves your extraction pipeline:
What MIPROv2 Does
- Rewrites prompts to be more effective for your specific data
- Selects examples that best demonstrate the extraction task
- Tests combinations of instructions and demonstrations
- Iteratively improves based on your training data
What You Get
```python
optimizer = dspy.MIPROv2(metric=accuracy_metric)
optimized_module = optimizer.compile(
    student=extraction_module,
    trainset=training_examples,
)
```
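The `accuracy_metric` above is whatever scoring function fits your task. A minimal sketch, assuming the usual DSPy metric signature of an example, a prediction, and an optional trace, might compare predicted fields against gold labels; the field names here are only an example.

```python
def accuracy_metric(example, prediction, trace=None):
    # Fraction of output fields whose predicted value matches the gold label
    fields = ["name", "age"]
    correct = sum(
        getattr(prediction, f, None) == getattr(example, f, None) for f in fields
    )
    return correct / len(fields)
```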
LangStruct implements a modular extraction pipeline using DSPy components:
Manual Approach
prompt = f"Extract data from: {text}"result = llm.complete(prompt)
LangStruct with DSPy
```python
class ExtractData(dspy.Signature):
    text: str = dspy.InputField()
    data: dict = dspy.OutputField()

extractor = dspy.ChainOfThought(ExtractData)
```
```python
class ExtractionPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.chunk = dspy.ChainOfThought(ChunkText)
        self.extract = dspy.ChainOfThought(ExtractFromChunk)
        self.validate = dspy.ChainOfThought(ValidateExtraction)
        self.aggregate = dspy.ChainOfThought(AggregateResults)

    def forward(self, document):
        chunks = self.chunk(text=document)
        extractions = [self.extract(chunk=c) for c in chunks]
        validated = [self.validate(data=e) for e in extractions]
        return self.aggregate(results=validated)
```
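Because the whole pipeline is a single `dspy.Module`, it can be called like any other module and handed to an optimizer as one unit. The sketch below assumes the `ChunkText`, `ExtractFromChunk`, `ValidateExtraction`, and `AggregateResults` signatures are defined elsewhere, and `annual_report_text` is a placeholder document.

```python
# Hypothetical usage of the pipeline above
pipeline = ExtractionPipeline()
summary = pipeline(document=annual_report_text)

# MIPROv2 can optimize all four stages together, not just a single prompt
optimized_pipeline = dspy.MIPROv2(metric=accuracy_metric).compile(
    student=pipeline,
    trainset=training_examples,
)
```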
LangStruct integrates MIPROv2 seamlessly; just call optimize() and it handles the rest:
```python
from langstruct import LangStruct

# 1. Create extractor with your schema
extractor = LangStruct(example={
    "company": "Apple",
    "revenue": 100.0,
    "quarter": "Q3"
})

# 2. Let MIPROv2 optimize prompts and examples automatically
extractor.optimize(
    texts=["Apple reported $125B in Q3...", "Meta earned $40B..."],
    expected_results=[
        {"company": "Apple", "revenue": 125.0, "quarter": "Q3"},
        {"company": "Meta", "revenue": 40.0, "quarter": "Q3"}
    ]
)

# 3. Now it's optimized for your specific data!
result = extractor.extract("Microsoft announced $65B revenue for Q4")
```
Behind the scenes, MIPROv2 performs the prompt rewriting, example selection, and iterative refinement described above, so you never edit a prompt by hand.
The Problem: Traditional extraction systems break when you change models. Prompts tuned for older GPT-4-era models fail on Claude or the GPT-5 lineup. Hand-crafted examples optimized for one model perform poorly on another.
The Solution: DSPy + LangStruct automatically re-optimize for any model:
```python
# Start with OpenAI
extractor = LangStruct(
    example={"company": "Apple", "revenue": 100.0},
    model="gpt-5-mini",
)
extractor.optimize(texts=training_texts, expected_results=expected_results)

# 6 months later, switch to Claude - just two lines!
extractor.model = "claude-3-7-sonnet-latest"
extractor.optimize(texts=training_texts, expected_results=expected_results)  # Auto-reoptimizes prompts

# Or use local models for privacy
extractor.model = "ollama/llama3.2"
extractor.optimize(texts=training_texts, expected_results=expected_results)  # Works the same way

# Same accuracy, zero prompt rewriting, zero vendor lock-in
```
Benefits:
- No prompts to rewrite when you switch models
- No vendor lock-in
- Local models work the same way when privacy matters
LangStruct extends DSPy signatures to include source tracking:
```python
class ExtractWithSources(dspy.Signature):
    """Extract structured data with source attribution."""

    text: str = dspy.InputField()
    data: str = dspy.OutputField(desc="JSON formatted extraction")
    sources: str = dspy.OutputField(desc="Character spans for each field")
```
```python
result = extractor.extract(text)
# result.data contains structured information
# result.sources contains character-level attribution
```
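How you consume the attribution depends on how the spans are encoded. A hypothetical sketch, assuming the spans have been parsed into a mapping from field name to (start, end) character offsets, could print the supporting text for each field like this:

```python
# Hypothetical: assumes result.sources is {field_name: [(start, end), ...]}
for field, spans in result.sources.items():
    for start, end in spans:
        print(f"{field}: {text[start:end]!r} (chars {start}-{end})")
```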
DSPy optimization can improve LangStruct performance: