Research Efficiency
Process hundreds of papers 10x faster than manual review
Extract structured information from research papers, academic articles, and scientific literature, including methodology, results, citations, and key findings, for literature reviews, meta-analyses, and research synthesis.
Define comprehensive schemas for academic paper analysis:
```python
from langstruct import LangStruct, Schema, Field
from typing import List, Optional, Dict
from datetime import datetime

class AuthorSchema(Schema):
    name: str = Field(description="Author's full name")
    affiliation: Optional[str] = Field(description="Author's institutional affiliation")
    email: Optional[str] = Field(description="Author's email address")
    orcid: Optional[str] = Field(description="ORCID identifier")

class CitationSchema(Schema):
    title: str = Field(description="Title of cited work")
    authors: List[str] = Field(description="Authors of cited work")
    journal: Optional[str] = Field(description="Journal or venue name")
    year: Optional[int] = Field(description="Publication year")
    doi: Optional[str] = Field(description="DOI identifier")
    citation_type: str = Field(description="Type of citation (supporting, contrasting, methodology)")

class MethodologySchema(Schema):
    study_design: str = Field(description="Research design or methodology type")
    sample_size: Optional[int] = Field(description="Number of subjects or samples")
    data_collection: str = Field(description="Data collection methods")
    analysis_methods: List[str] = Field(description="Statistical or analytical methods used")
    materials: List[str] = Field(description="Materials, tools, or instruments used")
    limitations: List[str] = Field(description="Study limitations acknowledged by authors")

class ResultSchema(Schema):
    finding: str = Field(description="Key finding or result")
    statistical_significance: Optional[str] = Field(description="Statistical significance (p-value, confidence interval)")
    effect_size: Optional[str] = Field(description="Effect size or magnitude")
    supporting_data: Optional[str] = Field(description="Supporting data or evidence")

class KeywordSchema(Schema):
    keywords: List[str] = Field(description="Author-provided keywords")
    research_domains: List[str] = Field(description="Research domains or fields")
    methodology_tags: List[str] = Field(description="Methodology-related tags")

class ScientificPaperSchema(Schema):
    title: str = Field(description="Paper title")
    authors: List[AuthorSchema] = Field(description="All authors")
    abstract: str = Field(description="Paper abstract")
    keywords: KeywordSchema = Field(description="Keywords and research domains")
    publication_info: Dict[str, str] = Field(description="Journal, volume, pages, DOI, publication date")
    research_question: str = Field(description="Main research question or hypothesis")
    methodology: MethodologySchema = Field(description="Research methodology and methods")
    key_results: List[ResultSchema] = Field(description="Main findings and results")
    conclusions: List[str] = Field(description="Author conclusions")
    future_work: List[str] = Field(description="Suggested future research directions")
    citations: List[CitationSchema] = Field(description="Key references cited")
    funding: Optional[str] = Field(description="Funding sources")
    conflicts_of_interest: Optional[str] = Field(description="Declared conflicts of interest")
```
Create an extractor for research paper analysis:
```python
# Create the extractor optimized for academic content
extractor = LangStruct(
    schema=ScientificPaperSchema,
    model="gemini-2.5-flash",  # Fast and reliable for academic content
    optimize=True,
    use_sources=True,   # Track where information was found
    temperature=0.2,    # Slightly higher for nuanced interpretation
    max_retries=3
)
```
```python
# Example research paper text (excerpt)
paper_text = """Machine Learning Approaches for Climate Change Prediction: A Comparative Study

Authors: Dr. Sarah Chen¹, Prof. Michael Rodriguez², Dr. Lisa Wang¹
¹Department of Environmental Science, Stanford University, Stanford, CA
²Climate Research Institute, MIT, Cambridge, MA

Abstract:
Climate change prediction remains one of the most critical challenges of our time.
This study compares the effectiveness of three machine learning approaches—Random
Forest (RF), Support Vector Machines (SVM), and Long Short-Term Memory networks
(LSTM)—for predicting temperature anomalies using 50 years of global climate data.
Our analysis of 10,000 weather stations worldwide shows that LSTM models achieve
the highest accuracy (R² = 0.89, p < 0.001) compared to RF (R² = 0.78) and SVM
(R² = 0.71). The study demonstrates that deep learning approaches can significantly
improve climate prediction accuracy, with implications for policy and adaptation planning.

Keywords: climate change, machine learning, temperature prediction, LSTM, comparative analysis

1. Introduction
Climate prediction accuracy is crucial for informed policy decisions and adaptation
strategies (Hansen et al., 2016; IPCC, 2021). Traditional statistical models have
limitations in capturing complex non-linear relationships in climate systems
(Smith & Johnson, 2020). This research addresses the question: "Which machine
learning approach provides the most accurate temperature anomaly predictions?"

2. Methodology
Study Design: Comparative experimental design using historical climate data
Sample: 50 years of daily temperature data from 10,000 weather stations (1973-2023)
Data Collection: National Oceanic and Atmospheric Administration (NOAA) database
Analysis Methods:
- Random Forest with 1000 trees
- Support Vector Machine with RBF kernel
- LSTM with 128 hidden units, 3 layers
- 80/20 train-test split with 5-fold cross-validation
Materials: Python 3.9, TensorFlow 2.8, Scikit-learn 1.1
Limitations: Limited to temperature data only, potential geographic bias

3. Results
The LSTM model achieved superior performance across all metrics:
- LSTM: R² = 0.89 (95% CI: 0.87-0.91), RMSE = 0.34°C, p < 0.001
- Random Forest: R² = 0.78 (95% CI: 0.75-0.81), RMSE = 0.45°C
- SVM: R² = 0.71 (95% CI: 0.68-0.74), RMSE = 0.52°C

Regional analysis showed LSTM improvements were consistent across all climate zones,
with the largest gains in tropical regions (R² improvement of 0.15 over RF).

4. Conclusions
LSTM networks provide significantly better climate prediction accuracy than
traditional machine learning approaches. The 14% improvement in R² represents
a substantial advance for climate modeling applications.

Future research should explore: (1) integration of satellite data, (2) ensemble
methods combining multiple ML approaches, (3) real-time prediction systems.

Funding: This research was supported by NSF Grant #2045789 and NASA Climate Grant #80NSSC21K1234.
Conflicts: The authors declare no conflicts of interest.

References:
Hansen, J., Sato, M., & Kharecha, P. (2016). Global temperature evolution: Recent
trends and some pitfalls. Journal of Climate, 29(4), 1265-1281.

IPCC. (2021). Climate Change 2021: The Physical Science Basis. Cambridge University Press.

Smith, A., & Johnson, B. (2020). Limitations of traditional climate models.
Nature Climate Change, 10(8), 723-729."""

# Extract paper information
result = extractor.extract(paper_text)

print("Scientific Paper Analysis:")
print(f"Title: {result.entities.title}")
print(f"Authors: {[author.name for author in result.entities.authors]}")
print(f"Research Question: {result.entities.research_question}")
print(f"Sample Size: {result.entities.methodology.sample_size:,}")
print(f"Key Results: {len(result.entities.key_results)} findings")
print(f"Citations: {len(result.entities.citations)} references")

# Show methodology details
print("\nMethodology:")
print(f"Study Design: {result.entities.methodology.study_design}")
print(f"Analysis Methods: {', '.join(result.entities.methodology.analysis_methods)}")

# Display key findings
print("\nKey Findings:")
for i, result_item in enumerate(result.entities.key_results, 1):
    print(f"{i}. {result_item.finding}")
    if result_item.statistical_significance:
        print(f"   Statistics: {result_item.statistical_significance}")
```
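Because the extractor above was created with `use_sources=True`, extracted fields can be traced back to spans in the original text. The exact shape of the source-grounding payload depends on your LangStruct version, so the helper below is only a sketch: it assumes a mapping from field names to lists of span dicts with a `'text'` key, which you should verify against the library's actual output.

```python
def show_grounding(sources, field):
    """List the text spans that support one extracted field.

    Assumes `sources` maps field names to lists of span dicts with a
    'text' key -- verify against your LangStruct version's actual
    source-grounding format before relying on this shape.
    """
    return [span['text'] for span in sources.get(field, [])]
```

Under that assumed shape, `show_grounding(result.sources, "research_question")` would return the sentence(s) the research question was extracted from.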
```python
class ClinicalTrialSchema(Schema):
    trial_phase: Optional[str] = Field(description="Clinical trial phase (I, II, III, IV)")
    participants: int = Field(description="Number of study participants")
    inclusion_criteria: List[str] = Field(description="Patient inclusion criteria")
    exclusion_criteria: List[str] = Field(description="Patient exclusion criteria")
    primary_endpoint: str = Field(description="Primary outcome measure")
    secondary_endpoints: List[str] = Field(description="Secondary outcome measures")
    adverse_events: List[str] = Field(description="Reported adverse events")
    ethics_approval: Optional[str] = Field(description="Ethics committee approval details")

class MedicalPaperSchema(ScientificPaperSchema):
    clinical_trial: Optional[ClinicalTrialSchema] = Field(description="Clinical trial details")
    medical_keywords: List[str] = Field(description="Medical Subject Headings (MeSH)")
    patient_population: str = Field(description="Target patient population")
    interventions: List[str] = Field(description="Medical interventions studied")

medical_extractor = LangStruct(schema=MedicalPaperSchema)
```
```python
class AlgorithmSchema(Schema):
    name: str = Field(description="Algorithm name")
    complexity: Optional[str] = Field(description="Time/space complexity (Big O notation)")
    novelty: str = Field(description="Novel contribution or improvement")
    comparison_baseline: List[str] = Field(description="Algorithms compared against")

class CSPaperSchema(ScientificPaperSchema):
    algorithms: List[AlgorithmSchema] = Field(description="Algorithms presented")
    datasets: List[str] = Field(description="Datasets used for evaluation")
    evaluation_metrics: List[str] = Field(description="Performance metrics")
    code_availability: Optional[str] = Field(description="Code repository or availability")
    reproducibility: str = Field(description="Reproducibility information")

cs_extractor = LangStruct(schema=CSPaperSchema)
```
```python
class EnvironmentalDataSchema(Schema):
    location: str = Field(description="Study location or geographic area")
    time_period: str = Field(description="Study time period")
    environmental_variables: List[str] = Field(description="Environmental factors measured")
    measurement_methods: List[str] = Field(description="Data collection instruments/methods")
    data_sources: List[str] = Field(description="Data sources (satellites, sensors, surveys)")

class EnvironmentalPaperSchema(ScientificPaperSchema):
    environmental_data: EnvironmentalDataSchema = Field(description="Environmental study data")
    ecological_impact: List[str] = Field(description="Ecological implications")
    policy_implications: List[str] = Field(description="Policy recommendations")

env_extractor = LangStruct(schema=EnvironmentalPaperSchema)
```
Extract information for systematic reviews:
```python
class LiteratureReviewSchema(Schema):
    research_topic: str = Field(description="Main research topic")
    search_strategy: str = Field(description="Literature search methodology")
    inclusion_criteria: List[str] = Field(description="Study inclusion criteria")
    exclusion_criteria: List[str] = Field(description="Study exclusion criteria")
    studies_included: int = Field(description="Number of studies included")
    quality_assessment: str = Field(description="Study quality assessment method")
    synthesis_method: str = Field(description="Data synthesis approach")
    main_findings: List[str] = Field(description="Key meta-analysis findings")
    heterogeneity: Optional[str] = Field(description="Study heterogeneity assessment")
    bias_assessment: List[str] = Field(description="Publication bias assessment")

review_extractor = LangStruct(schema=LiteratureReviewSchema)
```
Extract and analyze citation patterns:
```python
class CitationAnalysisSchema(Schema):
    total_citations: int = Field(description="Total number of citations")
    self_citations: int = Field(description="Number of self-citations")
    citation_patterns: List[str] = Field(description="Citation pattern analysis")
    key_references: List[CitationSchema] = Field(description="Most frequently cited works")
    citation_recency: str = Field(description="Age distribution of citations")
    interdisciplinary_scope: List[str] = Field(description="Disciplines of cited works")

def analyze_citations(paper_text: str):
    citation_extractor = LangStruct(schema=CitationAnalysisSchema)
    return citation_extractor.extract(paper_text)

# Extract citation patterns
citation_analysis = analyze_citations(paper_text)
print("Citation Analysis:")
print(f"Total citations: {citation_analysis.entities.total_citations}")
print(f"Key reference areas: {citation_analysis.entities.interdisciplinary_scope}")
```
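The `citation_recency` field above is extracted as free text. If you also want a numeric summary, you can compute one locally from the `year` values in the extracted `CitationSchema` entries. `citation_age_stats` is a hypothetical helper, not part of LangStruct:

```python
from statistics import median

def citation_age_stats(citation_years, current_year=2024):
    """Summarize the age distribution of cited works.

    `citation_years` is a list of publication years (e.g. taken from
    extracted CitationSchema entries); None entries are skipped.
    """
    ages = [current_year - y for y in citation_years if y is not None]
    if not ages:
        return {'count': 0, 'median_age': None, 'oldest': None}
    return {'count': len(ages), 'median_age': median(ages), 'oldest': max(ages)}
```

For the sample paper, you might call it as `citation_age_stats([c.year for c in result.entities.citations])`.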
Process multiple papers for systematic reviews:
```python
from pathlib import Path

class LiteratureProcessor:
    def __init__(self, schema=ScientificPaperSchema):
        self.extractor = LangStruct(schema=schema)
        self.citation_analyzer = LangStruct(schema=CitationAnalysisSchema)

    def process_paper_collection(self, papers_folder: Path):
        """Process a collection of papers for literature review"""
        paper_files = list(papers_folder.glob("*.txt")) + list(papers_folder.glob("*.pdf"))

        # Prepare documents for batch processing
        paper_texts = []
        file_names = []

        for paper_file in paper_files:
            try:
                # Read paper content (handle both txt and pdf)
                if paper_file.suffix == '.pdf':
                    paper_text = self.extract_pdf_text(paper_file)
                else:
                    paper_text = paper_file.read_text(encoding='utf-8')

                paper_texts.append(paper_text)
                file_names.append(paper_file.name)
            except Exception as e:
                print(f"Error reading {paper_file}: {e}")

        # Process all papers in batch
        paper_results = self.extractor.extract(paper_texts)
        citation_results = self.citation_analyzer.extract(paper_texts)

        # Combine results
        results = []
        for i, (paper_data, citation_data) in enumerate(zip(paper_results, citation_results)):
            results.append({
                'file': file_names[i],
                'paper_data': paper_data,
                'citation_analysis': citation_data,
                'processed_at': datetime.now()
            })
            print(f"Processed: {file_names[i]}")

        return results

    def extract_pdf_text(self, pdf_path: Path) -> str:
        """Extract text from PDF files (requires PyPDF2 or similar)"""
        # Implementation depends on your PDF processing library
        # Example with PyPDF2:
        # import PyPDF2
        # with open(pdf_path, 'rb') as file:
        #     reader = PyPDF2.PdfReader(file)
        #     return ''.join(page.extract_text() for page in reader.pages)
        raise NotImplementedError("Plug in your preferred PDF text extractor here")

# Usage for systematic review
processor = LiteratureProcessor()
papers = processor.process_paper_collection(Path("./literature_review/"))

# Analyze collected data
total_papers = len(papers)
average_citations = sum(
    p['citation_analysis'].entities.total_citations for p in papers
) / total_papers
common_methods = {}  # Analyze common methodologies across papers
```
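The `common_methods` placeholder above can be filled with a simple tally. This sketch (`count_methodologies` is a hypothetical helper, not part of LangStruct) counts how often each analysis method appears across the collection:

```python
from collections import Counter

def count_methodologies(method_lists):
    """Tally how often each analysis method appears across papers.

    `method_lists` is an iterable of lists of method names, e.g. the
    `analysis_methods` field of each extracted MethodologySchema.
    """
    counts = Counter()
    for methods in method_lists:
        counts.update(methods)
    return counts
```

Feed it `(p['paper_data'].entities.methodology.analysis_methods for p in papers)` and `counts.most_common(5)` gives the dominant methodologies in the corpus.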
Compare methodologies across multiple papers:
```python
class MethodologyComparisonSchema(Schema):
    common_methods: List[str] = Field(description="Methodologies used across multiple studies")
    unique_approaches: List[str] = Field(description="Novel or unique methodological approaches")
    sample_size_range: str = Field(description="Range of sample sizes across studies")
    geographic_coverage: List[str] = Field(description="Geographic regions covered")
    temporal_coverage: str = Field(description="Time periods covered across studies")
    quality_indicators: List[str] = Field(description="Study quality indicators")
    gaps_identified: List[str] = Field(description="Research gaps identified")

def synthesize_literature(papers_data: List[dict]):
    """Synthesize findings from multiple papers"""
    # Combine all paper texts for comparative analysis
    combined_text = "\n\n--- PAPER SEPARATOR ---\n\n".join(
        f"PAPER {i+1}: {paper['paper_data'].entities.title}\n"
        f"{paper['paper_data'].original_text[:2000]}..."
        for i, paper in enumerate(papers_data)
    )

    synthesizer = LangStruct(schema=MethodologyComparisonSchema)
    return synthesizer.extract(combined_text)

# Perform literature synthesis
synthesis = synthesize_literature(papers[:10])  # Analyze first 10 papers
print(f"Common methodologies: {synthesis.entities.common_methods}")
print(f"Research gaps: {synthesis.entities.gaps_identified}")
```
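Concatenating many papers can exceed the model's context window; the 2,000-character truncation above helps, but for large collections you may also want to synthesize in batches and merge the results. A minimal batching sketch (`batch_by_char_budget` is a hypothetical helper; pick a budget suited to your model's context size):

```python
def batch_by_char_budget(texts, budget=50_000):
    """Group texts into batches whose combined length stays under `budget`.

    A single text longer than the budget still gets its own batch.
    """
    batches, current, size = [], [], 0
    for text in texts:
        if current and size + len(text) > budget:
            batches.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches
```

Each batch can then be passed to `synthesize_literature` separately, with a final pass combining the per-batch summaries.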
Generate comprehensive research analysis reports:
```python
from langstruct.core.export_utils import ExportUtilities

try:
    import pandas as pd
except ImportError:  # pandas is optional; CSV export is skipped without it
    pd = None

class ResearchReportGenerator:
    def generate_literature_review_report(self, papers_data: List[dict], output_path: str):
        """Generate comprehensive literature review report using LangStruct's export utilities"""
        # Create summary statistics
        # (count_unique_authors, analyze_methodologies, analyze_citations
        # are helper methods to implement for your own reporting needs)
        summary_stats = {
            'total_papers': len(papers_data),
            'author_count': self.count_unique_authors(papers_data),
            'methodology_distribution': self.analyze_methodologies(papers_data),
            'citation_statistics': self.analyze_citations(papers_data)
        }

        # Export each paper's data to JSON
        for i, paper in enumerate(papers_data):
            result = paper['paper_data']
            ExportUtilities.save_json(result, f"{output_path}_paper_{i:03d}.json")

        if pd is None:
            print("Pandas not available for CSV export")
            print(f"Individual paper data exported to: {output_path}_paper_*.json")
            return

        # Create DataFrame with paper summaries and save to CSV
        papers_df = self.create_papers_dataframe(papers_data)
        papers_df.to_csv(f"{output_path}_papers_summary.csv", index=False)
        print("Literature review data exported:")
        print(f"  - Individual papers: {output_path}_paper_*.json")
        print(f"  - Summary: {output_path}_papers_summary.csv")

    def create_papers_dataframe(self, papers_data: List[dict]) -> "pd.DataFrame":
        """Create pandas DataFrame from papers data"""
        rows = []
        for paper in papers_data:
            data = paper['paper_data'].entities
            rows.append({
                'Title': data.title,
                'Authors': '; '.join(a.name for a in data.authors),
                'Journal': data.publication_info.get('journal', 'N/A'),
                'Year': data.publication_info.get('year', 'N/A'),
                'Sample_Size': data.methodology.sample_size if data.methodology else 'N/A',
                'Study_Design': data.methodology.study_design if data.methodology else 'N/A',
                'Key_Results_Count': len(data.key_results),
                'Citations_Count': len(data.citations),
                'DOI': data.publication_info.get('doi', 'N/A')
            })
        return pd.DataFrame(rows)

# Usage
report_generator = ResearchReportGenerator()
report_generator.generate_literature_review_report(papers, "climate_change_ml_review")
```
Comprehensive Analysis
Extract methodology, results, citations, and key findings systematically
Source Tracking
Maintain complete traceability to original paper sources
Systematic Reviews
Perfect for meta-analysis and systematic literature reviews
```python
# Domain-specific prompts for different research areas
medical_prompt = """You are analyzing a medical research paper. Focus on:
1. Clinical significance and statistical significance
2. Patient safety and adverse events
3. Study design quality and potential biases
4. Generalizability to broader patient populations
5. Clinical implications and practice recommendations
6. Regulatory considerations and approval pathways"""

medical_extractor = LangStruct(
    schema=MedicalPaperSchema,
    system_prompt=medical_prompt,
    model="gemini-2.5-flash"
)
```
```python
class StudyQualitySchema(Schema):
    study_quality_score: float = Field(description="Overall quality score (0-10)")
    bias_risk: str = Field(description="Risk of bias assessment (low/moderate/high)")
    methodological_rigor: List[str] = Field(description="Methodological strengths")
    limitations: List[str] = Field(description="Study limitations and weaknesses")
    reproducibility_score: float = Field(description="Reproducibility assessment (0-10)")
    ethical_considerations: str = Field(description="Ethical approval and considerations")

quality_assessor = LangStruct(schema=StudyQualitySchema)
```
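Extracted quality scores can drive an automated screening pass ahead of human review. The thresholds below are purely illustrative (not a validated appraisal rubric), and `triage_papers` is a hypothetical helper that operates on plain dicts of the `StudyQualitySchema` fields:

```python
def triage_papers(assessments, min_score=7.0):
    """Bin papers for a screening pass using StudyQualitySchema fields.

    `assessments` maps paper IDs to dicts with 'study_quality_score' and
    'bias_risk'. Thresholds are illustrative; tune them for your review.
    """
    bins = {'include': [], 'manual_review': [], 'exclude': []}
    for paper_id, a in assessments.items():
        if a['study_quality_score'] >= min_score and a['bias_risk'] == 'low':
            bins['include'].append(paper_id)
        elif a['bias_risk'] == 'high':
            bins['exclude'].append(paper_id)
        else:
            bins['manual_review'].append(paper_id)
    return bins
```

Anything landing in `manual_review` (e.g. moderate bias risk or a borderline score) still goes to a human reviewer rather than being dropped automatically.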
Ready to start analyzing scientific literature?
Start with Examples
Try the sample schemas with your research papers
Customize for Your Field
Adapt schemas for your specific research domain
Build Review Workflows
Create systematic review and quality assessment processes
Scale Your Research
Process entire research domains with automated analysis