Published in Data Science Collective: "Should I Use the Same Model for My Eval as My Agent?" Examining Self-Evaluation Bias of Major LLMs. (Oct 8)
Published in Data Science Collective: "LLM-as-a-Judge: When to Use Reasoning, CoT, and Explanations." A guide on what to use where, and why. (Aug 20)
Published in Arize AI: "What Is Prompt Learning?" Prompts, like models, should improve with feedback, not stay static. (Aug 1)
Published in TDS Archive: "Choosing Between LLM Agent Frameworks." The tradeoffs between building bespoke code-based agents and the major agent frameworks. (Sep 21, 2024)
Published in TDS Archive: "Navigating the New Types of LLM Agents and Architectures." The failure of ReAct agents gives way to a new generation of agents and possibilities. (Aug 30, 2024)
Published in TDS Archive: "Evaluating SQL Generation with LLM as a Judge." Results point to a promising approach. (Jul 31, 2024)
Published in TDS Archive: "Large Language Model Performance in Time Series Analysis." How do major LLMs stack up at detecting anomalies or movements in the data when given a large set of time series data within the context… (May 1, 2024)
Published in TDS Archive: "Tips for Getting the Generation Part Right in Retrieval Augmented Generation." Results from experiments to evaluate and compare GPT-4, Claude 2.1, and Claude 3.0 Opus. (Apr 6, 2024)
Published in TDS Archive: "Model Evaluations Versus Task Evaluations." Understanding the difference for LLM applications. (Mar 26, 2024)
Published in TDS Archive: "Why You Should Not Use Numeric Evals For LLM As a Judge." Testing major LLMs on how well they conduct numeric evaluations. (Mar 8, 2024)