
Hosted by Ronald Soh · EN

This paper addresses the challenges associated with adapting Large Language Models (LLMs) for various tasks within the e-commerce domain using prompting techniques. While prompting offers an efficient alternative to fine-tuning, it often requires significant manual effort from domain experts for prompt engineering and frequent updates to align with evolving business needs. Furthermore, crafting truly unbiased natural language prompts and selecting representative in-context examples remain difficult for humans. The authors propose a novel framework called Examples as the Prompt (EaP). This approach leverages labelled data to enhance prompts by automatically selecting the most representative examples to maximise the few-shot learning capabilities of LLMs. EaP is designed to be efficient due to its unsupervised example selection and adaptive to potential data distribution shifts.
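As a rough illustration of the idea of picking in-context examples automatically from labelled data, here is a minimal sketch. It is not the authors' EaP method: the description above does not say how examples are selected, so this sketch assumes a simple nearest-neighbour selection over toy bag-of-words vectors. The names `embed`, `select_examples`, `build_prompt`, and the `LABELLED` product data are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: a real system would likely use a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_examples(query, labelled, k=2):
    """Return the k labelled (text, label) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(labelled, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(query, labelled, k=2):
    """Assemble a few-shot prompt made only of selected examples plus the query."""
    shots = select_examples(query, labelled, k)
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in shots]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

# Hypothetical labelled data for a product-categorisation task.
LABELLED = [
    ("wireless bluetooth headphones", "electronics"),
    ("cotton crew neck t-shirt", "apparel"),
    ("noise cancelling earbuds", "electronics"),
    ("stainless steel frying pan", "kitchen"),
]

if __name__ == "__main__":
    print(build_prompt("bluetooth earbuds wireless", LABELLED))
```

Because the examples are chosen per query from whatever labelled pool is supplied, refreshing the pool is one plausible way such a setup could follow a shift in the data distribution without rewriting the prompt by hand.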

This paper addresses a core challenge in aligning large language models (LLMs) with human preferences: the substantial data requirements and technical complexity of current state-of-the-art methods, particularly Reinforcement Learning from Human Feedback (RLHF). The authors propose a novel approach based on inverse reinforcement learning (IRL) that can learn alignment directly from demonstration data, eliminating the need for explicit human preference data required by traditional RLHF methods. This research presents a significant step towards simplifying the alignment of large language models by demonstrating that high-quality demonstration data can be effectively leveraged to learn alignment without the need for explicit and costly human preference annotations. The proposed IRL framework offers a promising alternative or complementary approach to existing RLHF methods, potentially reducing the data burden and technical complexities associated with preference collection and reward modelling.

This paper critically examines the use of multiple-choice question (MCQ) benchmarks to assess the medical knowledge and reasoning capabilities of Large Language Models (LLMs). The central argument is that high LLM performance on medical MCQs may overestimate true medical understanding and may be driven by factors other than genuine knowledge and reasoning. The authors introduce a novel benchmark of paired free-response and MCQ questions (FreeMedQA) to test this hypothesis. The study provides compelling evidence that performance on medical MCQ benchmarks may not be a reliable indicator of the true medical knowledge and reasoning abilities of LLMs. The significant performance drop on free-response questions, coupled with above-chance MCQ accuracy even when the questions are completely masked, suggests that LLMs might be exploiting the structure of MCQs rather than demonstrating genuine understanding. The findings underscore the importance of developing and using more rigorous evaluation methods, such as free-response questions, to accurately assess the potential and limitations of LLMs in medical applications.

This paper investigates the impact of Generative Artificial Intelligence (GAI), such as ChatGPT, Kimi, and Doubao, on students' learning across four grade levels (high school sophomores and juniors, university juniors and seniors) in six key areas collectively termed LIPSAL: learning interest, independent learning, problem-solving, self-confidence, appropriate use, and learning enjoyment. The study employed a hybrid-survey method combining questionnaires and group interviews. Key findings indicate that GAI has a generally positive impact on all LIPSAL aspects, with the most significant influence on 'appropriate use' and 'independent learning', and the least on 'learning interest' and 'self-confidence'. University students reported a higher level across all LIPSAL aspects compared to high school students. Students hold a positive attitude towards GAI and are willing to use it, recognising its potential while also acknowledging challenges related to accuracy, over-dependence, and ethical considerations.

This document summarises the key findings and insights from the NeurIPS 2023 Large Language Model (LLM) Efficiency Fine-tuning Competition. The competition aimed to democratise access to state-of-the-art LLMs by challenging participants to fine-tune a pre-trained model within a tight 24-hour timeframe on a single GPU. The analysis of the competition reveals a significant trend towards benchmark overfitting, highlighting the limitations of current evaluation methods. Notably, top-performing submissions prioritised data curation and the use of standard open-source libraries over custom model architectures. The competition also underscored the importance of software quality and reproducibility in the machine learning community. The organisers have released all competition entries and evaluation infrastructure to facilitate further research in this area.

This briefing document reviews the main themes and important ideas presented in Krti Tallam's paper on Orchestrated Distributed Intelligence (ODI). The paper argues for a paradigm shift in the field of Agentic AI, moving away from the development of isolated autonomous agents towards the creation of integrated, orchestrated systems of agents that work collaboratively with human workflows. ODI is presented as a novel approach that combines systems theory with AI capabilities, aiming to bridge the gap between artificial and human intelligence and transition organisations from static systems of record to dynamic systems of action.

This briefing document reviews the main themes and important ideas presented in the research paper "MoonCast: High-Quality Zero-Shot Podcast Generation". The paper introduces MoonCast, a novel system designed to generate natural, multi-speaker podcast-style speech from text-only sources using the voices of unseen speakers. The key innovation lies in addressing the challenges of long speech duration and spontaneity, which are limitations of many existing text-to-speech (TTS) systems.

This paper addresses the critical challenges of aligning superhuman artificial intelligence (AI) with human values, specifically focusing on scalable oversight and the dynamic nature of these values. The authors argue that existing approaches to scalable oversight, such as recursive reward modelling, often remove humans from the alignment loop entirely and so fail to account for the evolving nature of human preferences. To counter this, the paper proposes a novel algorithmic framework inspired by Iterated Amplification. This framework trains a superhuman reasoning model to decompose complex tasks into subtasks that can be evaluated and solved by aligned human-level AI. The central assumption of this approach is the "part-to-complete generalization hypothesis," which posits that the alignment of subtask solutions will generalise to the alignment of the complete solution. The paper outlines the proposed algorithm, discusses methods for measuring and improving this generalisation, and reflects on how this framework addresses key challenges in AI alignment.

The paper concludes by highlighting the introduction of the Multi-Agent System Failure Taxonomy (MASFT) as a "structured framework for understanding and mitigating MAS failures" and the development of a "scalable LLM-as-a-judge evaluation pipeline" for diagnosing failure modes in multi-agent systems (MASs). The intervention studies reveal that addressing MAS failures requires more than simple fixes, laying out a "clear roadmap for future research" focused on structural MAS redesigns. The open-sourcing of the dataset and LLM annotator further supports future work in this area. The authors note that "despite the growing interest in LLM agents, dedicated research on their failure modes is surprisingly limited," positioning their work as a "pioneering effort in studying failure modes in MASs" and underscoring the need for further research into robust evaluation metrics, common failure patterns, and effective mitigation strategies.

This paper addresses the challenge of automating high-stakes meta-review generation, a critical task in academic peer review that involves synthesising conflicting evaluations and deriving consensus. The authors argue that current Large Language Model (LLM)-based methods for this task are underdeveloped and susceptible to cognitive biases such as the anchoring effect and conformity bias, which hinder their ability to handle disagreements effectively. To overcome these limitations, the paper introduces the Cognitive Alignment Framework (CAF), a novel dual-process architecture inspired by Kahneman's dual-process theory of human cognition. CAF employs a three-step cognitive pipeline: review initialisation, incremental integration, and cognitive alignment. Empirical validation on the PeerSum dataset demonstrates that CAF outperforms existing LLM-based methods in terms of sentiment and content consistency.