Artificial intelligence and scientific research: this post features analyses by our amazing AI interns, who examined the opportunities and challenges of using LLMs to generate research ideas.
Comment #1 – Rafig
The paper by Si, Yang, and Hashimoto (2024) offers a fascinating glimpse into the future of AI-assisted research, especially given the growing influence of LLMs on fields like NLP. One cannot help but be impressed by the rapid progress in generating new ideas and insights. It is important, however, to step back and weigh the more critical implications of LLMs, particularly for human creativity and scientific inquiry. This reflection weighs novelty against feasibility, discusses the indispensable role of human oversight, raises questions of intellectual property, and considers the risks of over-reliance on AI.
Novelty vs. Feasibility: Striking the Right Balance
Perhaps the most intriguing finding of this study is that research ideas generated by LLMs were judged more novel than those proposed by human experts. This would suggest a quantum leap in the creative capability of AI. However, novelty does not always translate into value, especially in a field where the practicality and feasibility of an idea largely determine its success.
Human researchers bring a deep understanding of implementation constraints that large language models simply do not capture: technical feasibility, ethical considerations, and how new ideas fit within existing structures. LLMs generate ideas through pattern completion over large datasets; they lack the practical knowledge and experience needed to judge whether a concept is feasible or pertinent given current technological and ethical realities. So while LLMs can indeed produce novelty, many of their ideas are either unworkable or well ahead of what is practical today.
This is where the trade-off between novelty and feasibility becomes critical. LLMs can extend the horizon of possibility, but human judgment is still needed to sift through these ideas and select those that are viable. LLMs are powerful generators of new ideas and little else; no effective substitute yet exists for human judgment in critically evaluating those ideas and deciding whether they are worth pursuing. When does an idea become too novel to be workable, and under what conditions should human researchers set aside a potentially innovative but not actionable idea?
Human Oversight: An Indispensable Role
On the flip side, this work underlines one significant weakness: LLMs cannot reliably assess the quality or feasibility of their own ideas. This limitation shows that human judgment remains irreplaceable in the research process. Human researchers contribute experience, intuition, and a nuanced comprehension of complex systems, qualities that AI, despite all its rapid progress, cannot yet emulate.
While LLMs can produce a large number of ideas in very little time, not all of those ideas will be relevant or useful. This is where the human researcher comes in, contributing critical thinking and subject-matter expertise to filter the ideas and retain those worth developing. In this hybrid model of research, humans and AI each bring their unique strengths: AI supplies creativity and wide-angle perspectives, while humans add intellectual rigor and practical insight to mold these ideas into substantial contributions.
The need for human researchers to cooperate with AI in no way diminishes their role; on the contrary, it extends it. Researchers will need to acquire new skills for interacting effectively with AI: crafting prompts, curating outputs, and placing AI-generated ideas within the wider framework of the research at hand.
Intellectual Property and Recognition
The increasing role of AI in research also raises questions of intellectual property and recognition. If a large language model develops an original idea, or even a full-fledged research output, who gets credit: the AI or the human researcher behind it? The issue becomes more complicated as these tools grow capable of generating not just ideas but complete research outputs on their own. Does the person guiding the AI take all the credit, or does the machine deserve a share?
I think that even with LLMs, the human researcher remains the driving force behind any new idea. LLMs are instruments of creative thinking, but they cannot work meaningfully without human guidance on what is relevant and how results should be shaped to make sense. It is the researcher who takes the AI's output and molds it into constructive research. While AI deserves some credit, human researchers deserve most of it, since they infuse the process with context and rigor.
Conclusion
This paper offers an exciting glimpse into how LLMs might conjure up new research ideas. At the same time, it reveals limitations and real risks in relying too heavily on AI. Human oversight remains essential for identifying the AI-generated ideas that are not only novel but also feasible and impactful. The future of research lies in a collaborative relationship between AI and human researchers, in which each pushes the other forward. AI will enhance human creativity, but it will not replace the qualities that make human discovery so powerful.
Comment #2 – Javier
In the paper Can LLMs Generate Novel Research Ideas?, written by Chenglei Si, Diyi Yang, and Tatsunori Hashimoto (Stanford University), the authors address the question of creativity in LLMs (Large Language Models). They recruited 49 expert NLP (Natural Language Processing) researchers to write novel ideas for a paper on 7 NLP topics and compared them with the output of an LLM agent. The evaluation relied on 79 independent experts (PhDs and postdocs) who blindly reviewed all the ideas using a form inspired by ICLR and ACL (prestigious conferences in the field of computer science), with scores for criteria such as novelty, feasibility, and expected effectiveness. The human researchers had a monetary incentive ($300 per idea and a $1000 bonus for the top 5 human ideas), and all the proposals were formatted and standardized using an LLM so that writing style would not be a deciding factor.
With this setup, they built an LLM agent that was more complex than simply asking ChatGPT to write a novel paper: they created a whole pipeline in which a RAG system lets the model perform a literature review on the given research topic, after which the LLM generates around 4000 ideas and ranks them. Remarkably, when they tested the ranking system against real ICLR submissions, it agreed with human reviewers about 70% of the time (using Claude-3.5-Sonnet).
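To make the shape of such a pipeline concrete, here is a minimal sketch. Everything in it is a stand-in: retrieve_papers, generate_idea, and pairwise_judge are hypothetical placeholders for the paper's actual retrieval, generation, and Claude-3.5-Sonnet ranking components, whose code is not reproduced here.

```python
# Minimal sketch of a generate-then-rank ideation pipeline, in the
# spirit of the agent described above. All components are hypothetical
# placeholders, not the authors' implementation.
import random
from itertools import combinations

def retrieve_papers(topic: str, k: int = 10) -> list[str]:
    """Stand-in for the RAG step: fetch k related abstracts."""
    return [f"abstract about {topic} #{i}" for i in range(k)]

def generate_idea(topic: str, context: list[str]) -> str:
    """Stand-in for one LLM call that drafts a project idea."""
    return f"idea on {topic} inspired by {random.choice(context)}"

def pairwise_judge(idea_a: str, idea_b: str) -> str:
    """Stand-in for the LLM ranker; returns the preferred idea."""
    return max(idea_a, idea_b)  # a real judge would prompt an LLM here

def ideation_pipeline(topic: str, n_ideas: int = 50, top_k: int = 5) -> list[str]:
    context = retrieve_papers(topic)                   # literature review
    ideas = [generate_idea(topic, context) for _ in range(n_ideas)]
    ideas = list(dict.fromkeys(ideas))                 # drop exact duplicates
    wins = {idea: 0 for idea in ideas}                 # pairwise-win tally
    for a, b in combinations(ideas, 2):
        wins[pairwise_judge(a, b)] += 1
    return sorted(ideas, key=wins.get, reverse=True)[:top_k]

print(ideation_pipeline("prompting for factuality"))
```

At 4000 ideas, exhaustive pairwise comparison would be far too expensive, so a real ranker would need a more frugal comparison scheme, but the generate, deduplicate, rank skeleton is the same.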
After all of this, what conclusions did the authors reach? LLM ideas were consistently rated as significantly more novel than human expert ideas. But that is not the whole story: the authors also found that LLMs lack diversity in idea generation, repeating previously generated ideas even when explicitly told not to, and that LLMs cannot yet evaluate ideas reliably, since the ranking system showed low agreement with humans when scoring the ideas produced during the study.
Although the scope of the study is narrow, covering only 7 NLP topics, it reopens the debate over whether machines will at some point surpass humans in creative tasks. The question is a delicate one because creativity is rooted in our identity as human beings and, until now, has been unrivaled by any other species. Recent advances such as image generation with diffusion models have brought the discussion into the mainstream, and social media is full of arguments from both sides. Here, however, we will not address what can and cannot be considered art, but rather the novelty capacity of LLM generations.
To form an informed opinion, we first have to understand what LLMs essentially are. They are a set of weights, fixed after training and arranged in a Transformer architecture. Given a sequence of tokens (here, words and parts of words), the model outputs a probability distribution from which the next token is sampled; the process is repeated to keep generating sentences and paragraphs. In this sense LLMs are completely static: given the same sequence of tokens, they will always produce the same probability distribution (although some stochasticity can be added in the sampling step). This is not the case for the human brain, which is more dynamic: the likelihood of our outputs is not frozen, since we are constantly ingesting new information, processing it, and being conditioned by it to some extent.
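A tiny numerical example makes this static-but-stochastic behavior concrete. The logits below are invented for illustration; the point is only that the same input always yields the same distribution, while sampling from it can still vary between runs.

```python
# Toy next-token step: fixed weights give a fixed distribution,
# but sampling from that distribution can still vary run to run.
# The logits and tokens below are made up, not from any real model.
import numpy as np

logits = np.array([2.0, 1.0, 0.2, -1.0])     # scores for 4 candidate tokens
tokens = ["the", "a", "cat", "<eos>"]

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()                           # numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)                       # deterministic given the input
print(dict(zip(tokens, probs.round(3))))

rng = np.random.default_rng()
samples = [tokens[rng.choice(len(tokens), p=probs)] for _ in range(5)]
print(samples)                                # stochasticity enters only here
```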
So, are the outputs of LLMs not novel, since the model is essentially deterministic? Not quite. The sequence of tokens fed into the model can steer it into repeating itself over and over (if the prompt is too vague) or into surprising insights (if it is more detailed). Recent chain-of-thought approaches (like OpenAI's o1) try to exploit this fact: they use reinforcement learning to teach the model how to explore the “useful” sequences that lead it to produce better outputs. RAG systems, in turn, provide the model with relevant information unseen during pre-training, patching the “fixed knowledge” issue intrinsic to the tool by design.
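As a minimal sketch of that retrieval idea, here is a toy RAG step that uses TF-IDF similarity as a stand-in for a real embedding model; the documents and the prompt format are invented for illustration.

```python
# Minimal retrieval-augmented prompting sketch. TF-IDF similarity is
# an illustrative stand-in for a real embedding model; the documents
# and prompt format are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Survey of chain-of-thought prompting for reasoning tasks.",
    "A 2024 paper on retrieval-augmented generation for QA.",
    "Notes on diffusion models for image generation.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    vec = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

query = "How does retrieval help language models stay current?"
context = retrieve(query, documents)
# The retrieved text is prepended to the LLM prompt, giving the frozen
# model access to information it never saw during pre-training.
prompt = f"Context: {context[0]}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```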
To summarize: although the model by itself is not dynamic in nature, the outcome can potentially be novel if two conditions hold. First, the model must have correctly learned the abstract concepts behind the words, so that it can generalize them to novel situations. Second, it must have some way to interact with the environment, such as updating its knowledge base or even running experiments to test its own hypotheses. Under those conditions, a suitable prompt can condition the model to produce novelty without necessarily needing a human in the loop. Further research in this area is still needed, and exploring new architectures may yield better solutions, so the topic remains open for investigation.
Comment #3 – Samat
Large Language Models (LLMs) have attracted enormous attention for their ability to handle tasks such as writing text, analyzing data, and generating scientific ideas. There is much debate over whether LLMs can outperform human experts at generating original, high-quality scientific proposals. The article “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers”, written by Chenglei Si, Diyi Yang, and Tatsunori Hashimoto of Stanford University, represents an important step towards answering this question. The authors performed a large-scale study comparing ideas proposed by experts in natural language processing (NLP) with those generated by LLMs. While the article's conclusions are noteworthy, I think the results need further analysis, and the possible limitations of the experiment raise important questions about the future of using LLMs in scientific studies.
Novelty of ideas as the main advantage of LLMs
One of the key takeaways from the article is that the ideas proposed by the LLMs were rated as more innovative than the ideas of the experts. This is a notable achievement, because novelty is one of the fundamental criteria in scientific research. It makes sense to assume that LLMs trained on huge amounts of data are able to combine and connect information in a way that can lead to new ideas. This is supported by the statistically significant results of the study, where LLM ideas were rated higher on the novelty criterion.
However, this result raises certain questions. The article notes, for example, that the novelty ratings may stem from LLMs generating ideas that experts are not inclined to suggest, even when those ideas might be useful. This may mean that LLMs, with their capacity for unexpected and less traditional proposals, can foster creativity. Still, it remains unclear to what extent such ideas can really form the basis of successful research: novelty alone guarantees neither usefulness nor practicality.
The problem of feasibility and detail in ideas
Although the LLMs scored high on novelty, their ideas were rated lower on feasibility and detail. The paper notes that many of the ideas proposed by LLMs were either too general or lacked sufficiently well-defined detail to be implementable in practice. This is an important limitation: LLMs can generate potentially interesting but insufficiently developed ideas.
I think this is the key challenge LLMs face on their way to becoming meaningful research assistants. To be useful, ideas need to be not only new but also feasible. Thus, LLMs can generate interesting starting points, but it will take considerable refinement by human experts to make them usable in real-world projects.
Limitations and opportunities for improvement
One of the major limitations of the study is the insufficient diversity of the generated LLM ideas. The authors note that as the number of generated ideas grows, the percentage of unique proposals drops significantly, which suggests a definite limit to the ability of LLMs to generate varied ideas. Addressing this problem may require more advanced methods for filtering or processing ideas, as well as improved algorithms that provide more creativity and uniqueness; one simple filtering approach is sketched below.
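One simple version of such filtering is to discard near-duplicate ideas by similarity. The sketch below assumes TF-IDF cosine similarity and an arbitrary threshold as stand-ins for the embedding model and tuned cutoff a real agent would use.

```python
# Sketch of near-duplicate filtering for generated ideas. TF-IDF
# similarity and the 0.5 threshold are illustrative stand-ins for a
# real embedding model and a tuned cutoff; the ideas are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "Use retrieval to reduce hallucination in QA systems.",
    "Reduce hallucination in QA by retrieving supporting documents.",
    "Prompt models to critique their own chain of thought.",
]

def deduplicate(texts: list[str], threshold: float = 0.5) -> list[str]:
    """Keep an idea only if it is not too similar to one already kept."""
    matrix = TfidfVectorizer().fit_transform(texts)
    kept: list[int] = []
    for i in range(len(texts)):
        if not kept or cosine_similarity(matrix[i], matrix[kept]).max() < threshold:
            kept.append(i)
    return [texts[i] for i in kept]

print(deduplicate(ideas))
```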
Furthermore, the article points out important problems with LLM self-evaluation. The authors highlight that LLMs do a poor job of evaluating their own ideas. This is a serious limitation: to build effective autonomous scientific agents, the models must be able not only to generate ideas but also to evaluate them. Otherwise, there is a risk that many of the proposed ideas will be either too simple or unrealizable.
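To see what “low agreement” means in practice, here is how pairwise agreement between an LLM judge and human reviewers can be computed; the preference data below is invented for illustration.

```python
# Toy computation of judge/human agreement on pairwise preferences.
# The preference lists are invented; in the study, the ranker's
# agreement was measured against outcomes such as ICLR decisions.
human_prefs = ["A", "B", "A", "A", "B", "A"]   # human pick per idea pair
model_prefs = ["A", "B", "B", "A", "B", "B"]   # LLM judge pick per pair

matches = sum(h == m for h, m in zip(human_prefs, model_prefs))
print(f"pairwise agreement: {matches / len(human_prefs):.0%}")  # 67% here
```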
Ethical issues and the future of research
The article raises important ethical issues related to the use of LLMs in scientific research. One is the question of intellectual authorship: if an LLM generates research ideas, who should be credited for their creation? This calls for a review of existing norms and standards of scholarly activity.
In addition, the use of LLMs may lead to undesirable consequences, such as the generation of ideas that can be put to disreputable ends. For example, LLMs may suggest ways to carry out cyberattacks or other malicious activities, raising questions about the security and oversight of such systems.
However, despite these potential problems, I believe that LLMs can be useful tools for accelerating scientific discovery. This is especially true in the early stages of research, where it is important to generate a large number of ideas for later analysis and refinement. The combined use of LLMs and experts can lead to faster and more efficient creation of new scientific solutions.
The article “Can LLMs Generate Novel Research Ideas?” represents an important step in exploring the potential of LLMs for scientific work. Although the results show that LLMs are capable of generating new and interesting ideas, the problems with feasibility and lack of detail indicate that they still have a long way to go before they can become fully fledged scientific agents. Still, LLMs have the potential to become a useful tool for researchers, especially when combined with human supervision and refinement. Future studies in this area should focus on improving LLM self-evaluation and on exploring the ethical and practical aspects of their use.