LangChain4j: Chat With Documents
In this blog, you will take a closer look at how you can chat with your documents using LangChain4j and LocalAI and learn some basics about prompt engineering.
In this blog, you will take a closer look at how you can chat with your documents using LangChain4j and LocalAI. Besides that, you will learn some basics about prompt engineering. Enjoy!
Introduction
In a previous post, chatting with documents using LangChain4j and LocalAI was briefly discussed. In this blog, you will take a closer look at the capabilities of this functionality. You will do so by means of two Wikipedia articles, which serve as source documents: the discography of Bruce Springsteen and the list of songs he recorded. The interesting part of these documents is that they contain facts and are mainly in a table format. The goal of this blog is to find out whether the Large Language Model (LLM) is able to answer questions about these documents correctly.
You will also use prompt engineering techniques in order to get to the correct answer. I am not a prompt engineering expert, but I did find some good resources about it which I can recommend:
- Prompt engineering by OpenAI;
- Prompt engineering guide;
- Generative AI for beginners by Microsoft.
The sources used in this blog can be found on GitHub.
Prerequisites
The prerequisites for this blog are:
- Basic knowledge about what a Large Language Model is;
- Basic Java knowledge, Java 21 is used;
- Basic knowledge of prompt engineering; see the links in the introduction for more information;
- Basic knowledge of LangChain4j, see a previous blog;
- You need LocalAI if you want to run the examples; see a previous blog for how to make use of LocalAI. Version 2.2.0 is used for this blog.
Without Documents
First, you are going to ask some questions about the documents but without providing the documents. This way, it is easier to see the effect when the documents are provided.
Five questions are asked by means of the following code.
public static void main(String[] args) {
    askQuestion("on which album was \"adam raised a cain\" originally released?");
    askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
    askQuestion("what is the highest chart position of the album \"tracks\" in canada?");
    askQuestion("in which year was \"Highway Patrolman\" released?");
    askQuestion("who produced \"all or nothin' at all?\"");
}

private static void askQuestion(String question) {
    ChatLanguageModel model = LocalAiChatModel.builder()
            .baseUrl("http://localhost:8080")
            .modelName("lunademo")
            .temperature(0.0)
            .build();
    String answer = model.generate(question);
    System.out.println(answer);
}
Ensure that LocalAI is running and properly configured. Run the code, and the LLM answers as follows:
- The song “Adam Raised a Cain” was originally released on the album “Sticky Fingers” by The Rolling Stones in 1971.
  This is obviously wrong; the correct answer is “Darkness on the Edge of Town.”
- The highest chart position of “Greetings from Asbury Park, N.J.” in the US was #14 on the Billboard 200 chart.
  This is wrong; the correct answer is #60.
- The album “Tracks” by Metallica did not have a specific chart position in Canada as it was released as a box set and did not chart on any Canadian album charts.
  This is wrong; the album is not by Metallica. However, it is a box set that did not chart in Canada. So, the answer is partially correct.
- “Highway Patrolman” was released in 1951.
  This is wrong; the correct answer is 1982.
- The song “All or Nothin’ at All” was written by Jimmy Van Heusen and performed by Frank Sinatra.
  This answer is partially correct. A song with the same title exists and is performed by Frank Sinatra. It was, however, not written by Jimmy Van Heusen but by Arthur Altman and Jack Lawrence.
The conclusion is that none of the answers is fully correct. This can be partly due to the fact that not enough information is provided (Bruce Springsteen is never mentioned in the questions), but mainly it is because the LLM does not have the knowledge to answer the questions correctly and just makes things up.
Chat With Documents
Let’s see what happens when you provide the LLM with documents that contain the correct answers. The code is changed accordingly. See a previous blog for an extensive explanation. In short, you load the documents, split them into text segments of 500 characters, embed the segments, and eventually, you chain the embedding store with the language model and ask the question.
private static void askQuestion(String question) {
    EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
    EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

    // Split the documents into segments of at most 500 characters (no overlap),
    // embed the segments, and keep the embeddings in the in-memory store.
    EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
            .documentSplitter(DocumentSplitters.recursive(500, 0))
            .embeddingModel(embeddingModel)
            .embeddingStore(embeddingStore)
            .build();
    Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf"));
    Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf"));
    ingestor.ingest(springsteenDiscography, springsteenSongList);

    ChatLanguageModel model = LocalAiChatModel.builder()
            .baseUrl("http://localhost:8080")
            .modelName("lunademo")
            .temperature(0.0)
            .timeout(Duration.ofMinutes(5))
            .build();

    // The chain retrieves segments from the store and passes them to the LLM with the question.
    ConversationalRetrievalChain chain = ConversationalRetrievalChain.builder()
            .chatLanguageModel(model)
            .retriever(EmbeddingStoreRetriever.from(embeddingStore, embeddingModel))
            .build();
    String answer = chain.execute(question);
    System.out.println(answer);
}
Run the code, and the LLM answers as follows:
- The album on which “Adam Raised a Cain” was originally released is “Darkness on the Edge of Town” by Bruce Springsteen, which was released in 1978.
  This starts well; the answer is correct.
- The highest chart position of “Greetings from Asbury Park, N.J.” in the US is #60.
  This answer is correct.
- The highest chart position of the album “Tracks” in Canada is #1 on the Billboard Canadian Albums chart. This information can be found on the website of the Canadian Recording Industry Association (ARIA) and on the Billboard website.
  This answer is not correct. The ARIA is the Australian Recording Industry Association; for Canada, it is the CRIA. The problem with this answer is that the LLM makes up an answer and even refers to sources where it supposedly found the answer. This could be a convincing answer, although it is not correct.
- The song “Highway Patrolman” was released in 1982 as part of Bruce Springsteen’s album “Nebraska”.
  This answer is correct. Initially, I made a typo and wrote “Higway”, forgetting an h. In that case, the LLM answered with:
  The song “Highway Patrolman” was released in 1995 as part of Bruce Springsteen’s album “The Ghost of Tom Joad.”
  This answer is not correct. It is correct, though, that “The Ghost of Tom Joad” was released in 1995. The conclusion is that a minor typo can make a big difference.
- The song “All or Nothin’ at All” was produced by Bruce Springsteen, Roy Bittan, and Chuck Plotkin and was released on the album Human Touch in 1992.
  This answer is correct but not complete. The LLM should also have mentioned Jon Landau as a producer.
The conclusion is that 4 out of 5 answers are correct or nearly correct. One answer is completely wrong. Still, the result is quite amazing because, also in this case, it is never mentioned that the questions are about Bruce Springsteen.
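To get a feeling for what splitting a table-heavy document into small segments can do, here is a minimal sketch in plain Java. It is not the LangChain4j recursive splitter, which is likely more careful about where it cuts; it simply cuts the text every fixed number of characters. The class name `FixedSizeSplitSketch`, the column header, and the small limit of 40 characters are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Naive fixed-size splitter for illustration; NOT the LangChain4j implementation.
class FixedSizeSplitSketch {

    // Cuts the text into consecutive segments of at most maxChars characters, without overlap.
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> segments = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            int end = Math.min(start + maxChars, text.length());
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Hypothetical flattened table: a header line followed by one album row.
        String table = "Title | US | AUS\n"
                + "Greetings from Asbury Park, N.J. | 60 | 71\n";
        // A small limit is used so that the effect is visible on a short text.
        for (String segment : split(table, 40)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

With this limit, the row is cut in the middle of the album title, and the chart positions land in a different segment than the column header. Something similar may happen to the tables in the PDF files, which could be one reason why questions about chart positions are hard to answer.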
Chat With Documents With Chat Memory
Let’s see whether it is possible to receive a correct answer to question 3 (what is the highest chart position of the album “tracks” in Canada?). It is possible to instruct the LLM by means of System Messages. As mentioned in a previous post, the available message types are:
- UserMessage: a ChatMessage coming from a human/user.
- AiMessage: a ChatMessage coming from an AI/assistant.
- SystemMessage: a ChatMessage coming from the system.
In order to add these System Messages, you need to add a ChatMemory to the ConversationalRetrievalChain. Also, you change the method signature by adding the list of ChatMessages you want to add. The changed code becomes the following:
private static void askQuestion(String question, List<ChatMessage> chatMessages) {
    EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
    EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
    ...
    ChatMemory chatMemory = MessageWindowChatMemory.builder()
            .maxMessages(20)
            .build();
    for (ChatMessage chatMessage : chatMessages) {
        chatMemory.add(chatMessage);
    }
    ConversationalRetrievalChain chain = ConversationalRetrievalChain.builder()
            .chatLanguageModel(model)
            .chatMemory(chatMemory)
            .retriever(EmbeddingStoreRetriever.from(embeddingStore, embeddingModel))
            .build();
    String answer = chain.execute(question);
    System.out.println(answer);
}
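The idea behind a message window with a maximum number of messages can be sketched in a few lines of plain Java. This is only a conceptual illustration and not the LangChain4j implementation: the class `WindowMemorySketch` is hypothetical, messages are plain strings, and any special treatment the library may give to system messages is ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Conceptual sketch of a sliding message window; NOT the LangChain4j implementation.
class WindowMemorySketch {

    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    WindowMemorySketch(int maxMessages) {
        if (maxMessages <= 0) {
            throw new IllegalArgumentException("maxMessages must be positive");
        }
        this.maxMessages = maxMessages;
    }

    // Adds a message and drops the oldest ones once the window is full.
    void add(String message) {
        messages.addLast(message);
        while (messages.size() > maxMessages) {
            messages.removeFirst();
        }
    }

    // Returns a copy of the messages currently in the window, oldest first.
    List<String> messages() {
        return new ArrayList<>(messages);
    }

    public static void main(String[] args) {
        WindowMemorySketch memory = new WindowMemorySketch(2);
        memory.add("system: Use the provided documents to answer the questions.");
        memory.add("user: first question");
        memory.add("ai: first answer");
        System.out.println(memory.messages());
    }
}
```

In this naive sketch, the system message is the first one to be dropped when the window overflows. With a window of 20 messages and only a few System Messages plus one question, as used in this blog, nothing would be dropped.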
Attempt #1
In a first attempt, you instruct the LLM to only use the provided documents as a source. You also ask the LLM to simply say “I could not find an answer” if it cannot find the answer in the provided documents.
ArrayList<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the highest chart position of the album \"tracks\" in canada?", chatMessages);
The LLM answers with:
Based on the information provided, the highest chart position of the album “Tracks” in Canada is not specified. However, according to Billboard’s Canadian Hot 100 chart, the album’s single “The Rising” peaked at number 29 in 2002.
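This answer is correct: no Canadian chart position is listed for the album. However, the LLM does not use the requested sentence, and you also receive some extra information that was not asked for and is unrelated to the question.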
Attempt #2
Instruct the LLM to not provide any additional information.
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("Do not provide any additional information."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the highest chart position of the album \"tracks\" in canada?", chatMessages);
The LLM answers with exactly the same answer as in attempt #1. This did not have any influence at all.
Attempt #3
The LLM was able to respond correctly to the question about the album Tracks in Canada. Let’s try this again, but for the album Greetings from Asbury Park, N.J., in Germany.
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the highest chart position of the album \"greetings from Asbury Park, N.J.\" of Bruce Springsteen in germany?", chatMessages);
The LLM answers with:
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” by Bruce Springsteen in Germany is not mentioned. However, according to the Billboard Canadian Hot 100 Chart Positions, the album peaked at number 10 in Canada.
This answer is correct, but here, too, unrelated additional information is given.
Attempt #4
Will the LLM answer correctly in case of an album that has a chart position? Greetings from Asbury Park, N.J., was released in Australia and reached position #71. Let’s find out.
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the highest chart position of the album \"greetings from Asbury Park, N.J.\" of Bruce Springsteen in Australia?", chatMessages);
The LLM answers with:
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” of Bruce Springsteen in Australia was #42. This information can be found on the ARIA Charts website, which archives the original chart data from March 28, 2013. Additionally, the album was certified gold in Canada by Music Canada in 2017.
This answer is not correct. The additional information provided is also not correct. The document Bruce_Springsteen_discography.pdf does mention on page 33 in reference 42 the information about the ARIA Charts website. It seems that the LLM is a bit confused about this and mixes some information.
The document states the following on page 33 in the references section:
- “ARIA Australian Top 50 Albums | Australia’s Official Top 50 Albums – ARIA Charts” (https://web.archive.org/web/20130328105949/http://www.ariacharts.com.au/chart/albums/987). Archived from the original (http://ariacharts.com.au/chart/albums/987) on March 28, 2013. Retrieved March 26, 2013.
Attempt #5
In the question, you asked for the highest chart position. However, the table containing the information says peak chart position. Let’s find out whether it makes a difference when you change the word highest to peak.
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the peak chart position of the album \"greetings from Asbury Park, N.J.\" of Bruce Springsteen in Australia?", chatMessages);
The LLM answers with:
The peak chart position of the album “Greetings from Asbury Park, N.J.” by Bruce Springsteen in Australia is not mentioned in the provided information. However, the album did reach the top 50 on the Australian Singles Chart in 1973.
Now, the LLM says that it cannot find the answer. It also adds some weird additional information: the album supposedly reached the Singles Chart, while it obviously is an album and not a single.
Attempt #6
One of the original questions that did provide a correct answer was: what is the highest chart position of the album Greetings from Asbury Park, N.J., in the US? Remember that the LLM responded with #60, which is the correct answer. Let’s ask it again with the System Messages.
chatMessages.add(new SystemMessage("Use the provided documents Bruce_Springsteen_discography.pdf and List_of_songs_recorded_by_Bruce_Springsteen.pdf to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("what is the highest chart position of the album \"Greetings from Asbury Park, N.J.\" of Bruce Springsteen in the US?", chatMessages);
The LLM answers with:
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” of Bruce Springsteen in the US was #67 on the Billboard 200 chart.
The answer is not correct anymore.
Even when removing the System Messages and asking the original question (without Bruce Springsteen in the question), the LLM now returns the wrong answer:
The highest chart position of the album “Greetings from Asbury Park, N.J.” in the US is #2.
Chat With Documents With Extra Information
What happens when you extract the information about the album Greetings from Asbury Park, N.J., into a separate document, formatted a bit differently? Add a document Bruce_Springsteen_chart_positions.txt in the resources directory with the following content.
The album "Greetings from Asbury Park, N.J." was not released in Germany, Canada, Ireland, Netherlands, New Zealand, Norway
The album "Greetings from Asbury Park, N.J." peaked at #60 in the US.
The album "Greetings from Asbury Park, N.J." peaked at #71 in Australia.
The album "Greetings from Asbury Park, N.J." peaked at #35 in Sweden.
The album "Greetings from Asbury Park, N.J." peaked at #41 in the UK.
Add the document as an extra resource.
private static void askQuestion(String question, List<ChatMessage> chatMessages) {
    ...
    Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf"));
    Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf"));
    Document extra = loadDocument(toPath("example-files/Bruce_Springsteen_chart_positions.txt"));
    ingestor.ingest(springsteenDiscography, springsteenSongList, extra);
    ...
}
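The text file was written by hand. For larger tables, sentences like these could also be generated from the table data. The following is a minimal sketch of that idea in plain Java; the class `ChartSentenceSketch` and the map used as input are assumptions for illustration and are not part of the sources of this blog.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that turns table cells into one plain sentence per country.
class ChartSentenceSketch {

    // Builds one sentence per entry; the map key is the country, the value is the peak position.
    static List<String> toSentences(String album, Map<String, Integer> peakByCountry) {
        List<String> sentences = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : peakByCountry.entrySet()) {
            sentences.add("The album \"" + album + "\" peaked at #" + entry.getValue()
                    + " in " + entry.getKey() + ".");
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Peak positions as listed in the text file; LinkedHashMap keeps the insertion order.
        Map<String, Integer> peaks = new LinkedHashMap<>();
        peaks.put("the US", 60);
        peaks.put("Australia", 71);
        peaks.put("Sweden", 35);
        peaks.put("the UK", 41);
        toSentences("Greetings from Asbury Park, N.J.", peaks).forEach(System.out::println);
    }
}
```

Each sentence repeats the album title, so every sentence carries its full context even when it ends up in a segment of its own.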
Ask the questions about Australia and Canada again.
public static void main(String[] args) {
    ArrayList<ChatMessage> chatMessages = new ArrayList<>();
    chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
    chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
    askQuestion("what is the highest chart position of the album \"Greetings from Asbury Park, N.J.\" in Australia?", chatMessages);

    chatMessages.clear();
    chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
    chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
    askQuestion("what is the highest chart position of the album \"Greetings from Asbury Park, N.J.\" in Canada?", chatMessages);
}
The LLM answers with:
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” in Australia is #71.
and
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” in Canada is #10.
The answer about Australia is correct; the answer about Canada is not correct.
If you remove both PDF files and only add the text file, the LLM answers both questions correctly:
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” in Australia is #71.
and
Based on the information provided, the highest chart position of the album “Greetings from Asbury Park, N.J.” in Canada is not specified. However, we do know that it did not reach the top 40 in Canada.
Incomplete Answer
The LLM responded correctly but incompletely to question 5: who produced “all or nothin’ at all”?
Let’s see whether a complete answer can be retrieved from the LLM by asking the following questions. Note that no chat memory is used here. Also note that the last three questions only differ slightly from each other: one with a question mark between double quotes, one with an extra dot at the end, and one with the question mark at the right position outside of the double quotes.
askQuestion("who produced \"all or nothin' at all?\" of Bruce Springsteen. Be as complete as possible.");
askQuestion("who produced the song \"all or nothin' at all?\" from the album Human Touch of Bruce Springsteen. Be as complete as possible.");
askQuestion("who produced \"all or nothin' at all?\"");
askQuestion("who produced \"all or nothin' at all?\".");
askQuestion("who produced \"all or nothin' at all\"?");
The LLM answers are as follows.
- The song “All or Nothin’ at All” was produced by Bruce Springsteen and his longtime collaborator, Steven Van Zandt. It was released on Springsteen’s 1995 album, “The Ghost of Tom Joad.”
  This answer is completely wrong: Steven Van Zandt did not produce this song, and the song is from the album Human Touch, released in 1992.
- The song “All or Nothin’ at All” from the album Human Touch by Bruce Springsteen was produced by Bruce Springsteen himself.
  This answer is correct but less complete than before. This is weird.
- The song “All or Nothin’ at All” was produced by Bruce Springsteen, Roy Bittan, and Chuck Plotkin and was released on the album Human Touch in 1992.
  This answer is correct, but again, Jon Landau is forgotten.
- The song “All or Nothin’ at All” was produced by Bruce Springsteen and his longtime collaborator, Chuck Plotkin. It was released on Springsteen’s 1992 album, “Human Touch.”
  Only a dot at the end is different from the previous question, but you receive a correct but less complete answer.
- The song “All or Nothin’ at All” was produced by Bruce Springsteen and his longtime collaborator, Chuck Plotkin. It was released on Springsteen’s 1992 album, “Human Touch.”
  This is the same answer as the previous one.
The attempts to retrieve a complete answer were not successful. Even minor changes resulted in a different, often worse answer.
Final Solution
As a final solution, the chat memory and the extra source document are combined. Ask the original questions again (also the question with the typo).
ArrayList<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
askQuestion("on which album was \"adam raised a cain\" originally released?", chatMessages);
askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?", chatMessages);
askQuestion("what is the highest chart position of the album \"tracks\" in canada?", chatMessages);
askQuestion("in which year was \"Highway Patrolman\" released?", chatMessages);
askQuestion("in which year was \"Higway Patrolman\" released?", chatMessages); // with typo
askQuestion("who produced \"all or nothin' at all?\"", chatMessages);
The LLM answers as follows:
- The album on which “Adam Raised a Cain” was originally released is “Darkness on the Edge of Town” by Bruce Springsteen, which was released in 1978.
  This answer is correct.
- Based on the information provided, the highest chart position of “Greetings from Asbury Park, N.J.” in the US is #60.
  This answer is correct.
- Based on the information provided, the highest chart position of the album “Tracks” in Canada is not specified. However, according to Billboard’s Canadian Hot 100 chart, the album’s single “The Rising” peaked at number 29 in 2002.
  This answer is correct, but the additional information about The Rising is not correct.
- The answer to the question “in which year was ‘Highway Patrolman’ released?” cannot be found in the provided information. However, based on the information provided, the song “Highway Patrolman” was released on the album “Nebraska” by Bruce Springsteen in 1982.
  The answer is correct, and the LLM provides information on how it deduced the answer.
- The answer to the question is “2006”.
  This answer is incorrect, but you saw this earlier. Making a typo seems to have a big influence on the answer.
- The answer to your question is “Bruce Springsteen”.
  The answer is correct but incomplete.
Apart from the question with the typo, the answers to the questions are correct in this solution, although not always complete. You still have to verify the answers, as information can be incomplete or might be wrong if you have a typo in your question.
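Verifying answers by hand gets tedious when you change prompts often. A rough first check could be automated, as in the following minimal sketch in plain Java. The class `AnswerCheckSketch` is hypothetical, and the `Function` passed to it is a stand-in for the real chain; here a canned answer is used so that the sketch runs without LocalAI.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for a rough, substring-based check of LLM answers.
class AnswerCheckSketch {

    // Returns true when the answer mentions the expected fact, ignoring case.
    static boolean mentions(String answer, String expectedFact) {
        return answer.toLowerCase(Locale.ROOT).contains(expectedFact.toLowerCase(Locale.ROOT));
    }

    // Asks every question and counts the answers that mention the expected fact.
    static int countCorrect(Map<String, String> expectedByQuestion, Function<String, String> ask) {
        int correct = 0;
        for (Map.Entry<String, String> entry : expectedByQuestion.entrySet()) {
            String answer = ask.apply(entry.getKey());
            boolean ok = mentions(answer, entry.getValue());
            System.out.println((ok ? "OK    " : "CHECK ") + entry.getKey());
            if (ok) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("on which album was \"adam raised a cain\" originally released?",
                "Darkness on the Edge of Town");
        expected.put("in which year was \"Highway Patrolman\" released?", "1982");
        // Stand-in for the real chain: always returns the same canned answer.
        Function<String, String> fakeChain =
                question -> "It was released in 1982 on the album Nebraska.";
        int correct = countCorrect(expected, fakeChain);
        System.out.println(correct + " of " + expected.size() + " answers mention the expected fact");
    }
}
```

A substring check like this is crude: it does not detect wrong additional information, and an expected value such as 60 would also match 600. It can point you to the answers that need a closer look, but it does not replace reading them.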
Conclusion
In this post, you experimented a bit more with chat with documents, and you used some prompt engineering techniques. The correctness of the answer you receive highly depends on the following:
- The prompt you use: which question do you ask, and how do you phrase it?
- Instructions for the LLM by means of System Messages.
- The format of the source data.
- Any typos you make.
Published at DZone with permission of Gunter Rotsaert, DZone MVB. See the original article here.