Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Broadly, AI is the ability of a computer system to mimic human intelligence through math and logic, while ML builds on AI with methods that "learn" from experience rather than relying on explicit, hand-written instructions. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Categorizing Content Without Labels Using Zero-Shot Classification
A Guide to Leveraging AI for Effective Knowledge Management
In today's fast-moving development environment, automating routine manual tasks is an important source of business competitiveness. Manual, repetitive work slows innovation considerably, which is why automation has become a core part of modern software development practice. Automating developer routines with Swift not only simplifies workflows but also reduces error rates and improves productivity. It also gives teams a sandbox for testing new APIs, technologies, and approaches, unlocking significant value for experimentation. While this article offers an overview, the key insight lies in how Swift's powerful syntax and modern automation techniques remove much of the routine work that developers encounter, freeing them to focus on more creative and strategic tasks.

Smoothing Development With Automation in Swift

It is worth noting that Swift is not just Apple's powerful and intuitive language for building iOS applications; it has proved useful well beyond app development itself. Because it can automate almost any development task, it saves the time otherwise spent on mundane processes. From generating boilerplate code to automating builds and test runs, Swift lets developers write scripts and custom tools that improve their workflows. Swift automation also helps teams simplify the whole process by building systems for manipulating files, parsing data, and managing dependencies. Scripting repetitive tasks, such as setting up projects or formatting code, keeps the codebase consistent and eliminates human error.

For example, consider a structured development process with strict rules for handling Jira tickets and Git branches. A Jira ticket could require detailed information about the active development branch, along with links for reviewers or leads, such as GitLab compare links and Jenkins job links. The Git branch itself must be created correctly: following naming conventions, with a well-defined ticket description, an initial commit, and possibly even a tag. While these steps help maintain order, they can quickly become tedious. Even when developers memorize each step, it remains a repetitive and uninspiring chore, exactly the kind of routine that automation can tackle.

Beyond these processes, Swift automation can handle more complex tasks such as checking dependency complexity, monitoring hierarchy integrity, analyzing log files for patterns, and symbolicating crash files. This aligns well with Agile methodologies, where flexibility and iterative development are key: developers can adapt quickly to changing requirements while automation handles the time-consuming background work, ensuring faster delivery without compromising quality.

Data-Driven Automation: Improving Code Quality

Data-driven insights are increasingly a cornerstone of decision-making in software development, and automation supports the gathering and analysis of the relevant metrics.
Using Swift, a developer can write scripts that pull data from performance logs, test reports, code quality tools, and other sources to produce actionable insights. The team is then in a position to find bottlenecks or performance problems early in the development cycle. This is further strengthened by integrating Swift with continuous integration platforms such as Jenkins or Xcode Server. The resulting insights give teams an opportunity to reconsider their strategies and to base decisions on real data instead of assumptions.

Leverage AI for Advanced Automation

Artificial intelligence is another frontier with which Swift can be combined to automate developer routines. AI-powered tools and frameworks can take automation to the next level by providing intelligent suggestions for code completion, error detection, and even predictive maintenance of software systems. With AI in the loop, Swift enables developers to build smarter systems that understand user interactions and solve problems proactively before they escalate. Applications developed in Swift can, for example, embed machine learning models that predict likely bugs, suggest improvements, or optimize resource usage. Furthermore, Swift's integration with Apple's Core ML lets developers embed AI models directly into their applications, extending automation with features such as real-time image recognition, natural language processing, and predictive analytics.

Measuring Success: Automation in Action

The success of automation can be measured in several ways, including reduced development time, higher code quality, and faster time-to-market. The automated systems themselves can show developers how effective the automation is and whether it genuinely adds value to the organization rather than additional complexity. For example, Swift scripts can track build times, test coverage, and the frequency of code changes to show how automation improves development. This data orientation ensures that resources are spent only on the most value-added tasks and that automation tools are continuously refined to improve their performance.

Best Practices for Automating With Swift

To maximize the benefits of automation with Swift, developers should adhere to a few best practices:

Start small: Automate repetitive, error-prone tasks such as file generation or dependency management before moving on to more complex workflows (see the sketch after this list).
Continuous integration: Integrate Swift scripts with CI tools to automate testing, deployment, and code reviews.
Monitor and refine: Use data-driven insights to measure the impact of automation on development efficiency, and continuously refine automation scripts for maximum benefit.
Leverage AI: Integrate AI tools and frameworks to build intelligent automation systems that can predict and solve problems in real time.
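To make the "start small" idea concrete, here is a minimal, illustrative sketch (not taken from the article) of the kind of routine described earlier: creating a correctly named Git branch for a ticket and making an initial commit. The ticket ID, naming convention, and shell helper are assumptions for this example.

Swift

import Foundation

// Run a command-line tool and wait for it to finish (helper assumed for this sketch).
@discardableResult
func shell(_ arguments: [String]) throws -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = arguments
    try process.run()
    process.waitUntilExit()
    return process.terminationStatus
}

let ticketID = "JIRA-1234"                         // hypothetical ticket
let summary = "add-login-analytics"                // hypothetical short description
let branchName = "feature/\(ticketID)-\(summary)"  // assumed naming convention

do {
    try shell(["git", "checkout", "-b", branchName])
    try shell(["git", "commit", "--allow-empty", "-m", "\(ticketID): start work"])
    print("Created branch \(branchName) with an initial commit")
} catch {
    print("Automation step failed: \(error)")
}

Run as a script (for example, swift branch.swift), a sketch like this replaces several manual Git steps with a single repeatable command.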
Why Swift for iOS Team Automation?

While various tools and languages can be used for scripting, such as Fastlane, Makefile, Rakefile, and Apple Automator, along with Python, Ruby, and Bash, Swift offers specific advantages in an iOS team setting:

Familiarity: The team is already well-versed in Swift.
Community support: Swift boasts a strong, supportive community.
Open-source resources: Swift has a wealth of open-source projects available.
Experimental potential: Swift allows for creative experimentation, since this kind of automation project is internal and doesn't impact end users directly. For instance, a team not yet using Swift Concurrency could try it within their automation tools, creating a unique environment for learning and testing new technologies.

Conclusion

This article has outlined the potential of automation for modern development. By automating basic tasks with Swift, developers not only cut down on errors and increase efficiency but also free up time for creative ideas and innovation. Swift's powerful syntax, combined with AI and data-driven insights, offers a compelling toolset for streamlining workflows and ensuring long-term success in software development. As development teams continue to adopt Agile methodologies and AI-driven processes, automation with Swift will remain a serious strategy for staying competitive in an ever-changing industry.
The Retrieval-Augmented Generation (RAG) model integrates two robust methodologies: information retrieval and language generation. The model initially gathers pertinent information from an extensive dataset in response to a query, subsequently formulating a reply utilizing the context obtained. This design improves the precision of produced responses by anchoring them in real data, rendering it especially beneficial for intricate information requests across extensive datasets, like lengthy PDF files. This tutorial will walk you through the process of utilizing Python to extract and process text from a PDF document, create embeddings, conduct cosine similarity calculations, and respond to queries derived from the extracted content.

Prerequisites

Ensure you have the following libraries installed in your Python environment:

PyMuPDF (fitz): For extracting text from PDFs.
rake-nltk: For phrase extraction.
openai: To interact with OpenAI's embedding and language models.
pandas: To handle and export data.
numpy and scipy: For numerical operations and cosine similarity calculations.

Step-by-Step Guide

Step 1: Import Libraries and Open the PDF

Import the libraries and open the PDF using this code:

Python

import fitz  # PyMuPDF

# Open the PDF file
pdf_document = "path/to/your/document.pdf"
document = fitz.open(pdf_document)

# Initialize a dictionary to hold the text for each page
pdf_text = {}

# Loop through each page
for page_number in range(document.page_count):
    # Get a page
    page = document.load_page(page_number)
    # Extract text from the page
    text = page.get_text()
    # Store the extracted text in the dictionary
    pdf_text[page_number + 1] = text  # Pages are 1-indexed for readability

# Close the document
document.close()

# Output the dictionary
for page, text in pdf_text.items():
    print(f"Text from page {page}:\n{text}\n")

Step 2: Chunk Text for Embedding

The text needs to be broken down into smaller, manageable chunks. We use RecursiveCharacterTextSplitter to split each page's text into overlapping chunks. Using the RecursiveCharacterTextSplitter to break text into smaller, manageable chunks with overlapping sections is important for several reasons, especially when dealing with natural language processing (NLP) tasks, large documents, or continuous text analysis. Here's why it's beneficial:

1. Improves Context Retention

When text is split into overlapping chunks, each chunk retains some of the previous and following content. This helps preserve context, which is especially crucial for algorithms that rely on surrounding information, like NLP models. Overlapping text ensures that important details spanning across chunk boundaries aren't lost, which is critical for maintaining the coherence of the information.

2. Enhances Accuracy in NLP Tasks

Many NLP models (such as question-answering systems or sentiment analysis models) can perform better when provided with complete context. Overlapping chunks help these models access more relevant information, leading to more accurate and reliable results.

3. Manages Memory and Processing Efficiency

Breaking down large texts into smaller parts helps manage memory usage and processing time, making it feasible to handle extensive documents without overwhelming the system. Smaller chunks allow for parallel processing, improving the efficiency of tasks like keyword extraction, summarization, or entity recognition on large texts.
4. Facilitates Chunked Data Storage and Retrieval

Overlapping chunks can be stored and retrieved more flexibly, making it easier to reconstruct portions of the text for further processing, such as when analyzing text in a sliding window approach for time series data or contextual searches.

5. Supports Recursive Splitting for Optimal Size

RecursiveCharacterTextSplitter can recursively split text until the desired chunk size is achieved, allowing you to tailor chunk sizes according to model input limits or memory constraints while keeping context intact.

Python

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split text into chunks
page_chunks = {}
for page, text in pdf_text.items():
    chunks = text_splitter.split_text(text)
    page_chunks[page] = chunks

# Output chunks for each page
for page, chunks in page_chunks.items():
    print(f"Text chunks from page {page}:")
    for i, chunk in enumerate(chunks, start=1):
        print(f"Chunk {i}:\n{chunk}\n")

Step 3: Extract Key Phrases

To extract meaningful phrases from the text, we use rake-nltk, a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm. RAKE is an algorithm for extracting keywords from text, designed to be fast and efficient. It works by identifying words or phrases that are statistically significant within a document. Here's an overview of how it works:

How RAKE Works

Word Segmentation: It splits the text into individual words and phrases, discarding common stop words (like "and," "the," "is," etc.).
Phrase Construction: RAKE groups together contiguous words that are not stop words to form candidate phrases.
Scoring: Each candidate phrase is given a score based on the frequency of its words and the degree of co-occurrence with other words in the text. This score helps determine the relevance of each phrase as a potential keyword.
Sorting: The phrases are sorted based on their scores, and the highest-scoring phrases are selected as keywords.

Python

from rake_nltk import Rake

rake = Rake()

# Extract phrases from each page and store in a dictionary
page_phrases = {}
for page, text in pdf_text.items():
    rake.extract_keywords_from_text(text)
    phrases = rake.get_ranked_phrases()
    page_phrases[page] = phrases

chunk_phrases = {}

# Extract phrases for each chunk
for page, chunks in page_chunks.items():
    for chunk_number, chunk in enumerate(chunks, start=1):
        rake.extract_keywords_from_text(chunk)
        phrases = rake.get_ranked_phrases()
        chunk_phrases[(page, chunk_number)] = phrases

# Output phrases for each chunk
for (page, chunk_number), phrases in chunk_phrases.items():
    print(f"Key phrases from page {page}, chunk {chunk_number}:\n{phrases}\n")

Step 4: Generate Embeddings

Generate embeddings for each phrase using OpenAI's text-embedding-ada-002 model and save them in Excel format. This model generates numerical representations (embeddings) of text. These embeddings capture the semantic meaning of the text, allowing you to compare and analyze pieces of text based on their content.
Python

import openai
import pandas as pd

# Set your API key
openai.api_key = "YOUR-API-KEY"

# Function to get embeddings for a phrase
def get_embedding(phrase):
    response = openai.Embedding.create(input=phrase, model="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Dictionary to hold embeddings
phrase_embeddings = {}

# Generate embeddings for each phrase
for (page, chunk_number), phrases in chunk_phrases.items():
    embeddings = [get_embedding(phrase) for phrase in phrases]
    phrase_embeddings[(page, chunk_number)] = list(zip(phrases, embeddings))

# Prepare data for Excel
excel_data = []
for (page, chunk_number), phrases in phrase_embeddings.items():
    for phrase, embedding in phrases:
        excel_data.append({
            "Page": page,
            "Chunk": chunk_number,
            "Phrase": phrase,
            "Embedding": embedding
        })

# Create a DataFrame
df = pd.DataFrame(excel_data)

# Save to Excel
excel_filename = "phrases_embeddings.xlsx"
df.to_excel(excel_filename, index=False)

print(f"Embeddings saved to {excel_filename}")

Step 5: Query Processing and Similarity Calculation

Generate embeddings for query phrases and find the most similar chunks using cosine similarity. Cosine similarity is a measure used to determine how similar two vectors are based on the angle between them in a multi-dimensional space. It's commonly used in text analysis and information retrieval to compare text embeddings or document vectors, as it quantifies similarity irrespective of the vectors' magnitude. In the context of text embeddings, cosine similarity helps identify which documents or sentences are closely related based on their meaning, rather than just their content or word count.

Python

def extract_phrases_from_query(query):
    rake.extract_keywords_from_text(query)
    return rake.get_ranked_phrases()

# Example query (this question should be based on your PDF)
query = "What are the results of the 2DRA algorithm?"

# Extract phrases from the query
query_phrases = extract_phrases_from_query(query)

# Output query phrases
print(f"Query phrases:\n{query_phrases}\n")

def get_embeddings(phrases):
    return [openai.Embedding.create(input=phrase, model="text-embedding-ada-002")['data'][0]['embedding'] for phrase in phrases]

# Get embeddings for query phrases
query_embeddings = get_embeddings(query_phrases)

import numpy as np
from scipy.spatial.distance import cosine

# Function to calculate cosine similarity
def cosine_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)

# Dictionary to store similarities
chunk_similarities = {}

# Calculate cosine similarity for each chunk
for (page, chunk_number), phrases in phrase_embeddings.items():
    similarities = []
    for phrase, embedding in phrases:
        phrase_similarities = [cosine_similarity(embedding, query_embedding) for query_embedding in query_embeddings]
        similarities.append(max(phrase_similarities))  # Choose the highest similarity for each phrase
    average_similarity = np.mean(similarities)  # Average similarity for the chunk
    chunk_similarities[(page, chunk_number)] = average_similarity

# Get top 5 chunks by similarity
top_chunks = sorted(chunk_similarities.items(), key=lambda x: x[1], reverse=True)[:5]

# Output top 5 chunks
print("Top 5 most relevant chunks:")
selected_chunks = []
for (page, chunk_number), similarity in top_chunks:
    print(f"Page: {page}, Chunk: {chunk_number}, Similarity: {similarity}")
    print(f"Chunk text:\n{page_chunks[page][chunk_number-1]}\n")
    selected_chunks.append(page_chunks[page][chunk_number-1])

Step 6: Generate and Retrieve Answer Using OpenAI

Compose the context for the query from the most similar chunks and retrieve the answer using OpenAI's GPT model.
Python

context = "\n\n".join(selected_chunks)

prompt = f"Answer the following query based on the provided text:\n\n{context}\n\nQuery: {query}\nAnswer:"

# Use the OpenAI API to get a response
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=300,
    temperature=0.1
)

# Extract the answer from the response
answer = response['choices'][0]['message']['content'].strip()

# Output the answer
print(f"Answer:\n{answer}")

Finally, this is the answer that I received after asking that question:

Answer: The 2DRA model was utilized to perform data recovery on the Virtual Machine (VM) affected by ransomware. It was successful in retrieving all the 14,957 encrypted files. Additionally, an analysis of the encrypted files and their associated hash values on the VM was conducted using the 2DRA model after the execution of WannaCry ransomware. The analysis revealed that the hexadecimal values of the files were distinct prior to encryption, but were altered after the encryption.

This answer is based on the PDF used in Step 1; your output will reflect whatever PDF you supply. This concludes implementing a basic RAG pipeline that reads PDF content, extracts meaningful phrases, generates embeddings, calculates similarities, and answers queries based on the most relevant content.
The Setting In a previous article, we introduced Dust, an open-source Actor system for Java 21 and above. We explained the basic ideas behind it and gave a tiny complete example consisting of two Actors ping-ponging messages between each other. Assuming you have read that article, the time has come to move on to bigger things. So in this article, we will show you how to use Dust Actors to build a small demonstration application that will: Take a topic of interest and a list of names of entities that you are interested in being notified about. For instance: Topic: Electric vehicle chargingEntities: Companies, technologies, locationsAutomatically find valid RSS news feeds that would supply news about that topic.Automatically set up Actors to periodically read those news feeds, confirm they match the topic, extract the main content of the article, and provide a list of matching entities found in the content. We make extensive use of ChatGPT, so to run the code, you’ll have to provide your own key (or see dust-nlp for how to use a local instance of Ollama). For succinctness, we’ll use the Groovy scripting language in this article. Think of it as Java without the ‘;’s. We only show the important snippets of the code; the entire code can be found on GitHub at dust-demos-topics. This application uses code from most of the Dust repos: dust-core, dust-http, dust-html, dust-feeds, and, most importantly, dust-nlp. They can all be found on GitHub - Dust (linked in the introductory paragraph). Pipeline Actors In the previous article, we saw how to create an instance of an Actor class by passing its Props (returned by SomeActor.props()) to a method (actorOf()), which actually created the Actor. A PipelineActor is another way to create instances of Actors, except they become "stages" in a sequential pipe. For instance: Groovy actorOf(PipelineActor.props([ RssLocatorActor.props(topic, chatGPTRef), FeedHubActor.props(), ServiceManagerActor.props( ContentFilterServiceActor.props(topic, chatGPTRef), 4 ), ServiceManagerActor.props( EntitiesExtractionServiceActor.props(entities, chatGPTRef), 4 ), LogActor.props() ], [ 'rss-locator', 'feeds', 'topic-filter', 'entities-extraction', 'logger' ]), 'pipeline' ) This creates just about all the Actors we need to implement the application. Using actorOf(), it is creating an instance of PipelineActor from PipelineActor’s props. These props take two arguments: a list of other Actor Props, and a list of names for the Actors to be created. In this case, we will have five Actors — all children of the PipelineActor (whose name will be pipeline). In order, the stages are: An instance of an RssLocatorActor which is passed with the topic and an ActorRef to an Actor interacting with ChatGPT (more on that later)An instance of a FeedHubActorA ServiceManagerActor managing ContentFilterServiceActorsA ServiceManagerActor managing EntitiesExtractionServiceActors A logging Actor A pipeline works as follows: messages sent to the PipelineActor from stage N of the pipe are sent on to stage N+1. If a message is sent to the pipe from outside of it, it is sent to stage 1, and if a message leaving the pipe has a "return to sender" reference, it will be sent to that Actor; otherwise, it is dropped. In our case, messages are introduced into the pipe from RssLocatorActor (as it is given the topic, you can guess its job). The pipeline starts by using ChatGPT to look for feeds matching the topic (rss-locator). 
It generates messages (just URLs for discovered, validated feeds in this case) and sends them back to the pipe, whence they go to feeds who will set up Actors to continuously monitor those RSS feeds. feeds then sends any content it finds (as messages) back to the pipe where they go to topic-filter. Here they are checked (using ChatGPT) to see that the article really is about the topic and not something just slightly related (e.g., about EVs, but not charging in particular). If it passes this filter, it is sent back to the pipe and picked up by entities-extraction. This Actor recognizes specified named entities which are passed to the final stage, which simply logs them. ServiceManager and Service Actors What is going on with stages 3 and 4 above? This is another design pattern in Dust. A ServiceActor is any Actor designed to accept one message from outside, process it in some fashion, and then die when finished. A ServiceManagerActor manages a fixed-size pool of identical Service Actors (defined by the Props in the first argument). The maximum size of the pool is defined by the second argument. When it receives a message, it waits until it has space in the pool, then it creates a new child (from the supplied Props), and sends it the message as though it came from the original sender. If the child is to reply to the sender (which is usually the case), it uses its parent (the ServiceManager) as its from address. So Actors send messages to the ServiceManager and appear to receive responses back from the ServiceManager — just who processed it is hidden. Service managers act as natural throttlers, and since Service Actors only ever process one external message, their internal state can be very simple. News Reader Application That was a lot to digest, but now that we have covered the ideas involved, we can get down to the details. Usually with a Dust application, we create the ActorSystem and then some top-level Actor under /user. We let that Actor set everything else up and run the application. So our main() method is basically as follows: Groovy // Build an ActorSystem to put things in ActorSystem system = new ActorSystem('news-reader') // Get topic and entities from json file URL resource = system.class.classLoader.getResource("reader.json") Map config = new JsonSlurper().parse(new File(resource.toURI())) as Map // Root does everything else – it is created under the Actor /user by system.context system.context.actorOf( RootActor.props( (String)config.topic, (List<String>)config.entities ), 'root').waitForDeath() We get our topic and entities from the JSON config file, create an instance of a RootActor passing it the defined topic and entities, and then wait for it to die (since everything from this point on is running on separate virtual threads, we’d simply fall off the end of the application if we did not wait).
RootActor As a RootActor’s only real job is to set things up, and since it won’t be receiving any messages, we simply leave it with default Actor Behavior and set things up in its preStart() method: Groovy void preStart() { ActorRef chatGPTRef = actorOf(ServiceManagerActor.props( ChatGptAPIServiceActor.props(null, key), 4 ),'chat-gpt') actorOf(PipelineActor.props([ RssLocatorActor.props(topic, chatGPTRef), FeedHubActor.props(), ServiceManagerActor.props( ContentFilterServiceActor.props(topic, chatGPTRef), 4 ), ServiceManagerActor.props( EntitiesExtractionServiceActor.props(entities, chatGPTRef), 4 ), LogActor.props() ], [ 'rss-locator', 'feeds', 'topic-filter', 'entities-extraction', 'logger' ]), 'pipeline' ) This creates two Actors: a pool of ChatGptAPIServiceActors (see dust-nlp) and our previously seen main pipeline. A reference to the ChatGPT processing Actors (chatGPTRef) is passed into the Props of the pipe stages that will need LLM support. RssLocatorActor The job of this Actor is to find RSS feeds that might have news articles matching the topic, and we take the obvious approach of asking ChatGPT. But, as we all know, ChatGPT sometimes tries too hard to be helpful. In particular, it may give us URLs that don’t exist, or that exist but are not RSS feeds, so we need to do a second verification pass on whatever ChatGPT tells us. RssLocatorActor preStart() sends a StartMsg to itself, which sends an instance of ChatGptRequestResponseMsg to ChatGPT. This asks for a list of feeds matching the topic. ChatGPT sends us this message back with its response and we use a helper method (listFromUtterance) to parse out the list of URLs. We bundle these up into a VerifyFeeds message which we repeatedly send to ourselves to process the URLs one by one. For each one, we try to treat it as a valid feed by fetching it. If we get a response, we try to parse it as a syndication message. If it is, we pass its URL back to our parent (the pipeline). If not, we ignore it. Groovy ActorBehavior createBehavior() { (message) -> { switch(message) { case StartMsg: chatGPTRef.tell(new ChatGptRequestResponseMsg( """Consider the following topic and give me a numerical list consisting *only* of urls for RSS feeds that might contain information about the topic: '$topic'. Try hard to find many real RSS urls. Each entry should consist only of the URL - nothing else. Include no descriptive text.""" ), self) break case ChatGptRequestResponseMsg: ChatGptRequestResponseMsg msg = (ChatGptRequestResponseMsg)message List<String> urls = listFromUtterance(msg.utterance) self.tell(new VerifyFeedsMsg(urls), self) break /* RSSLocatorActor is a subclass of HttpClientActor (dust-http) which * provides a request() method. This does an http GET and sends back the * result in an HttpRequestResponseMsg */ case VerifyFeedsMsg: verifyFeedsMsg = (VerifyFeedsMsg)message if (! verifyFeedsMsg.urls.isEmpty()) { try { request(verifyFeedsMsg.urls.removeFirst()) } catch (Exception e) { log.warn e.message } } break /* * Verify the feed. Does the site exist and is it a feed ?? If we got something * try and parse it as a feed. If this fails then simply warn and move on. */ case HttpRequestResponseMsg: HttpRequestResponseMsg msg = (HttpRequestResponseMsg)message if (null == msg.exception && msg.response.successful) { try { new SyndFeedInput().build( new XmlReader(msg.response.body().byteStream()) ) parent.tell(msg.request.url().toString(), self) } catch (Exception e) { log.warn "URL: ${msg.request.url().toString()} is not an RSS feed!" 
} } else log.warn "URL: ${msg.request.url().toString()} does not exist!" // Check next URL self.tell(verifyFeedsMsg, self) break } } } static class VerifyFeedsMsg implements Serializable { List<String> urls VerifyFeedsMsg(List<String> urls) { this.urls = urls } } FeedHubActor Groovy ActorBehavior createBehavior() { (message) -> { switch(message) { case String: String msg = (String) message log.info "adding RSS feed $msg" actorOf( TransientRssFeedPipeActor.props(msg, 3600*1000L) ).tell(new StartMsg(), self) break // From the TransientRssFeedPipeActor case HtmlDocumentMsg: parent.tell(message, self) break } } } When FeedHubActor receives a String, it knows it is the URL of a valid, topic-related RSS feed. So it creates an instance of TransientRssFeedPipeActor whose Props parameters are the URL and how often (in milliseconds) to visit the feed for new articles (here, every hour). FeedPipeActors (in dust-feeds) need to receive a StartMsg to start up, so we send it one. The TransientRSSFeedPipeActor manages the whole job of parsing the feed content, visiting the linked referenced sites, and sending their content (as HtmlDocumentMsg) to its parent: FeedHubActor. When FeedHubActor receives an HtmlDocumentMsg, it sends it on to its parent: the Pipeline. Transient here refers to the fact that its Actor state is not saved. So if the application is stopped and restarted, it has no idea what it already saw. Repo dust-feeds also contains a persistent version of this Actor for more robust applications. ContentFilterServiceActor The job of this Actor is simply to check to see if the HtmlDocumentMsg refers to the topic in some way. We use an LLM again and check the title of the content. This can actually be quite powerful as with an article titled "Renault and The Mobility House launch V2G project in France." The LLM knew V2G is "Vehicle to Grid" and so the article is related to charging – I didn’t . . . The Actor itself is quite simple: we ask ChatGPT if the document’s title is associated with the topic. If it is we pass the document on down the pipe; if not, we don’t. In either case, we die, as we are a ServiceActor: Groovy ActorBehavior createBehavior() { (message) -> { switch(message) { case HtmlDocumentMsg: originalMsg = (HtmlDocumentMsg)message String request = """Does '${originalMsg.title}' refer to '$topic'. Answer simply yes or no.""" chatGPTRef.tell(new ChatGptRequestResponseMsg(request), self) break case ChatGptRequestResponseMsg: ChatGptRequestResponseMsg msg = (ChatGptRequestResponseMsg)message String response = msg.getUtterance()?.toLowerCase() if (response?.toLowerCase()?.trim()?.startsWith('yes')) { /* * The actual pipe stage is my parent (the service manager) * and my grand parent is the pipe, so send response back to * the pipe as though from my parent */ grandParent.tell(originalMsg, parent) } stopSelf() break } } } EntitiesExtractionServiceActor The last non-trivial Actor in the pipeline (we’ll let you write a LoggingActor yourself!) is a service Actor to extract named entities from the core content of the document. HtmlDocumentMsg (dust-html) has a method, getWholeText(), which analyzes the HTML content of the document, identifies the core content (strips out ads, etc.), and returns the resulting plain text. We then pass this plain text off to ChatGPT asking it to give us a structured list back when it finds entities matching our categories. A helper method (fromEntitiesList()) parses the returned response – which will look something like: Plain Text Companies: 1. Tesla 2.
General Motors Location: 1. Maryland . . . into a list of lists: Plain Text [url-of-source, entity-name, entity-values] We then send this list to our grandparent (the pipe) as though it came from our parent (the service manager) and the pipe passes it down to the logging stage. Groovy ActorBehavior createBehavior() { (message) -> { switch(message) { case HtmlDocumentMsg: originalMsg = (HtmlDocumentMsg)message String mainText = originalMsg.getWholeText() if (mainText) { String text = "${originalMsg.title} --- $mainText" chatGPTRef.tell( new ChatGptRequestResponseMsg( """Following is a list of entity categories: ${entities.join(', ')}. For each category give me a numerical list of mentions in the text. Precede each list with its category followed by ':'. Do not create new categories. Reply in plain text, not markdown. If the entity mentioned is a company use its formal name.\n\n ${text} """ ), self ) } else context.stop(self) break case ChatGptRequestResponseMsg: ChatGptRequestResponseMsg msg = (ChatGptRequestResponseMsg)message fromEntitiesList(msg.getUtterance())?.each { if (it.value != []) { grandParent.tell( [originalMsg.source, it.key, it.value] as Serializable, parent ) } } stopSelf() break } } } How Did It Do? This little app is purely to show how Dust Actors interact cleanly with LLMs and how NLP pipelines can be easily constructed. A real application would do much more; for example: Check for duplicate content. Different RSS feeds often link to the same content (especially in this case where the feeds are all associated with the same topic).Do more with the end result — we simply log the entities. A real application might look for certain entities and trigger further actions on them — e.g., summarizing and presenting the article. This quickly leads down the path to Agentic Dust. Use persistent Actors (see dust-core). That said, how did it do? Our reader.json file contains: JSON { "topic" : "Electric Vehicle Charging", "entities": ["Company", "Technology", "Product", "Location"] } The log shows: RssLocatorActor - URL: https://cleantechnica.com/feed/ does not exist! RssLocatorActor - URL: https://www.greencarreports.com/rss/news does not exist! FeedHubActor - adding RSS feed https://chargedevs.com/feed/ RssLocatorActor - URL: https://www.greencarcongress.com/index.xml exists but is not an RSS feed! FeedHubActor - adding RSS feed https://www.electrive.com/feed/ RssLocatorActor - URL: https://www.autoblog.com/rss.xml does not exist! RssLocatorActor - URL: https://www.plugincars.com/feed exists but is not an RSS feed! 
FeedHubActor - adding RSS feed https://www.teslarati.com/feed/ [https://www.electrive.com/2024/10/25/mercedes-bmw-get-green-light-for-fast-charging-network-in-china/, company, [Mercedes-Benz, BMW, Ionity, General Motors, Honda, Hyundai, Kia, Stellantis, Toyota, PowerX, Ashok Leyland]] [https://www.electrive.com/2024/10/25/mercedes-bmw-get-green-light-for-fast-charging-network-in-china/, technology, [Ionchi, Plug&Charge]] [https://www.electrive.com/2024/10/25/mercedes-bmw-get-green-light-for-fast-charging-network-in-china/, product, [Piaggio EVs, Switch EiV12 electric buses]] [https://www.electrive.com/2024/10/25/mercedes-bmw-get-green-light-for-fast-charging-network-in-china/, location, [China, European Economic Area, Beijing, Qingdao, Nanjing, North America, North Carolina, Mannheim, Sandy Springs, Japan]] [https://www.electrive.com/2024/10/24/tesla-appoints-new-head-of-charging-infrastructure-and-reveals-plans-for-charging-park/, company, [Tesla]] [https://www.electrive.com/2024/10/24/tesla-appoints-new-head-of-charging-infrastructure-and-reveals-plans-for-charging-park/, technology, [Supercharger, Megapack, Powerwall, Solar system, Megapack stationary storage units]] [https://www.electrive.com/2024/10/24/tesla-appoints-new-head-of-charging-infrastructure-and-reveals-plans-for-charging-park/, product, [Supercharger charging stations, Megapack division, Powerwall home power storage system]] [https://www.electrive.com/2024/10/24/tesla-appoints-new-head-of-charging-infrastructure-and-reveals-plans-for-charging-park/, location, [California, San Francisco, Los Angeles, Interstate 5, Lost Hills]] [https://chargedevs.com/features/paired-powers-ev-chargers-let-customers-mix-and-match-solar-storage-and-grid-power/, company, [Paired Power]] [https://chargedevs.com/features/paired-powers-ev-chargers-let-customers-mix-and-match-solar-storage-and-grid-power/, technology, [EV chargers]] [https://chargedevs.com/features/paired-powers-ev-chargers-let-customers-mix-and-match-solar-storage-and-grid-power/, product, [PairTree, PairFleet]] [https://chargedevs.com/features/paired-powers-ev-chargers-let-customers-mix-and-match-solar-storage-and-grid-power/, location, [California]] [https://electrek.co/2024/10/23/lg-dc-fast-charger-us/, company, [LG Business Solutions USA, LG Electronics]] [https://electrek.co/2024/10/23/lg-dc-fast-charger-us/, technology, [CCS/NACS, SAE J1772, UL 2594, USB, Power Bank, Over-the-air software updates]] [https://electrek.co/2024/10/23/lg-dc-fast-charger-us/, product, [LG DC fast charger, LG EVD175SK-PN, Level 2 chargers, Level 3 chargers, Ultra-fast chargers]] [https://electrek.co/2024/10/23/lg-dc-fast-charger-us/, location, [US, Texas, Fort Worth, Nevada, White River Junction]] [https://electrek.co/2024/10/23/tesla-unveils-oasis-supercharger-concept-solar-farm-megapacks/, company, [Tesla]] . . . and a lot more.
Once upon a time, in the world of Machine Learning, data roamed the vast land of algorithms, hoping to be understood. While many algorithms tried their best, something was missing: a spark, a certain... connection. Then, the Transformer algorithm came along and changed everything! This isn’t just another machine learning model: it’s an algorithm that rocked the tech world. Let’s dive into the tale of the Transformer, an algorithm powered by “attention” (yes, that’s the magic word!) that made data feel truly seen for the first time. Meet the Transformer Imagine the Transformer as a super-organized matchmaker. While most models take in data, look it over from start to finish, and try to make sense of it in a linear way, Transformers say, “No way! I want to see all the data at every possible angle and find the connections that matter most.” Transformers are built on attention mechanisms, which let them focus on the most important pieces of information — think of it like highlighting, bolding, and starring the right words in a textbook, only way cooler. And they don’t just glance once and move on. Transformers keep going back, checking, re-checking, and attending to the data until every important part is understood. Attention: The True Hero Attention is the Transformer’s superpower. If you’ve ever been on a video call while half-focused, you know it’s hard to keep track of what’s really going on. But imagine if you could give your undivided attention to multiple things at once — that’s what Transformers do. By focusing on different parts of data simultaneously, they find hidden patterns that other algorithms miss. No more reading data like a book, page by page. Transformers can glance over the whole thing and zero in on the parts that matter the most, no matter where they are. How It Works (Without Frying Your Brain) Here's a fun way to think of it: say you have a bag of M&Ms and want to eat only the red ones. Traditional algorithms might make you pour out the entire bag, sort through them, and separate out the reds (sequentially). But Transformers just scan the bag and pluck out each red one with zero hesitation. They don’t need to line up each M&M in a row — they know where each red one is without breaking a sweat! In Transformer lingo, this is done through self-attention. Transformers can see every word (or piece of data) and understand its role in the overall sentence or structure. So even if a word appears far away in a sentence, the Transformer gets the full context instantly, connecting “apple” to “pie” even if they’re pages apart. Why Attention Is Important: A Fun Comparison Without Attention With Attention (Transformer) Imagine listening to a long story, word by word, from start to finish without interruptions. Picture having the entire story laid out, with key parts highlighted and emphasized. Important connections might get lost or forgotten along the way. Transformers can focus on the most relevant pieces instantly, making connections effortlessly. Processing is slow and can miss context if words are far apart. Every part of the data is seen in context, making understanding faster and more accurate. Encoder-Decoder: A Match Made in Heaven Transformers have two main parts: an encoder and a decoder. Think of the encoder as the translator who understands the data, and the decoder as the one who explains it in the target language. For example, in translation tasks, the encoder reads the input text in English and gets its meaning. 
Then the decoder takes this meaning and produces an output in, say, French. Voilà! encoder decoder Takes the input data and understands it in its original form Translates the encoded meaning into the target output, such as translating from one language to another Identifies important words, phrases, or patterns in the data Uses this "understood" data to form the most accurate output based on context Transformers in Action Transformers are the brains behind today’s language models, chatbots, and language translators. From chatty AI models to autocomplete text suggestions, whenever you see AI really understanding language, you’ve got Transformers to thank. How Transformers Are Used in Real-Life Application What Transformers Do Language Translation Understands the context of each word to ensure accurate translation Chatbots and Virtual Assistants Recognizes the meaning of your questions and responds with contextually appropriate answers Autocomplete Text Predicts your next words based on all words typed so far, not just the last one Sentiment Analysis Understands context to interpret whether reviews are positive or negative, even with complex phrasing Why Transformers Are Here to Stay Transformers are insanely good at multitasking, handling massive amounts of data, and zeroing in on the important parts. They’re so powerful that they’re setting new records in natural language processing and are quickly becoming a standard in many industries. Who wouldn’t want a model that’s this quick, attentive, and capable? Why We Love Transformers What It Means for Us Speed and Accuracy Handles huge amounts of data fast, making applications faster Context Awareness Knows when “apple” is a fruit vs. “Apple” the brand, thanks to understanding context Multitasking Champs Can focus on multiple parts of data simultaneously Wrapping It Up: The Transformer Legacy If you remember one thing about Transformers, it should be this: they’re the ultimate focus masters of data. They see everything, they analyze relationships instantly, and they find meaning in ways other algorithms could only dream of. So next time you’re using an AI-powered tool that understands your sentences or predicts your words, give a little nod to the Transformers — the algorithm that gave data a voice, and attention its due credit. And that, folks, is how Transformers changed the game forever!
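For readers who want to see the core idea in code, here is a tiny, self-contained sketch of scaled dot-product attention, the operation behind the "attention" described above. The vectors are random and purely illustrative; real Transformers add learned projections and many attention heads, but the weighting step is the same idea.

Python

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Scores measure how strongly each query position "attends" to each key position.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Softmax turns the scores into attention weights that sum to 1 per row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The output is a weighted mix of the value vectors.
    return weights @ V

# Three toy "token" vectors of dimension 4
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V))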
Today, we’re diving into Agentic AI with OpenAI’s Swarm framework to create something fun and useful — a travel assistant that can help book your hotel, set up restaurant reservations, and even arrange a rental car, all in one place. What makes it special? We’re setting it up with multi-agent AI, which means we’ll have multiple “mini-assistants,” each taking care of one specific task, working together to make your travel planning smooth. What Are Agents? An Agent is like a digital helper designed to handle a single job. It’s autonomous, meaning once it’s set up, it can make decisions and respond to changes without needing constant instructions from a human. Think of it like this: you have one agent just for handling hotel bookings and another one focused on restaurant reservations. Each is designed to be an expert in its own little area. Why Multiple Agents? Why not just one big assistant? Well, by using multi-agent systems, we can have a collection of these agents working together, each focusing on what it does best. It’s like having a team of assistants — one manages the hotel, one handles the restaurant reservations, and another covers car rentals. This way, each agent specializes in its task, but together, they cover the whole travel process. Step 1: Prerequisites Before we dive in, make sure you have: Basic familiarity with Python (nothing too advanced). Python is installed on your machine and has access to a terminal.Editor of your choice (I used Microsoft Visual Studio Code)OpenAI API Key: Swarm uses OpenAI’s models, so you’ll need an API key. You can get one if you don’t have one by signing up at OpenAI. Make sure you check the API limits page if you have any questions about the API Limits and Permissions. Step 2: Setting Up Your Project Environment First, you’ll need to set up your API key. Here’s how to do it: Go to the OpenAI API key page and create a new key.Copy this key — you’ll need it for both the terminal and our code. To set the key in your terminal, run: export OPENAI_API_KEY="your_openai_api_key_here" Also, don’t forget to replace "your_openai_api_key_here" with your actual API key. You’ll also want to add this key directly in your code, which we’ll show in the next section. Step 3: Setting Up and Structuring the Code Our travel assistant will be built using multiple files, each representing a different agent. Let’s walk through each component step by step, explaining why we’ve made certain decisions and how each part functions. Step 3.1: Setting Up Your Project Environment Before we jump into coding, let's get our environment ready with Swarm. Install Swarm: First, open your terminal and install Swarm with this command: pip install git+https://github.com/openai/swarm.git 2. Organize Our Project: Next, set up a simple project structure: Plain Text SmartTravelConcierge/ ├── main.py # Main script to run our travel assistant ├── agents/ # Folder for all agents │ ├── triage_agent.py # Routes requests to the correct agent │ ├── hotel_agent.py # Handles hotel bookings │ ├── restaurant_agent.py # Manages restaurant reservations │ └── car_rental_agent.py # Takes care of car rentals What Each Agent Does Triage Agent: Listens to user requests, decides which agent (hotel, restaurant, or car rental) should handle it, and hands off the task.Hotel Agent: Asks for hotel details like location, dates, and budget.Restaurant Agent: Gathers reservation info, such as date, time, and party size.Car Rental Agent: Collects car rental details, like pickup location, dates, and car type. 
With everything set up, we’re ready to start coding each agent in its file. Let’s jump in! Step 3.2: Building the Core Script (main.py) main.py is the central script that runs our Smart Travel Concierge. It’s where we initialize the Swarm client, handle user input, and route requests to the right agents. Let’s take a look at the code: Python import openai from swarm import Swarm from agents.triage_agent import triage_agent, user_context # Set your OpenAI API key here openai.api_key = "your_openai_api_key_here" # Replace with your actual API key # Initialize Swarm client client = Swarm() # Define ANSI color codes for a better experience in the terminal COLOR_BLUE = "\033[94m" COLOR_GREEN = "\033[92m" COLOR_RESET = "\033[0m" def handle_request(): print("Welcome to your Smart Travel Concierge!") print("You can ask me to book hotels, reserve restaurants, or arrange car rentals.") while True: user_input = input(f"{COLOR_GREEN}You:{COLOR_RESET} ") if user_input.lower() in ["exit", "quit", "stop"]: print("Goodbye! Safe travels!") break # Route the request to the Triage Agent for delegation response = client.run(agent=triage_agent, messages=[{"role": "user", "content": user_input}]) # Extract the final response content final_content = None for message in response.messages: if message['role'] == 'assistant' and message['content'] is not None: final_content = message['content'] # Print the assistant’s response if available if final_content: print(f"{COLOR_BLUE}Bot:{COLOR_RESET} {final_content}") # Only add additional prompts if the final content doesn't include detailed instructions if ("check-in" not in final_content.lower() and "location" not in final_content.lower() and "budget" not in final_content.lower() and user_context["intent"] == "hotel" and not user_context["details_provided"]): print(f"{COLOR_BLUE}Bot:{COLOR_RESET} Please provide your location, check-in and check-out dates, and budget.") user_context["details_provided"] = True # Avoid further prompts once requested elif ("reservation" not in final_content.lower() and user_context["intent"] == "restaurant" and not user_context["details_provided"]): print(f"{COLOR_BLUE}Bot:{COLOR_RESET} Please provide your reservation date, time, and party size.") user_context["details_provided"] = True elif ("pickup" not in final_content.lower() and user_context["intent"] == "car_rental" and not user_context["details_provided"]): print(f"{COLOR_BLUE}Bot:{COLOR_RESET} Please provide the pickup location, rental dates, and preferred car type.") user_context["details_provided"] = True else: print(f"{COLOR_BLUE}Bot:{COLOR_RESET} (No response content found)") if __name__ == "__main__": handle_request() How It Works Here’s a breakdown of the main script: Set Up OpenAI API: First, we set up our API key so we can interact with OpenAI's models.Initialize the Swarm Client: This connects us to the Swarm framework, allowing us to run agents.User Interaction Loop: The handle_request() function creates a loop for continuous user interaction, where each input is analyzed and routed to the Triage Agent.Response Handling: Based on the Triage Agent’s decision, the relevant agent responds, and we handle any missing information by prompting the user. Step 3.3: Routing Requests to the Right Agent (triage_agent.py) The Triage Agent decides which agent should handle the user’s request — like a concierge dispatcher. 
Let’s see the code: Python from swarm import Agent from agents.hotel_agent import hotel_agent from agents.restaurant_agent import restaurant_agent from agents.car_rental_agent import car_rental_agent # Context dictionary to track user intent and state user_context = {"intent": None, "details_provided": False} def transfer_to_hotel(): user_context["intent"] = "hotel" return hotel_agent def transfer_to_restaurant(): user_context["intent"] = "restaurant" return restaurant_agent def transfer_to_car_rental(): user_context["intent"] = "car_rental" return car_rental_agent # Main triage agent triage_agent = Agent( name="Triage Agent", description=""" You are a triage agent that understands the user’s intent and delegates it to the appropriate service: - Book a hotel - Reserve a restaurant - Rent a car Once you identify the user’s intent, immediately transfer the request to the relevant agent. Track user intent and avoid redundant questions by confirming and passing any provided information. """, functions=[transfer_to_hotel, transfer_to_restaurant, transfer_to_car_rental] ) What This Agent Does Here’s how the Triage Agent makes decisions: Intent Tracking: Tracks user intent, ensuring repeated details aren’t requested.Directing Requests: Routes hotel bookings to the Hotel Agent, restaurant bookings to the Restaurant Agent, and car rentals to the Car Rental Agent.Using user_context: Keeps user data in memory for smooth conversation flow. Step 3.4: Arranging Hotel Bookings (hotel_agent.py) The Hotel Agent handles all hotel-related requests, gathering details like location, dates, and budget. Python from swarm import Agent def book_hotel(location, checkin_date, checkout_date, budget): # A mock response for hotel booking return f"Hotel booked in {location} from {checkin_date} to {checkout_date} within a budget of ${budget} per night." hotel_agent = Agent( name="Hotel Agent", description="Handles hotel bookings, including location, dates, and budget.", functions=[book_hotel] ) How the Hotel Agent Operates Here’s what makes this agent effective: Booking Details: Asks for location, check-in/out dates, and budget to complete hotel bookings.Dedicated Functionality: With book_hotel(), this agent is entirely focused on hotels, making it easy to expand or improve without affecting other agents. Step 3.5: Managing Restaurant Reservations (restaurant_agent.py) The Restaurant Agent handles all restaurant-related tasks, such as date, time, and party size. Python from swarm import Agent def reserve_restaurant(location, date, time, party_size): # A mock response for restaurant reservations return f"Restaurant reservation made in {location} on {date} at {time} for {party_size} people." restaurant_agent = Agent( name="Restaurant Agent", description="Manages restaurant reservations, including date, time, and party size.", functions=[reserve_restaurant] ) What the Restaurant Agent Handles Let’s see how it works: Reservation Details: Collects specifics such as location, date, and time.Independent Operations: By handling only restaurant reservations, it ensures a seamless experience without overlapping tasks with other agents. Step 3.5: Arranging Car Rentals (car_rental_agent.py) Finally, the Car Rental Agent handles car rentals by asking for pickup location, rental dates, and car preferences. Python from swarm import Agent def rent_car(location, start_date, end_date, car_type): # A mock response for car rentals return f"Car rental booked at {location} from {start_date} to {end_date} with a {car_type} car." 
car_rental_agent = Agent( name="Car Rental Agent", description="Arranges car rentals, including pickup location, rental dates, and car type.", functions=[rent_car] ) Inside the Car Rental Agent Here’s how it operates: Rental Details: Manages details like pickup location, dates, and car type.Focused Functionality: With rent_car(), the agent’s focus is entirely on car rentals, keeping it streamlined and easy to modify. Step 4: Running the Program To test our Smart Travel Agent: Navigate to the project folder in your terminal.Run the program: python main.py When prompted, you can enter requests like “I need to book a hotel in Chicago,” and the bot will guide you through providing the necessary details. Interact with the bot as follows: Plain Text You: I need to book a hotel Bot: Great! Could you provide the location, check-in, and check-out dates, and your budget? You: Chicago, check-in 18th Nov, check-out 19th Nov, budget $200 Live Output How Context Is Handled Through the Chat? The user_context dictionary in the triage_agent.py code keeps track of the user’s intent, so agents don’t keep asking for the same details. This context ensures the conversation flows smoothly without repetitive questions. What's Next? Ready to take this tutorial to the next level? Here’s how to make your Smart Travel Agent production-ready: Integrate Real APIs: Connect each agent with real-time APIs for hotels, restaurants, and car rentals. For example: Use Booking.com API for hotels.Integrate OpenTable API for restaurant reservations.Leverage RentalCars API for car rentals.Add Authentication: Protect API calls with secure tokens to prevent unauthorized access.Database Support: Add a database to keep track of previous bookings, user preferences, and chat history.Enhance Context Management: Expand user_context to retain more information across conversations, making interactions more seamless. Conclusion And there you have it! You've built a fully functioning travel concierge using multi-agent systems and OpenAI’s Swarm framework. With each agent handling a specific task, your bot is now smart enough to make travel bookings feel like a breeze.
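As a starting point for the "Enhance Context Management" idea above, here is a hypothetical sketch of how the user_context dictionary from triage_agent.py could grow; the extra keys are illustrative assumptions, not part of the Swarm framework.

Python

# Hypothetical extension of user_context from triage_agent.py
user_context = {
    "intent": None,             # "hotel", "restaurant", or "car_rental"
    "details_provided": False,  # whether we already prompted for missing details
    "history": [],              # prior user/assistant messages for this session
    "preferences": {},          # remembered defaults, e.g. {"budget": 200}
    "bookings": [],             # confirmations returned by the booking functions
}

Each agent (and the prompts in main.py) could then read from and write to these fields to avoid re-asking for information the user has already given.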
Large language models (LLMs) like GPT-3, GPT-4, or Google's BERT have become a big part of how artificial intelligence (AI) understands and processes human language. But behind these models' impressive abilities is a hidden process that's easy to overlook: tokenization. This article will explain what tokenization is, why it's so important, and whether or not it can be avoided. Imagine you're reading a book, but instead of words and sentences, the entire text is just one giant string of letters without spaces or punctuation. It would be hard to make sense of anything! That's what it would be like for a computer to process raw text. To make language understandable to a machine, the text needs to be broken down into smaller, digestible parts — these parts are called tokens. What Is Tokenization? Tokenization is the process of splitting text into smaller chunks that are easier for the model to understand. These chunks can be: Words: The most natural unit of language (e.g., "I", "am", "happy"). Subwords: Smaller units that help when the model doesn't know the whole word (e.g., "run", "ning" in "running"). Characters: In some cases, individual letters or symbols (e.g., "a", "b", "c"). Why Do We Need Tokens? Let's take an example sentence: "The quick brown fox jumps over the lazy dog." A computer sees this sentence as a long sequence of letters: Thequickbrownfoxjumpsoverthelazydog. The computer can't understand this unless we break it down into smaller parts or tokens. Here's what the tokenized version of this sentence might look like: 1. Word-level tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] 2. Subword-level tokenization: ["The", "qu", "ick", "bro", "wn", "fox", "jump", "s", "over", "the", "lazy", "dog"] 3. Character-level tokenization: ["T", "h", "e", "q", "u", "i", "c", "k", "b", "r", "o", "w", "n", "f", "o", "x", "j", "u", "m", "p", "s", "o", "v", "e", "r", "t", "h", "e", "l", "a", "z", "y", "d", "o", "g"] The model then learns from these tokens, understanding patterns and relationships. Without tokens, the machine wouldn't know where one word starts and another ends or what part of a word is important. How Tokenization Works in LLMs Large language models don't "understand" language the way humans do. Instead, they analyze patterns in text data. Tokenization is crucial for this because it helps break the text down into a form that's easy for a model to process. Most LLMs use specific tokenization methods: Byte Pair Encoding (BPE) This method combines characters or subwords into frequently used groups. For example, "running" might be split into "run" and "ning." BPE is useful for capturing subword-level patterns. WordPiece This tokenization method is used by BERT and other models. It works similarly to BPE but builds tokens based on their frequency and meaning in context. SentencePiece This is a more general approach to tokenization that can handle languages without clear word boundaries, like Chinese or Japanese. Why Tokenization Matters in LLMs The way text is broken down can significantly affect how well an LLM performs. Let's dive into some key reasons why tokenization is essential: Efficient Processing Language models need to process massive amounts of text. Tokenization reduces text into manageable pieces, making it easier for the model to handle large datasets without running out of memory or becoming overwhelmed. Handling Unknown Words Sometimes, the model encounters words it hasn't seen before.
If the model only understands entire words and comes across something unusual, like "supercalifragilisticexpialidocious," it might not know what to do. Subword tokenization helps by breaking the word down into smaller parts like "super," "cali," and "frag," making it possible for the model to still understand. Multi-Lingual and Complex Texts Different languages structure words in unique ways. Tokenization helps break down words in languages with different alphabets, like Arabic or Chinese, and even handles complex things like hashtags on social media (#ThrowbackThursday). An Example of How Tokenization Helps Let's look at how tokenization can help a model handle a sentence with a complicated word. Imagine a language model is given this sentence: "Artificial intelligence is transforming industries at an unprecedented rate." Without tokenization, the model might struggle with understanding the entire sentence. However, when tokenized, it looks like this: Tokenized version (subwords): ["Artificial", "intelligence", "is", "transform", "ing", "industr", "ies", "at", "an", "unprecedented", "rate"] Now, even though "transforming" and "industries" might be tricky words, the model breaks them into simpler parts ("transform", "ing", "industr", "ies"). This makes it easier for the model to learn from them. Challenges in Tokenization While tokenization is essential, it's not perfect. There are a few challenges: Languages Without Spaces Some languages, like Chinese or Thai, don't have spaces between words. This makes tokenization difficult because the model has to decide where one word ends and another begins. Ambiguous Words Tokenization can struggle when a word has multiple meanings. For example, the word "lead" could mean a metal or being in charge. The tokenization process can't always determine the correct meaning based on tokens alone. Rare Words LLMs often encounter rare words or invented terms, especially on the internet. If a word isn't in the model's vocabulary, the tokenization process might split it into awkward or unhelpful tokens. Can We Avoid Tokenization? Given its importance, the next question is whether tokenization can be avoided. In theory, it's possible to build models that don't rely on tokenization by directly working at the character level (i.e., treating every single character as a token). But there are drawbacks to this approach: Higher Computational Costs Working with characters requires much more computation. Instead of processing just a few tokens for a sentence, the model would need to process hundreds of characters. This significantly increases the model's memory and processing time. Loss of Meaning Characters don't always hold meaning on their own. For example, the letter "a" in "apple" and "a" in "cat" are the same, but the words have completely different meanings. Without tokens to guide the model, it can be harder for the AI to grasp context. That being said, some experimental models are trying to move away from tokenization. But for now, tokenization remains the most efficient and effective way for LLMs to process language. Conclusion Tokenization might seem like a simple task, but it's fundamental to how large language models understand and process human language. Without it, LLMs would struggle to make sense of text, handle different languages, or process rare words. While some research is looking into alternatives to tokenization, for now, it's an essential part of how LLMs work. 
The next time you use a language model, whether it's answering a question, translating a text, or writing a poem, remember: it's all made possible by tokenization, which breaks down words into parts so that AI can better understand and respond. Key Takeaways Tokenization is the process of breaking text into smaller, more manageable pieces called tokens.Tokens can be words, subwords, or individual characters.Tokenization is crucial for models to efficiently process text, handle unknown words, and work across different languages.While alternatives exist, tokenization remains an essential part of modern LLMs.
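To see these tokenization levels in practice, here is a small illustrative script using the Hugging Face transformers library. The model names are common examples rather than anything prescribed by this article, and the exact tokens you get will vary by tokenizer and version.

Python

# An illustrative comparison of tokenization levels (assumes `pip install transformers`).
from transformers import AutoTokenizer

sentence = "Artificial intelligence is transforming industries at an unprecedented rate."

# GPT-2 uses Byte Pair Encoding (BPE)
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print("BPE:", bpe_tokenizer.tokenize(sentence))

# BERT uses WordPiece
wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print("WordPiece:", wordpiece_tokenizer.tokenize(sentence))

# Character-level "tokenization" is just the raw character sequence
print("Characters:", list(sentence.replace(" ", "")))

Depending on the tokenizer, subword tokens may carry boundary markers (for example, WordPiece prefixes continuation pieces with "##"), which is how the model keeps track of where words start and end.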
This tutorial will walk through the setup of a scalable and efficient MLOps pipeline designed specifically for managing large language models (LLMs) and Retrieval-Augmented Generation (RAG) models. We’ll cover each stage, from data ingestion and model training to deployment, monitoring, and drift detection, giving you the tools to manage large-scale AI applications effectively. Prerequisites Knowledge of Python for scripting and automating pipeline tasks.Experience with Docker and Kubernetes for containerization and orchestration.Access to a cloud platform (like AWS, GCP, or Azure) for scalable deployment.Familiarity with ML frameworks (such as PyTorch and Hugging Face Transformers) for model handling. Tools and Frameworks Docker for containerizationKubernetes or Kubeflow for orchestrationMLflow for model tracking and versioningEvidently AI for model monitoring and drift detectionElasticsearch or Redis for retrieval in RAG Step-by-Step Guide Step 1: Setting Up the Environment and Data Ingestion 1. Create a Docker Image for Your Model Begin by setting up a Docker environment to hold your LLM and RAG model. Use the Hugging Face Transformers library to load your LLM and define any preprocessing steps required for data. Dockerfile FROM python:3.8 WORKDIR /app COPY requirements.txt . RUN pip install -r requirements.txt COPY . . CMD ["python", "app.py"] Tip: Keep dependencies minimal for faster container spin-up. 2. Data Ingestion Pipeline Build a data pipeline that pulls data from your database or storage. If using RAG, connect your data pipeline to a database like Elasticsearch or Redis to handle document retrieval. This pipeline can run as a separate Docker container, reading in real-time data. Python # ingestion_pipeline.py from elasticsearch import Elasticsearch def ingest_data(): es = Elasticsearch() # Add data ingestion logic Step 2: Model Training and Fine-Tuning With MLOps Integration 1. Integrate MLflow for Experiment Tracking MLflow is essential for tracking different model versions and monitoring their performance metrics. Set up an MLflow server to log metrics, configurations, and artifacts. Python import mlflow with mlflow.start_run(): # Log model parameters and metrics mlflow.log_metric("accuracy", accuracy) mlflow.log_artifact("model", "/path/to/model") 2. Fine-Tuning With Transformers Use the Hugging Face Transformers library to fine-tune your LLM or set up RAG by combining it with a retrieval model. Save checkpoints at each stage so MLflow can track the fine-tuning progress. Python from transformers import AutoModelForSeq2SeqLM, AutoTokenizer model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large") tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large") # Fine-tune model Step 3: Deploying Models With Kubernetes 1. Containerize Your Model With Docker Package your fine-tuned model into a Docker container. This is essential for scalable deployments in Kubernetes. 2. Setup Kubernetes and Deploy With Helm Define a Helm chart for managing the Kubernetes deployment. This chart should include resource requests and limits for scalable model inference. YAML # deployment.yaml file apiVersion: apps/v1 kind: Deployment metadata: name: model-deployment spec: replicas: 3 template: spec: containers: - name: model-container image: model_image:latest ports: - containerPort: 5000 3. Configure Horizontal Pod Autoscaler (HPA) Use HPA to scale pods up or down based on traffic load. 
Shell kubectl autoscale deployment model-deployment --cpu-percent=80 --min=2 --max=10 Step 4: Real-Time Monitoring and Drift Detection 1. Set Up Monitoring With Evidently AI Integrate Evidently AI to monitor the performance of your model in production. Configure alerts for drift detection, allowing you to retrain the model if data patterns change. Python # pythonfile import evidently from evidently.model_profile import Profile from evidently.model_profile.sections import DataDriftProfileSection profile = Profile(sections=[DataDriftProfileSection()]) profile.calculate(reference_data, production_data) 2. Enable Logging and Alerting Set up logging through Prometheus and Grafana for detailed metrics tracking. This will help monitor real-time CPU, memory usage, and inference latency. Step 5: Automating Retraining and CI/CD Pipelines 1. Create a CI/CD Pipeline With GitHub Actions Automate the retraining process using GitHub Actions or another CI/CD tool. This pipeline should: Pull the latest data for model retraining.Update the model on the MLflow server.Redeploy the container if performance metrics drop below a threshold. YAML name: CI/CD Pipeline on: [push] jobs: build: runs-on: ubuntu-latest steps: - name: Checkout code uses: actions/checkout@v2 - name: Build Docker image run: docker build -t model_image:latest . 2. Integrate With MLflow for Model Versioning Each retrained model is logged to MLflow with a new version number. If the latest version outperforms the previous model, it is deployed automatically. Step 6: Ensuring Security and Compliance 1. Data Encryption Encrypt sensitive data at rest and in transit. Use tools like HashiCorp Vault to manage secrets securely. 2. Regular Audits and Model Explainability To maintain compliance, set up regular audits and utilize explainability tools (like SHAP) for interpretable insights, ensuring the model meets ethical guidelines. Wrapping Up After following these steps, you’ll have a robust MLOps pipeline capable of managing LLMs, RAG models, and real-time monitoring for scalable production environments. This framework supports automatic retraining, scaling, and real-time responsiveness, which is crucial for modern AI applications.
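Step 5 states that a retrained model is deployed only if it outperforms the previous version, but the comparison itself is not shown. Below is a speculative sketch of such a promotion gate using MLflow's client API; the registered model name and metric key are placeholders you would replace with whatever your pipeline actually logs.

Python

# retrain_gate.py -- hypothetical check run by the CI/CD pipeline before redeploying
from mlflow.tracking import MlflowClient

MODEL_NAME = "rag-llm"    # assumed registered model name
METRIC_KEY = "accuracy"   # assumed evaluation metric logged for each run

def should_promote(client: MlflowClient) -> bool:
    """Compare the two most recent registered versions on a logged metric."""
    versions = sorted(
        client.search_model_versions(f"name='{MODEL_NAME}'"),
        key=lambda v: int(v.version),
    )
    if len(versions) < 2:
        return True  # nothing to compare against, so promote the first version
    prev_metrics = client.get_run(versions[-2].run_id).data.metrics
    new_metrics = client.get_run(versions[-1].run_id).data.metrics
    return new_metrics.get(METRIC_KEY, 0.0) >= prev_metrics.get(METRIC_KEY, 0.0)

if __name__ == "__main__":
    client = MlflowClient()
    if should_promote(client):
        print("New version outperforms the previous one -- trigger redeployment.")
    else:
        print("Keep the currently deployed version.")

A CI/CD job (such as the GitHub Actions workflow above) could run this script after training and use its outcome to decide whether to roll out the new container.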
Data is a critical component of all aspects of the world in 2024. It is more valuable than most commodities, and there is an exponentially increasing need to more safely and accurately share, use, store, and organize this data. Data architecture is just that: the rules and guidelines that users must follow when storing and using data. There is significant benefit to housing and conglomerating this data management into a single unified platform, but there are also emerging challenges such as data complexities and security considerations that will make this streamlining ever more complicated. The popularity of generative AI (commonly known as GenAI) that is steamrolling the technology industry will mean that data architecture will be completely changed in this revolutionary, modern era. Unsurprisingly, since this modernization is taking the world by storm in a very quick and competitive fashion, there are mounting pressures to adopt it quickly. While there are projections that 80% of enterprises will incorporate GenAI APIs or GenAI-enabled applications, less than 25% of banking institutions have implemented their critical data into the target architecture; this is only one industry. There is a need to move away from data silos and onto the newer and modern data fabric and data mesh. Data Silos Are Old News — It's About Data Fabrics and Data Meshes In the automotive industry, among others, there has been a recognized need to move away from outdated data silos. With data silos, information is inaccessible. It is gridlocked for one organization only. This hinders any communication or development and pigeonholes data into a single use without considering the transformation and evolution that can occur if it is viewed as a shared asset. A data fabric is an approach to unite data management. As mentioned, data is often gridlocked away, and data fabrics aim to unlock it at the macro level and make it available to multiple entities for numerous, differentiated purposes. A data mesh separates data into products and delivers them to all parties in a decentralized way, each with its own governance. This transition to modern data architecture is also altered by the adoption of artificial intelligence (AI). AI can help to locate sophisticated patterns, generate predictions, and even automate many processes. This can improve accuracy and largely benefit scalability and flexibility. However, there are also challenges of data quality, transparency, ethical and legal factors, and integration hiccups. This leads to many strategies and insights that can help to guide and smooth out the progression from traditional to modern data architecture. Key Strategies Build a Minimum Viable Product First Results in data architecture initiatives come much faster if you start with the minimum needed and build from there for your data storage. Begin by considering all use cases and finding the one component needed to develop so a data product can be delivered. Expansion can happen over time with use and feedback, which will actually create a more tailored and desirable product. Educate, Educate, Educate Educate your key personnel on the importance of being ready to shift from familiar legacy data systems to modern architectures like data lakehouses or hybrid cloud platforms.
Migration to a unified, hybrid, or cloud-based data management system may seem challenging initially, but it is essential for enabling comprehensive data lifecycle management and AI-readiness. By investing in continuous education and training, organizations can enhance data literacy, simplify processes, and improve long-term data governance, positioning themselves for scalable and secure analytics practices. Anticipate the Challenges of AI By preparing for the typical challenges of AI, teams can predict and anticipate problems, which helps reduce downtime and frustration in the modernization of data architecture. Some of the primary ones are: data quality, data volume, data privacy, and bias and fairness. Data cleaning, profiling, and labeling, bias mitigation, validation and testing, monitoring, edge computing, multimodal learning, federated learning, anomaly detection, and data protection regulations can all help minimize the obstacles caused by AI. Key Insights Unifying Data Is Beneficial for Competition There is near-unanimous agreement that unifying data is useful for businesses. It helps with simplifying processes, gaining flexibility, enhancing data governance and security, enabling easier integration with new tools and models for AI, and improving scalability. Data fabric brings value for business and can increase competitive advantage by understanding the five competitive forces: new entrants, supplier bargaining, buyer bargaining, competitor rivalries, and substitute product/service threats. Data Is a Product There is a view that data should be domain-driven, viewed and handled as an asset, self-served on a platform, and governed in a federated, computational way. This is achieved through separation of data by domain and type; the incorporation of metadata so data can exist and be explained in its own, isolated format; the ability to search and locate data independently; and a supportive and organized housing structure. Handling Multiple Sources of Data Is Challenging It is critical to remember that combining data from numerous sources is difficult. Real-time capabilities for some processes like fraud detection, online shopping, and healthcare are simply not ready yet. Standards and policies need to be adopted. There will be inevitable trouble with managing all clouds and data sources, potential security breaches and governance struggles, and the necessity for continuous development and customization. Modern Data Architecture Will Forge Ahead With the Advent of AI Despite the difficulties and complexities of updating existing and traditional data architecture methods, there is no doubt that modern data architecture will also include AI. AI will continue to grow and help organizations use data in a prescriptive way, instead of a descriptive way. Although many people are wary of AI, there is still the overwhelming hope and vision that it will create opportunity, maximize output, and power innovation in all markets, including data structure and management. Those who embrace AI and modern data architecture will see the benefits of greater productivity and operational efficiency, enhanced customer experience, and risk management.
When it's not compared to a magical or evil entity, artificial intelligence (AI) is often reduced to a single term: software. This simplification might obscure the complexity and the rich structured interplay of elements that build what really are AI systems. Even though I'd rather hear AI described as software than listen to stories about its consciousness or free will, let's discuss why AI is far more than just a piece of code. Defining AI At its heart, AI is the creation of systems that can simulate human reasoning, allowing machines to analyze, deduce, and decide based on programmed logic and learned knowledge. The famous 1955 Dartmouth proposal by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon provides this guiding idea: "...the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. …find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves." What Is a "System"? When discussing AI as a system, we refer to a structured network that goes beyond traditional software functions: System as a Computational Framework: A structured collection of components that work together to simulate reasoning and decision-making. System as a Learning Entity: AI systems include data structures and algorithms that allow them to adapt based on new information, continuously refining their actions. System as an Automated Problem Solver: These systems perform complex problem-solving functions more or less autonomously, organizing information to make logical decisions. System as an Abstraction Mechanism: AI systems form abstractions, recognize patterns, and interpret data in ways that tend to mimic cognitive processes. System as a Network of Interacting Components: AI includes interconnected elements like neural networks, algorithms, and feedback loops that simulate cognitive tasks like learning and language processing. Based on these five aspects, we can define an AI system as an adaptive, interconnected network of components capable of learning, interacting with the real world, and solving complex problems autonomously. Components of an AI System An AI system is made up of several key parts. Here are some examples: Algorithms: Step-by-step methods that enable decision-making and problem-solving. Data Structures: Formats that store and organize data for easy access and modification. Neural Networks: Layers of interconnected nodes, either in software or hardware, that enable pattern recognition and learning. Memory Capabilities: Components and structures that store and manage information, enabling AI to retain and leverage past learning experiences to make more informed and adaptive future decisions. Sensors: Input devices that gather real-world data, such as cameras, microphones, or temperature and motion sensors, allowing the AI to respond to its environment. Logic Gates: Basic hardware elements that perform conditional operations and control information flow. Feedback Loops: Mechanisms for self-evaluation, enabling the system to refine its actions. Training Datasets: Collections of labeled data used to train machine learning models for pattern recognition. Inference Engine: A logic-driven mechanism that transforms raw data into actionable insights, supporting complex AI problem-solving. User Interface (UI): The interface through which users interact with the AI to enter instructions and receive outputs. See?
When we talk about AI as a system, we're really describing a whole network of interconnected components that go far beyond basic software. Think of it as a structured framework that puts together many heterogeneous software and hardware parts that all work and interact in a common shared infrastructure. Most importantly, it doesn't just follow instructions; it learns and adapts to new information, solves problems on its own, and even starts to form patterns and interpretations. Why Is AI More Than Software? Software is "simply" a collection of instructions and data that instructs a computer on performing specific tasks. While software is definitely a part of AI, it's limited in scope. It typically executes isolated tasks and follows fixed instructions without adaptation. Look at what AI can do. They run complex, cross-functional challenges far beyond the reach of conventional software, like interpreting medical images to diagnose diseases, processing natural language to understand human intent, optimizing large-scale supply chains, and autonomously navigating vehicles in unpredictable environments. When defining AI as a "system," we want to refer to a dynamic network of interconnected components working together rather than a static piece of software. Unlike traditional software, an AI system adapts and learns continuously. It evolves through feedback and new data to refine and improve its processes. Real-world interaction is essential to these systems, which often include sensors and interfaces to process environmental data, which is absent in most software. AI systems are built with self-improvement mechanisms, such as feedback loops and inference engines. These mechanisms enable them to adjust and enhance their decision-making abilities independently. Conclusion Today, people appear fascinated by the concept of "intelligent machines." Many envision AI as a miraculous force capable of solving humanity's greatest challenges or an imminent threat with potentially catastrophic consequences. These opposite views often blur the reality of what AI systems are actually designed to do. Then, it becomes tempting to reduce AI to just a bunch of software to bring expectations back down to earth. If I had to pick one word to explain or define what an "AI System" is, I would choose "nexus." From my point of view, "nexus" captures the idea of AI as a complex, interconnected network that is more than just isolated "software." A "nexus" suggests a focal point or hub where different elements come together and interact in meaningful ways. In the case of AI, this includes not only the algorithms, data structures, and hardware but also the indispensable role of humans within the ecosystem. We bring qualities that only humans possess, like moral judgment, ethical reasoning, common sense, empathy, and cultural awareness. One More Thing From my perspective, these uniquely human traits should always guide AI development and usage. They can ensure that AI developments and usages align with societal values, respond appropriately in complex situations, and respect the nuances of human experience (also, these concepts might differ depending on culture). Through these contributions, humans are in a position to act as both architects and guardians within the AI nexus. They are responsible for nurturing systems that not only perform but do so responsibly and ethically.
Carefully reviewing the code line by line and trying to grasp the complex logic behind the algorithm can be a tedious task for developers, especially when working with large and intricate codebases. This approach can be time-consuming and overwhelming, as large codebases make identifying all potential test scenarios difficult. Fortunately, code graph tools can automate this process and provide a visual representation of the code through graphs, simplifying the task and enhancing overall efficiency. This article will explore the concept of code graphs, how they enhance code analysis, simplify debugging, and facilitate impact analysis, and how some tools can make all of these tasks easier. We will also discuss the challenges in current solutions for code analysis and the advantages of using knowledge graphs over vector databases for code analysis. What Is a Code Graph? A code graph visually represents the structural relationships within a codebase. It maps functions, classes, and variables as nodes and their relationships (such as function calls, class inheritances, and variable dependencies) as edges. This structured representation enhances code analysis by making complex codebases easier to understand and navigate. Code graphs can act as a roadmap, giving you a clear view of how the different parts of your code fit together. To help bring this concept to life, some tools can make it easier to visualize and navigate your code. One example is Code Graph, a visualization tool in Visual Studio (2012-2017) that uses code graphs to allow users to explore code more conveniently. Representing code as a graph has been heavily used in compilers and IDEs for various tasks, and presenting the graph structure of code to graph ML algorithms has produced state-of-the-art (SOTA) results. Functions, classes, and variables can be nodes in a codebase. Edges can represent function calls, variable usage, or class inheritance. For instance, a node representing a function might have edges pointing to nodes representing the variables it uses and the functions it calls.
A code graph node linked to two functions
Code graph representation allows for a detailed analysis of the code's structure and behavior, facilitating tasks like code navigation, impact analysis, and debugging. By representing code as a graph, we capture intricate details about how different parts of the code interact, making it easier to analyze and understand complex codebases. How is it done? The code is divided into the following elements: Definitions: Where things (like functions, classes, and variables) are defined. References: Where those things are used or called. Symbols: Names given to elements in your code (like function names and class names). Doc Comments: Comments that explain the code, usually written in a specific format. Further down, we will see examples of how the graph is generated for the given code. How Code Graphs Enhance Code Analysis Code graphs provide several benefits for code analysis: Dependency Visualization With a code graph, developers or testers can visualize dependencies between different parts of the code. It becomes easy to see how functions, classes, and modules depend on each other. Imagine a large codebase with a function calculate_volume, which calls a calculate_area function and depends on helper functions to get length and width. A code graph would illustrate these dependencies clearly, allowing you to quickly identify potential issues or areas for optimization.
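To make the calculate_volume example concrete, here is a small sketch of that hypothetical code together with its dependencies modeled as a directed graph. It uses the networkx library purely as a convenient way to represent nodes and edges; it is not part of any particular code graph tool discussed in this article.

Python

# A toy codebase: calculate_volume calls calculate_area, which depends on
# get_length and get_width (all names are illustrative).
import networkx as nx

def get_length():
    return 4.0

def get_width():
    return 2.5

def calculate_area():
    return get_length() * get_width()

def calculate_volume(height):
    return calculate_area() * height

# The same relationships as a code graph: functions are nodes, "calls" are edges
code_graph = nx.DiGraph()
code_graph.add_edge("calculate_volume", "calculate_area", relation="calls")
code_graph.add_edge("calculate_area", "get_length", relation="calls")
code_graph.add_edge("calculate_area", "get_width", relation="calls")

# Impact analysis: which functions are affected if get_width changes?
print(nx.ancestors(code_graph, "get_width"))           # {'calculate_area', 'calculate_volume'}
# Dependency lookup: everything calculate_volume ultimately relies on
print(nx.descendants(code_graph, "calculate_volume"))  # {'calculate_area', 'get_length', 'get_width'}

Walking the edges in one direction answers "what does this function depend on?", while walking them in reverse answers "what breaks if I change this?", which is exactly the impact analysis and debugging support described next.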
Simplified Debugging Code graphs simplify debugging by showing how functions and classes interact. Let's say a developer is debugging an issue with the calculate_volume function. By looking at the code graph, they can quickly see that the issue might be caused by a problem in the calculate_area function, which is called by calculate_volume. The developer can then focus their debugging efforts on calculate_area and its dependencies, get_length and get_width. Impact Analysis Developers can quickly assess the impact of changes in one part of the code on other parts. This is because they can check which functions or classes depend on the code they will modify. Accordingly, they can make informed decisions. Improved Code Quality Identifying and understanding code relationships helps maintain and improve code quality, but how? Developers can spot code duplication, which can then be refactored to improve the codebase. Challenges in RAG Solutions for Code Analysis Large Codebase Due to the large amount of code, Retrieval-Augmented Generation (RAG) models have difficulty retrieving relevant code snippets. When processing a vast software system, a RAG model might retrieve a thousand code snippets, and picking the best one would mean reading through hundreds of similar-looking snippets. Code Redundancy RAG models might produce redundant code, leading to duplicated code and a possible loss of efficiency. For example, a RAG model asked to generate code may return multiple near-identical solutions to the same task, making it hard to compare them and find the best one. Advantages of Using Knowledge Graphs Over Vector Databases for Code Analysis Knowledge graphs offer several advantages over vector databases for code analysis. Let's understand this with an example. Suppose a developer gave this prompt. Prompt: Search code regarding updateInventory(). See what results the knowledge graph and vector database will provide below. Knowledge Graph The query returns a detailed graph highlighting every method, class, and service that directly or indirectly calls updateInventory(). Thus, the knowledge graph will check all the related functions, classes, and services and their relationship with updateInventory() before giving the results to the query, as shown below. OrderService: updateInventory() is called to update stock levels after a purchase. ReturnService: The function is used to restock items when returns are processed. AuditService: It logs inventory changes for auditing purposes. ExternalAPI: The function interacts with an external API to synchronize inventory data. PerformanceMetrics: The graph includes performance data showing that updateInventory() has bottlenecks during peak times. This ensures that the returned results are accurate and reliable, as all the components related to updateInventory() and their relationship with it are considered. This helps the code graph represent accurate code visualizations. Vector Database Vector databases are useful for finding similar code snippets but cannot effectively represent detailed, contextual relationships. The search returns functions that are structurally and content-wise similar to updateInventory. Why? Vector databases provide results based on similarity search, typically using cosine similarity or Euclidean distance.
Plain Text [FunctionX] --similar_to--> [updateInventory] [FunctionY] --similar_to--> [updateInventory] [FunctionZ] --similar_to--> [updateInventory] Visualizing Your Code With a Code Graph Example 1 One example demonstrates basic function definitions and calls in Python. It shows simple arithmetic operations like multiplying, adding, and printing the results. Example 2 Another example demonstrates a simple recursive function for calculating a number's factorial and how to call it within a main function. There are many code graph tools available online where you could simply paste the entire code. Another alternative is to make graphs manually using Lucidchart. Understanding the Code Graph Workflow Let’s understand it with an example. Imagine a Python project with several files, including math_utils.py containing a function calculate_area() and shapes.py with a class Circle. The indexing step would extract the function and class definitions and their relationships, such as that Circle uses calculate_area(). The workflow of Code Graph typically involves: Step 1: Indexing In this step, the source code files parse the codebase, extracting relevant information such as functions, classes, variables, and their relationships. Step 2: Building the Code Graph The code graph for our example would contain nodes for calculate_area() and Circle, with an edge connecting Circle to calculate_area(), indicating that Circle uses the calculate_area() function. Step 3: Querying the Code Graph The User can query the code graph to find all functions used by the Circle class. The query would return a list of functions by checking the nodes and entities connected with them. This can be done using graph query languages like Cypher or Gremlin. Step 4: Visualization and Exploration The visualization might show a node for Circle with an edge pointing to calculate_area(), indicating the dependency. This visualization helps developers quickly identify the relationships between code entities. Step 5: Analysis and Insight By analyzing the code graph, we might discover that the Circle class is tightly coupled to the calculate_area() function, which could lead to maintenance issues. We could also identify that the calculate_area() function is duplicated in another part of the codebase. Interacting With OpenAI for Transforming Queries Sometimes, you may also interact with query transformation with the OpenAI Codex model, which can be fine-tuned for several code transformation tasks, such as refactoring the existing code using OpenAI code sampling and transforming a table using SQL Codex Art. For example, given a dataset in a CSV file, write an SQL query to extract some information from the dataset. Autocomplete: OpenAI's model can complete an incomplete code using machine learning, reducing developers' time.Code conversion: The model can translate code from one programming language to another, which makes it straightforward to relocate projects between languages.CodeOpt: OpenAI open-sourced their model for code optimization, thus helping to enhance the code's performance. Overall, it saves a lot of computational resources in return for better efficiency.Code explanation: It helps the model convert obscure code snippets into simpler words, which makes it easy for developers to comprehend and learn code from each other. Detailed Knowledge Graph Schema A knowledge graph schema is an understanding of the nature of where the data lies. 
It defines the entities, their attributes, and the relationships among them for everything present inside the knowledge graph. It offers a standardized way of organizing and connecting data, allowing machines to interpret the significance and relationships of this information. Let's understand this with a hypothetical knowledge graph about movies: Entities 1. Movie: Represents a movie entity. Properties: Title (string), Release Date (date), Director (person), Genre (string), Rating (float), Box Office Collection (float), Synopsis (text) 2. Person: Represents a person involved in the movie industry. Properties: Name (string), Date of Birth (date), Place of Birth (string), Biography (text), Image (URL) 3. Genre: Represents a genre of movies. Properties: Name (string), Description (text) 4. Studio: Represents a movie production studio. Properties: Name (string), Headquarters (string), Founded (date), Description (text), Image (URL) 5. Award: Represents an award given for movies. Properties: Name (string), Category (string), Year (date), Recipient (person or movie) Building the Code Graph First, clone the FalkorDB Code Graph repository. Plain Text git clone https://github.com/FalkorDB/code-graph.git Run FalkorDB. Plain Text docker run -p 6379:6379 -it --rm falkordb/falkordb Set your OpenAI API key as an environment variable. You will need it to generate Cypher queries against the knowledge graph and to answer RAG questions about the code graph. Plain Text export OPENAI_API_KEY=YOUR_OPENAI_API_KEY Launch the FalkorDB Code Graph tool. Plain Text npm run dev This will launch a server at http://localhost:3000/. You can enter the GitHub URL of any repository, and it will generate the code graph for you. You can also ask questions about the code graph in the side panel, and it will reply in natural language. This feature is handy when navigating a programming framework's complex and vast codebase. Future Work There is significant potential for improving code graphs, particularly in enhancing their integration with various development tools and platforms. One key aspect involves ensuring real-time updates to keep the code graph synchronized with changes in the codebase. Another crucial area for development is expanding the range of supported programming languages, enabling code graphs to be more versatile and applicable across different development environments. Additionally, leveraging machine learning for predictive analysis and code recommendations holds immense potential for further improving the utility and effectiveness of code graphs. These advancements can give developers a more comprehensive understanding of their codebases, enabling them to conduct more thorough code analysis and ultimately enhance overall code quality.
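As a rough illustration of the querying step in the workflow above, here is a speculative Python snippet that runs a Cypher query against a FalkorDB-backed code graph. The graph name, node labels, and relationship types are assumptions made for illustration; the schema produced by the actual code-graph tool may differ.

Python

# Speculative example of querying a code graph stored in FalkorDB.
# Assumes FalkorDB is running locally (docker run -p 6379:6379 falkordb/falkordb)
# and the Python client is installed (pip install falkordb).
from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
graph = db.select_graph("code_graph")  # hypothetical graph name

# Hypothetical schema: (:Class)-[:CALLS]->(:Function)
query = """
MATCH (c:Class {name: 'Circle'})-[:CALLS]->(f:Function)
RETURN f.name
"""

result = graph.query(query)
for row in result.result_set:
    print(row[0])  # e.g., 'calculate_area'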
Tuhin Chattopadhyay, CEO at Tuhin AI Advisory and Professor of Practice, JAGSoM
Yifei Wang, Senior Machine Learning Engineer, Meta
Austin Gil, Developer Advocate, Akamai
Tim Spann, Principal Developer Advocate and Field Engineer, Data In Motion