Extracting Insights from Articles using Groq and Llama 3.1
Jim Cramer recently posted on X about AI use cases. He said:
"If i had to put a percentage on it, i would say that 75% of the "use cases" I hear for artificial intelligence are just bogus. The most authentic is Meta and then Amazon..."
I am not a betting man, but I do believe that today I will show you an AI use case that is not bogus.
Groq is a cutting-edge "Fast AI Inference" cloud provider that leverages their unique LPUs instead of traditional GPUs. This innovative approach allows Groq to deliver unparalleled speed and efficiency in running AI models, making it a powerful choice for AI applications that require low latency.
Llama 3.1 is Meta's latest AI model, released earlier this month, and is already making waves in the AI community. I'm using the 70b versatile version of Llama 3.1 on Groq because it offers higher token limits, which are essential for processing and extracting insights from lengthy articles.
In this article, I'll walk you through how I combine Groq's fast AI inference capabilities with Llama 3.1 to efficiently extract meaningful insights from articles, showcasing a practical AI use case that's anything but bogus.
The logic
- Find the URL of the article you want to extract insights from (e.g. https://groff.dev/blog/ingesting-pdfs-with-gpt-vision).
- Get the article's HTML content.
- Convert the HTML content to Markdown.
  - Python example: the html2text library
  - JavaScript example: the turndown library
- Use Groq and Llama 3.1 to "clean up" the Markdown content.
  - Removing unnecessary elements (e.g. ads, navigation, etc.)
  - Removing images, styles, JavaScript, etc.
  - Asking for just the author, date, and content of the article.
- Determine the type of article (politics, sports, tech, etc.) using JSON mode with Groq and Llama 3.1, since the insights you want to extract may differ based on the type of article.
- Based on the type of article, extract insights using a different System Message to Groq and Llama 3.1, explaining what fields you want to extract.
  - For a development blog article, you may want to extract code snippets, libraries, and tools mentioned.
  - For a sports article, you may want to extract player names, teams, and scores.
  - For a politics article, you may want to extract politician names, parties, and bills.
The why
Why would you want to extract insights from articles? What can you do with them?
One answer: build a Knowledge Graph to store, visualize, and query relationships between entities. Examples of what you can do with a Knowledge Graph:
- You could build a graph of all the politicians mentioned in articles and the bills they have sponsored.
- You could build a graph of all the tools and libraries mentioned in development blog articles.
- You could build a graph of all the players mentioned in sports articles and the teams they play for.
But why build a Knowledge Graph? Because you can then query it to answer questions like:
- "What are the most popular tools and libraries mentioned in development blog articles?"
- "Which politicians have taken donations from the same companies?"
- "Which players have played for the most teams?"
If you don't want to just take my word for it: there was a great talk by Dhagash Mehta from BlackRock's Applied AI Research at NVIDIA GTC about how they scrape financial documents to build their Knowledge Graph. You'll need to make a free NVIDIA developer account to watch it, but it's worth the hassle to get access to these GTC talks.
The code
I'm going to include some code snippets here to show you how you can use Groq and Llama 3.1 to extract insights from articles based on the logic I outlined above. This is an example in Python, but there's no reason why you couldn't apply this to any programming language.
Step 1: Install necessary packages
!pip install html2text==2024.2.26 groq==0.9.0
Step 2: Import required libraries
import json
import html2text
import requests
from groq import Groq
I am using the official groq Python SDK to call Groq, but you can use the OpenAI package thanks to Groq's OpenAI compatibility.
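For example, here's a minimal sketch of calling Groq through the OpenAI package instead. The base URL below is Groq's OpenAI-compatible endpoint; double-check it against Groq's current documentation.

# A minimal sketch of using the OpenAI SDK against Groq's OpenAI-compatible API.
# Verify the base URL against Groq's current docs before relying on it.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

completion = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(completion.choices[0].message.content)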
Step 3: Define the URL variable
url = "https://groff.dev/blog/ingesting-pdfs-with-gpt-vision"
You would replace this with the URL of the article you want to extract insights from.
Step 4: Get the article's HTML content
response = requests.get(url)
response.raise_for_status()
html_content = response.text
Step 5: Convert the HTML content to Markdown
html_converter = html2text.HTML2Text()
html_converter.ignore_images = True
html_converter.ignore_links = True
markdown_content = html_converter.handle(html_content)
At this stage we have a Markdown version of the article, but also everything else on the webpage, including navigation elements, ads, etc.
Step 6: Use Groq and Llama 3.1 to "clean up" the Markdown content
client = Groq()
completion = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": "You are an article extraction expert. You take in the markdown content of a website and return just the content the user asks for. Do not preface it with 'Here is the article' or append 'Here you go' or 'How can I help you?' No intros or follow-ups just what is asked for."
        },
        {
            "role": "user",
            "content": "Give the user just the article content in markdown format, including only the date, the author, and the article. Do not include links that are outside of the article such as navigation links. Do not include images. Do not include JavaScript tags.\n\n---\n\n" + markdown_content
        }
    ],
    temperature=0,
    max_tokens=8000,
    top_p=1,
    stream=False,
    stop=None,
)
refined_article = completion.choices[0].message.content
We now have a "cleaned up" version of the article that only includes the author, date, and content of the article. This step may not be absolutely necessary, but it can help to remove unnecessary elements from the article which could confuse Llama 3.1 and add extract tokens we don't care about. I say better to be safe than sorry, especially when working with models that don't have the best in class reasonsing capabilities.
Step 7: Determine the type of article using JSON mode with Groq and Llama 3.1
article_type_recognition_system_message = """
# Your Purpose
1. You will determine the type of the article content provided by the user.
2. You will not try to generate or summarize the article content.
3. You will only determine the article type based on the content provided.
# Article Type Definitions
- "SPORTS": Indicates that the article is about sports, athletes, or sporting events.
- "POLITICS": Indicates that the article is about politics, politicians, or political events.
- "DEVELOPMENT": Indicates that the article is about software development or programming.
- "NOT_APPLICABLE": Indicates that the article type is not one of the above.
# Valid Article Types
- "SPORTS"
- "POLITICS"
- "DEVELOPMENT"
- "NOT_APPLICABLE"
# Output Format - JSON
You will respond in a valid JSON format with the article type and the reasoning behind choosing that article type over the others.
{
    "reasoning": "<The reasoning behind selecting this article type>",
    "article_type_recognized": "<article_type_recognized>"
}
"""
completion = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": article_type_recognition_system_message
        },
        {
            "role": "user",
            "content": refined_article
        }
    ],
    response_format={"type": "json_object"},
    temperature=0,
    max_tokens=8000,
    top_p=1,
    stream=False,
    stop=None,
)
article_type_result = json.loads(completion.choices[0].message.content)
article_type = article_type_result['article_type_recognized']
I use this System Message pattern whenever I prompt a model to determine the type of content. It's a simple way to get a model to classify content based on what it's given. You can further improve the prompt by providing examples of text that would classify an article as a certain type. I find that including a NOT_APPLICABLE case, along with repeating the only valid options it can choose from, helps the model avoid hallucinating its own invalid article types. Even more important, in my experience, is making the model output its reasoning before it chooses an option. This is similar to the popular "think step by step" prompt engineering technique, and it either improves results or, at the very least, lets you see when the model's reasoning is nonsensical.
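Even with those guardrails, I'd still validate the model's answer in code before branching on it. A minimal sketch (falling back to NOT_APPLICABLE is my choice here, not something the model guarantees):

# Guard against the model returning a type outside the allowed set.
VALID_ARTICLE_TYPES = {"SPORTS", "POLITICS", "DEVELOPMENT", "NOT_APPLICABLE"}

if article_type not in VALID_ARTICLE_TYPES:
    print(f"Unexpected article type '{article_type}', falling back to NOT_APPLICABLE")
    print("Model reasoning:", article_type_result.get("reasoning", "<missing>"))
    article_type = "NOT_APPLICABLE"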
Step 8: Based on the type of article, extract insights using a different System Message to Groq and Llama 3.1
match article_type:
    case "SPORTS":
        extraction_system_message = """
# Your Purpose
1. You will extract insights related to sports from the article content provided by the user.
2. You will not generate or summarize the article content.
3. You will only extract insights related to sports.
# Output Format - JSON
You will respond in a valid JSON format with the extracted insights related to sports.
{
    "sport_being_played": "<sport_being_played>",
    "teams_playing": ["<team_1>", "<team_2>"],
    "scores": {
        "<team_1>": <score_team_1>,
        "<team_2>": <score_team_2>
    },
    "mvp_player": "<mvp_player>"
}
"""
    case "POLITICS":
        extraction_system_message = """
# Your Purpose
1. You will extract insights related to politics from the article content provided by the user.
2. You will not generate or summarize the article content.
3. You will only extract insights related to politics.
# Output Format - JSON
You will respond in a valid JSON format with the extracted insights related to politics.
{
    "politician_names": ["<politician_1>", "<politician_2>"],
    "parties": ["<party_1>", "<party_2>"],
    "bills_sponsored": ["<bill_1>", "<bill_2>"]
}
"""
    case "DEVELOPMENT":
        extraction_system_message = """
# Your Purpose
1. You will extract insights related to development from the article content provided by the user.
2. You will not generate or summarize the article content.
3. You will only extract insights related to development.
# Output Format - JSON
You will respond in a valid JSON format with the extracted insights related to development.
{
    "code_snippets": ["<code_snippet_1>", "<code_snippet_2>"],
    "libraries_mentioned": ["<library_1>", "<library_2>"],
    "tools_mentioned": ["<tool_1>", "<tool_2>"]
}
"""
    case _:
        # NOT_APPLICABLE (or anything unexpected): there is nothing specific to extract.
        raise ValueError(f"No extraction prompt defined for article type: {article_type}")

completion = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": extraction_system_message
        },
        {
            "role": "user",
            "content": refined_article
        }
    ],
    response_format={"type": "json_object"},
    temperature=0,
    max_tokens=8000,
    top_p=1,
    stream=False,
    stop=None,
)
extracted_info = completion.choices[0].message.content
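As in step 7, extracted_info comes back as a JSON string, so you'll likely want to parse it before using it:

# Parse the model's JSON response into a dict for downstream use.
insights = json.loads(extracted_info)
print(json.dumps(insights, indent=2))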
These are just simple examples of what you can do once you know the type of article you're extracting insights from. You can customize the System Message to extract whatever insights you find useful. Going back to the Knowledge Graph example from before, instead of asking for JSON in this step we could ask for Cypher queries to insert the data into a Neo4j graph database, which you can then query to answer questions like the ones I mentioned earlier. Check out that talk from NVIDIA GTC I mentioned to see how they use an LLM to create Cypher queries that insert data into a Neo4j graph database. There is also a course from Neo4j on "Knowledge Graphs for RAG" that I found to be a great introduction to Knowledge Graphs and to working with them alongside LLMs.
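To give a flavor of the Cypher route, here is a minimal sketch. The system message, the (:Article)-[:MENTIONS]->(:Library) schema, and the connection details are all illustrative assumptions, and in practice you'd want to validate the generated Cypher before executing it:

# A minimal sketch of asking for Cypher instead of JSON and loading it into Neo4j.
# `client` and `refined_article` come from the earlier steps; the prompt, schema,
# and connection details are hypothetical.
from neo4j import GraphDatabase

cypher_system_message = """
You will extract libraries and tools from the article content provided by the user.
Respond with only valid Cypher MERGE statements that create
(:Article {title: <article title>})-[:MENTIONS]->(:Library {name: <library name>})
relationships, one complete statement per line. No explanations and no code fences.
"""

completion = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": cypher_system_message},
        {"role": "user", "content": refined_article},
    ],
    temperature=0,
    max_tokens=8000,
)
cypher_statements = completion.choices[0].message.content

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for statement in cypher_statements.splitlines():
        if statement.strip():
            session.run(statement)
driver.close()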
My hope for this article is that it inspires you to think about more useful AI use cases so we can prove Jim Cramer wrong about 75% of AI use cases being bogus.
As always, feel free to connect with me on LinkedIn. I'd love to hear your feedback and thoughts.