Building a Simple Content Retrieval Web App with Pinecone and Python Flask

Jan 24

2 min read



Overview

This proof-of-concept project, developed during the COVID-19 pandemic, showcases an end-to-end Natural Language Processing (NLP) pipeline built with the Pinecone vector database and Python Flask. Using the Pinecone Python SDK, the system performs a similarity search to identify articles that match or closely resemble the user's input.

For the purpose of this demonstration, I created a free Pinecone account. While the free tier does have limitations, it is sufficient for our needs.

In this post, we will walk through the key stages of the process:


  1. Converting and pre-processing an article dataset into vector embeddings

  2. Indexing the vector embeddings into a database

  3. Querying the database to retrieve similar articles

  4. Using a similarity metric to return the nearest neighbors


Although this project was created three years ago, it remains a valuable reference for building efficient content retrieval systems.


 

Dataset

This demonstration uses the "All the News" dataset from Kaggle, which contains 143,000 news articles collected from 15 different sources between 2000 and 2017, with most of the content coming from 2016 and 2017 and covering a broad range of topics. Each article has nine attributes: ID, Title, Publication, Author, Date, Year, Month, URL, and Content. For this project, however, we focus on the first 2,000 articles.
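
As a rough sketch, loading and trimming the data with pandas might look like the snippet below; the CSV filename and the lowercase column names are assumptions about the Kaggle export.

import pandas as pd

# Load the "All the News" CSV export (filename is a placeholder)
df = pd.read_csv("articles.csv")

# Keep only the first 2,000 articles for this proof of concept
df = df.head(2000)

# Only the title and article body are needed for embedding
df = df[["title", "content"]].dropna().reset_index(drop=True)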


 

Project tree


 

Pre-process data to use for vector embeddings

First, we initialize the Pinecone client and create an index that will store the vector embeddings and serve similarity queries efficiently.
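
A minimal sketch of this step, using the classic pinecone-client interface that was current when this project was built (newer SDK versions use a Pinecone client class instead), might look like the following. The API key, environment, index name, and embedding dimension are all assumptions.

import pinecone

# Connect to Pinecone (API key and environment are placeholders)
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

index_name = "article-search"  # hypothetical index name

# Create the index once; 384 matches the embedding size of the model used below
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")

index = pinecone.Index(index_name)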


Next, we leverage a pre-trained SentenceTransformer model from Hugging Face to generate vector embeddings from the Title and Article columns.
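
As an illustration, and assuming the widely used all-MiniLM-L6-v2 model (the exact model in the original project may differ), the embedding step could look like this, reusing the DataFrame from the dataset step:

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (model choice is an assumption)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Combine the title and article text into one string per row
texts = (df["title"] + ". " + df["content"]).tolist()

# Encode into 384-dimensional embedding vectors
embeddings = model.encode(texts, show_progress_bar=True)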


 

Index the vector embeddings into a database

The vector embeddings are inserted into a Pinecone-managed vector database index. This index uses cosine similarity as the metric to compare the embeddings, efficiently identifying sentences with similar meanings.
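
Under the same assumptions as the earlier snippets, upserting the embeddings in batches could look roughly like this:

# Upsert (id, vector, metadata) tuples in batches to stay within request size limits
batch_size = 100
for start in range(0, len(embeddings), batch_size):
    end = min(start + batch_size, len(embeddings))
    batch = [
        (str(i), embeddings[i].tolist(), {"title": df.loc[i, "title"]})
        for i in range(start, end)
    ]
    index.upsert(vectors=batch)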


 

Querying the database to retrieve similar articles

Once the vector embeddings are added to the database and indexed, the user is ready to start finding similar content. When the user submits article text as input, we convert it into a query vector using the same SentenceTransformer model from the pre-processing step, then make an API request through Pinecone's SDK to query the indexed embeddings and retrieve the most relevant results.
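
A simplified Flask route illustrating this flow is sketched below, reusing the model and index objects from the earlier snippets. The route name, form field, and JSON response are assumptions; the real app renders an HTML results page instead.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    # Embed the submitted article text with the same model used for indexing
    query_text = request.form["article_text"]
    query_vector = model.encode(query_text).tolist()

    # Ask Pinecone for the ten nearest neighbors of the query vector
    results = index.query(vector=query_vector, top_k=10, include_metadata=True)

    matches = [
        {"id": m.id, "title": m.metadata["title"], "score": m.score}
        for m in results.matches
    ]
    return jsonify(matches)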



 

Using a similarity metric to return the nearest neighbors

By calculating the cosine similarity between the query embedding and the content embeddings stored in the index, we can retrieve the top 10 most relevant articles.
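
Pinecone computes this metric at scale inside the index; the small sketch below, reusing the query vector and embeddings from the earlier snippets, only illustrates what the returned score represents.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: compare the query embedding against one stored article embedding
score = cosine_similarity(query_vector, embeddings[0])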




👉 See it in action: Plagiarism Checker using NLP






About

Benjamin ("Benj") Tabares Jr. is an experienced data practitioner with a strong track record of successfully delivering short- and long-term projects in data engineering, business intelligence, and machine learning. Passionate about solving complex customer challenges, Benj leverages data and technology to create impactful solutions. He collaborates closely with clients and stakeholders to deliver scalable data solutions that unlock business value and drive meaningful insights from data.

