What you could build on top of our data at a Hackathon - Veridion

What you could build on top of our data at a Hackathon

By: Claudiu Dima - 28 June 2023

Hackathons are the perfect platform for innovation, creativity, and problem-solving. In this article, we present three problem ideas that blend artificial intelligence (AI) and business, offering a fresh perspective for your next hackathon. The challenges are crafting company descriptions from images, summarising business operations in five words, and detecting website clones from screenshots. Let’s dive into these problem ideas.

 

Hackathon Problem Idea #1: Crafting Company Descriptions from Site Images

Generating company descriptions from images stands out as an intriguing hackathon challenge. It combines two modalities, vision and text, and it has a practical use: producing a description for any company that publishes images but little descriptive text.

 

The Concept

Imagine taking one or more images related to a company – sourced from their website, social media platforms, or online articles – and generating a comprehensive description of the company based on these visuals. This idea, while seemingly complex, can be broken down into manageable components with the right approach.

 

The Approach

The proposed solution for this problem would involve the use of an image-to-text transformer-based model. This model would combine the encoder of a Vision Transformer (ViT) with the decoder of a text-based Transformer, creating a powerful image-to-text encoder-decoder model.

Here’s a step-by-step breakdown of the process:

  1. Image Processing: The ViT would generate a sequence of feature vectors that represent the image. These feature vectors would then serve as the input for the text-based Transformer decoder.
  2. Text Generation: The decoder, which could be a GPT-style model or the decoder half of an encoder-decoder model like T5, would attend to the feature vectors from the ViT encoder and generate a sequence of tokens. These tokens would then form the output text, i.e., the company description.
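
To make "a sequence of feature vectors" concrete, here is a minimal sketch of the very first thing a ViT does with an image: cut it into fixed-size patches and flatten each one into a vector. It is an illustration only. A real ViT then applies a learned projection, position embeddings and a stack of Transformer layers, none of which are modelled here, and the `patchify` helper and the toy 8x8 image are made up for the example.

```python
from typing import List

Image = List[List[float]]  # greyscale image as rows of pixel intensities


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into non-overlapping square patches, each flattened to a vector."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")

    patches = []
    # Walk the image patch by patch, left to right and top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 8x8 "image" whose pixel value encodes its position (illustrative data).
toy_image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
sequence = patchify(toy_image, patch_size=4)
print(len(sequence), len(sequence[0]))  # number of patches, values per patch
```

With a 224x224 input and 16x16 patches, the same arithmetic gives 14 x 14 = 196 patch vectors, which is the kind of sequence the decoder would attend to.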

 

The Data

The success of this approach hinges on the quality and relevance of the data used. Access to company data, including the domain and a description of the company, would be essential.

Here’s how the data collection and processing would work:

  1. Company Selection: Select companies that have a rich collection of images and comprehensive descriptions.
  2. Image Scraping: Scrape the images from the selected company websites and use a heuristic method to select the most representative ones.
  3. Training: Use the scraped image and the company’s name as input for the model, and train it to generate the existing company description. This process would create a robust set of training examples.
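
The data steps above can be sketched as follows. This is one possible shape for the pipeline, not a prescribed one: the `representativeness` heuristic (prefer large images, reject icons and banner-shaped images), its thresholds, the class names and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScrapedImage:
    url: str
    width: int
    height: int


@dataclass
class TrainingExample:
    company_name: str
    image_url: str
    target_description: str


def representativeness(image: ScrapedImage) -> float:
    """Hypothetical heuristic: large images with a moderate aspect ratio score higher."""
    if image.width < 200 or image.height < 200:
        return 0.0  # probably an icon or tracking pixel
    aspect = max(image.width, image.height) / min(image.width, image.height)
    if aspect > 3.0:
        return 0.0  # probably a banner or divider
    return image.width * image.height / aspect


def build_examples(
    name: str, description: str, images: List[ScrapedImage], top_k: int = 2
) -> List[TrainingExample]:
    """Pair the best-scoring images with the existing description as the target."""
    ranked = sorted(images, key=representativeness, reverse=True)
    kept = [img for img in ranked[:top_k] if representativeness(img) > 0]
    return [TrainingExample(name, img.url, description) for img in kept]


# Illustrative data only; the company and URLs are made up.
description = "Cloud accounting software for freelancers."
images = [
    ScrapedImage("https://acme.example/icon.png", 32, 32),
    ScrapedImage("https://acme.example/banner.jpg", 1600, 200),
    ScrapedImage("https://acme.example/hero.jpg", 1200, 800),
    ScrapedImage("https://acme.example/team.jpg", 600, 600),
]
examples = build_examples("Acme Books", description, images)
for example in examples:
    print(example.company_name, example.image_url)
```

Each resulting example carries the company name and one image as input and the existing description as the training target, which matches the training setup described in step 3.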

By leveraging the power of the available data and the transformer architecture, this approach can effectively incorporate multiple modalities. The result? The ability to generate detailed descriptions for companies based solely on their name and an image. This innovative solution not only presents a fascinating problem for a hackathon but also opens up new possibilities in the field of image-to-text transformation.


 

For More Info – Join the Online Q&A Session

Q&A Session Hacking Big Numbers - Friday 30 June, 17:30

 


 

Hackathon Problem Idea #2: Distilling Company Descriptions into Five Words

 

Brevity and precision are often as valued as complexity and depth. This problem idea presents a unique challenge: summarising what a company does using only five words. It’s a text-based problem that also involves optimisation, requiring participants to choose their words wisely.

 

The Concept

The task is straightforward yet challenging: distil the essence of a company’s operations into a five-word description. This problem not only tests the participants’ understanding of the company but also their ability to convey complex ideas succinctly.

 

The Approach

The proposed solution for this problem would involve the use of a text-to-text encoder-decoder model, such as BART or T5. However, the key to success lies in the data preparation and processing.

 

Here’s a step-by-step breakdown of the process:

  1. Data Preparation: You have access to the Veridion database, which contains great descriptions of many companies. The first step would be to summarise these descriptions. This could be done in multiple passes, and depending on the final architecture chosen, you might not need many training examples. It’s even viable to use advanced models like GPT-3.5 or GPT-4 for this summarisation task.
  2. Data Validation: Once the summaries are ready, compare each one with its original description by embedding both texts and measuring how similar the embeddings are (for example, with cosine similarity). This will help you identify and discard summaries that have lost vital information during the summarisation process.
  3. Model Training: With the validated training pairs, you can train your encoder-decoder model. Customise the generation parameters to discourage repetition and impose a penalty on length. You might also consider a custom loss function to further optimise the model.
  4. Prediction: During prediction, you can control the generation to output exactly five words. Utilise beam search to help you choose the best possible five words that describe the company.
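
The prediction step can be illustrated with a small beam search that is forced to stop at exactly five words and never repeats a word. This is a sketch under stated assumptions: the `toy_scorer` and its hand-written bigram table stand in for a trained decoder, and the probabilities are invented for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

# A scorer maps the words chosen so far to log-probabilities for the next word.
Scorer = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search_fixed_length(
    score_next: Scorer, length: int = 5, beam_width: int = 3
) -> List[str]:
    """Return the best-scoring sequence of exactly `length` distinct words."""
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(length):
        candidates = []
        for total, words in beams:
            for word, logp in score_next(words).items():
                if word in words:  # hard constraint: no repeated words
                    continue
                candidates.append((total + logp, words + (word,)))
        if not candidates:
            raise ValueError("scorer ran out of candidate words")
        # Keep only the `beam_width` highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return list(beams[0][1])


# Hypothetical bigram probabilities standing in for a trained decoder.
BIGRAMS = {
    "<s>": {"cloud": 0.6, "online": 0.4},
    "cloud": {"accounting": 0.7, "software": 0.3},
    "online": {"accounting": 0.6, "software": 0.4},
    "accounting": {"software": 0.8, "for": 0.2},
    "software": {"for": 0.9, "accounting": 0.1},
    "for": {"freelancers": 0.6, "startups": 0.4},
}


def toy_scorer(words: Tuple[str, ...]) -> Dict[str, float]:
    last = words[-1] if words else "<s>"
    return {w: math.log(p) for w, p in BIGRAMS.get(last, {}).items()}


five_words = beam_search_fixed_length(toy_scorer)
print(" ".join(five_words))
```

If you generate with Hugging Face Transformers instead, `generate` exposes related controls such as `num_beams`, `no_repeat_ngram_size`, `repetition_penalty` and `length_penalty`. Those operate on tokens rather than words, so an exact five-word limit would still need a check like the one above on the decoded text.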

 

The Outcome

This approach presents an intriguing blend of text summarisation, machine learning, and optimisation. The resulting five-word descriptions not only encapsulate the essence of a company but also demonstrate the power of concise communication. This problem idea offers a unique challenge for hackathon participants, pushing them to innovate within constraints and deliver precise, meaningful results.

 


 

Hackathon Problem Idea #3: Identifying Cloned Websites from a Screenshot

Website cloning is a prevalent issue online. This hackathon problem idea presents a unique challenge: given a screenshot of a company’s website, can you identify any clones of it and provide their domains? This vision-based problem comes with its own set of challenges, primarily due to the massive scale of the task.

 

The Concept

With approximately 80 million companies in our database alone, not to mention countless blogs, publications, and parked sites, the task of finding duplicates efficiently can be daunting. However, with a well-structured approach, it becomes a fascinating problem to solve.

 

The Approach

The proposed solution would involve the use of a Vision Transformer (ViT) to generate embeddings for all the screenshots in our database and index them in a vector database. This process might be time-consuming, but it only needs to be done once.

 

Here’s a step-by-step breakdown of the process:

  1. Screenshot Embedding: Use a ViT to generate embeddings for all the screenshots in the database.
  2. Vector Database Creation: Instantiate a vector database, a specialised database system that stores and manages high-dimensional vectors. This enables efficient similarity search operations, which are crucial in machine learning and AI applications where data is often represented as vectors in high-dimensional space.
  3. Index Creation: Insert all the screenshot embeddings into the vector database and create an index.

When a new screenshot is provided, the following steps would be taken:

  1. Image Encoding: Encode the image using the same transformer, converting it into a high-dimensional vector.
  2. Similarity Search: Perform a similarity search using the vector database. This search will be highly efficient, as all the screenshots are already indexed, making the process much faster than a linear search.

The vector database will return the best candidates, i.e., the screenshots most similar to the provided one. A well-implemented database should be able to perform this search in less than a second.
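
The indexing and lookup flow above can be sketched with a tiny in-memory stand-in for the vector database. Note the assumptions: `BruteForceIndex` is a made-up class that does an exact, linear scan, so it only shows the interface (add embeddings, then query by cosine similarity). A real vector database would replace the scan with an approximate nearest-neighbour index to reach the speeds described here, and the three-dimensional "embeddings" are invented values rather than ViT outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalise(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


class BruteForceIndex:
    """In-memory stand-in for a vector database (exact, linear-time search)."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, domain: str, embedding: Sequence[float]) -> None:
        self._vectors[domain] = normalise(embedding)

    def search(self, query: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return the `top_k` (domain, cosine similarity) pairs, best first."""
        q = normalise(query)
        scored = [
            (domain, sum(a * b for a, b in zip(q, vec)))
            for domain, vec in self._vectors.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Illustrative embeddings; in practice these would come from the ViT.
index = BruteForceIndex()
index.add("shop.example", [0.90, 0.10, 0.00])
index.add("sh0p.example", [0.88, 0.12, 0.01])
index.add("news.example", [0.00, 0.20, 0.95])

results = index.search([0.90, 0.10, 0.00], top_k=2)
for domain, similarity in results:
    print(domain, round(similarity, 4))
```

Normalising every vector once at insert time keeps each comparison to a single dot product, which is also how many vector databases implement cosine similarity.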

Based on the similarity score and other heuristics, you can then decide if there are any clones of the original website.
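
A minimal sketch of that final decision, assuming the search returns (domain, similarity) pairs. The 0.95 threshold, the `canonical` helper and the rule that a `www.` variant of the same domain does not count as a clone are all illustrative assumptions; a real system would tune the threshold on labelled examples and likely add further heuristics.

```python
from typing import List, Sequence, Tuple


def canonical(domain: str) -> str:
    """Lower-case a domain and drop a leading 'www.' (hypothetical normalisation)."""
    domain = domain.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


def likely_clones(
    original_domain: str,
    candidates: Sequence[Tuple[str, float]],
    min_similarity: float = 0.95,  # assumed threshold, not a recommended value
) -> List[str]:
    """Return candidate domains that look like clones of the original site."""
    original = canonical(original_domain)
    clones = []
    for domain, similarity in candidates:
        if canonical(domain) == original:
            continue  # the site itself, or a trivial variant of it
        if similarity >= min_similarity:
            clones.append(domain)
    return clones


# Illustrative search results for a screenshot of shop.example.
candidates = [
    ("shop.example", 1.0),
    ("www.shop.example", 0.999),
    ("sh0p.example", 0.97),
    ("news.example", 0.02),
]
clones = likely_clones("shop.example", candidates)
print(clones)
```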

 

The Outcome

This approach presents an intriguing blend of image processing, machine learning, and database management. The resulting system not only identifies potential website clones but also demonstrates the power of efficient similarity search operations.

 


 

Wrapping Up

 

These problem ideas offer a glimpse into the vast potential of AI in solving real-world business challenges. They not only test your technical skills but also your creativity and problem-solving abilities. Whether you’re generating company descriptions from images, distilling business operations into five words, or identifying cloned websites from screenshots, each challenge presents an opportunity to push the boundaries of what’s possible. As you gear up for the hackathon, we hope these ideas inspire you to create innovative solutions that make a real impact. Happy hacking!