Develop ML Webapp with Hugging Face APIs

Through this experience, I explored the possibility of using Hugging Face as a backend API for hosting multiple ML engines to create a multimodal experience.

1. What is Hugging Face 🤗

Hugging Face is undoubtedly one of the most influential AI startups, and you hear its name often. Its backers include Google, Amazon, Nvidia, Salesforce, AMD, Intel, IBM and Qualcomm, and releasing the newest ML models through its platform has become a tradition in the AI industry, from big corporations to small companies.

2. Hugging Face Hub Overview

Some describe Hugging Face as a "GitHub" for the AI community. If that is true, what makes Hugging Face shine more than GitHub? To understand this, you need to look into three concepts - Models, Datasets and Spaces.

  • Model Hub : The Model Hub is where the members of the Hugging Face community can host all of their models.
  • Dataset Hub : Hugging Face Hub hosts a large number of community-curated datasets, and each dataset is a Git repository.
  • Space Hub : This is the core differentiator of Hugging Face from other development hubs. Hugging Face provides intuitive libraries (ML and GUI) and infrastructure to quickly showcase use cases of the models. It uses Git under the hood.
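To get a feel for how the Model Hub and Dataset Hub are exposed programmatically, here is a minimal sketch using the huggingface_hub and datasets libraries. The search filter and dataset name are illustrative examples only, not ones used later in this post.

from huggingface_hub import HfApi
from datasets import load_dataset

api = HfApi()

# Model Hub: search the hosted models, e.g. text-to-audio models
for model in api.list_models(filter="text-to-audio", limit=3):
    print(model.id)

# Dataset Hub: each dataset is a Git repository that can be loaded by name
dataset = load_dataset("rotten_tomatoes")
print(dataset)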

3. What do I want to Build

As I mentioned above, the beauty of Hugging Face is in the Space Hub, where much effort has gone into abstracting the complex technical layers of ML with the Transformers library and providing pre-configured infrastructure to deploy experiences and share them with the broader community quickly.

Let's explore the Hugging Face Hub by creating an example experience. As I was on a tech journey to revive my old project (Brain Piano), I wanted to create something that produced audio output in the end. So, I jotted down my fast-prototyping strategy as below.

  • The output of the experience will be a music audio file.
  • I want a multimodal experience - having other types of media as input.

After much research, I decided to use two ML models from the Model Hub together with ChatGPT and weave them together to create a piece of music as described below.

Fig1. ML Model flow

4. How to Hugging Face

After creating an account, click the "+ New Space" button to create a Space.

(Screenshot: creating a new Space)

Name the project, and select the "Gradio" SDK as the development environment. Gradio is an open-source Python package that allows developers to create customized UIs for ML models quickly and effortlessly. Moreover, I was fascinated by the feature that automatically generates API endpoints alongside the web interface.

(Screenshot: Space configuration with the Gradio SDK selected)
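To give a flavour of how little code Gradio needs, here is a minimal sketch (unrelated to our project): launching it serves a web UI and automatically exposes the function through an API endpoint.

import gradio as gr

def greet(name):
    # Any Python function can be wrapped into a web demo
    return f"Hello, {name}!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()  # web UI plus an auto-generated API endpoint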

After creating the Space, you will be instructed to clone the repository to your local environment with familiar git commands. Later, you can push to main to deploy to the Space, after setting up SSH credentials.

(Screenshot: git instructions for cloning the Space repository)
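As an aside, the same Space creation and deployment can also be done from Python with the huggingface_hub library instead of the web UI and git. This is only a sketch; the Space name and folder path are placeholders, and it assumes you are already logged in (e.g. via huggingface-cli login).

from huggingface_hub import create_repo, upload_folder

# Create a Gradio Space programmatically (equivalent to the "+ New Space" button)
create_repo("your-username/your-space-name", repo_type="space", space_sdk="gradio")

# Deploy by uploading the local project folder (equivalent to pushing to main)
upload_folder(
    repo_id="your-username/your-space-name",
    repo_type="space",
    folder_path=".",
)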

5. Develop App

The snippet below shows the minimal files that need to be included in the project folder and their functions: requirements.txt is where you list your Python dependencies, packages.txt lists your Debian dependencies, and app.py is the entry point of your program.

# File Structure
.
├── requirements.txt   # Python dependencies
├── packages.txt       # Debian dependencies
└── app.py             # Our main code
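For reference, a plausible requirements.txt for this project might look like the list below. These package names are my assumption based on the imports in the code later in this post, not the exact pins of the original repository (gradio itself comes pre-installed with the Space's Gradio SDK).

transformers
torch
scipy
langchain
openai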

Now, let's write the app code. We import the Transformers library to download the pre-trained ML engines hosted on the Hugging Face Hub and run inference locally. You can easily find the instructions for importing ML models with Transformers on each task page.

import scipy.io.wavfile
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from transformers import AutoProcessor, MusicgenForConditionalGeneration, pipeline


# Step 1: Image to Text
def imageToText(url):
    # Caption the input image with the BLIP image-captioning model from the Model Hub
    image_to_text = pipeline(
        "image-to-text", model="Salesforce/blip-image-captioning-large"
    )
    text = image_to_text(url)
    return text[0]["generated_text"]


# Step 2: Generate a music suggestion from ChatGPT
def storyGeneratorGPT(user_input):
    template = """
    You are a music story teller;
    You can suggest music that suits the scenario;
    The suggested music should include the genre of the music as well as the style where it is inspired from;
    The suggestion should be no more than 20 words.

    CONTEXT: {scenario}
    STORY:
    """

    prompt = PromptTemplate(template=template, input_variables=["scenario"])
    # Requires the OPENAI_API_KEY environment variable to be set
    story_chain = LLMChain(
        llm=OpenAI(model_name="gpt-3.5-turbo", temperature=1),
        prompt=prompt,
        verbose=True,
    )
    story = story_chain.run(user_input)
    return story


# Step 3: Generate music based on the description
def generate(text):
    print("generate..")
    print(text)
    # Load the MusicGen processor and model from the Model Hub
    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
    inputs = processor(
        text=[text],
        padding=True,
        return_tensors="pt",
    )

    # 256 new tokens correspond to roughly 5 seconds of audio
    audio_values = model.generate(**inputs, max_new_tokens=256)
    sampling_rate = model.config.audio_encoder.sampling_rate
    resultFile = "musicgen_out.wav"
    scipy.io.wavfile.write(
        resultFile,
        rate=sampling_rate,
        data=audio_values[0, 0].numpy(),
    )
    return resultFile
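Before wiring these functions into Gradio, the chain can be smoke-tested locally. This is only a sketch: it assumes an example image beatles.png in the working directory (the one referenced in the Gradio example below) and an OPENAI_API_KEY environment variable, and you would remove it before deploying.

if __name__ == "__main__":
    caption = imageToText("beatles.png")      # caption describing the image
    print("Caption:", caption)

    suggestion = storyGeneratorGPT(caption)   # short music suggestion from ChatGPT
    print("Suggestion:", suggestion)

    wav_path = generate(suggestion)           # writes musicgen_out.wav
    print("Music written to:", wav_path)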

The next step is to cascade the three ML functions with Gradio UIs. You will see how simple it is to connect the code and generate a decent interface: by calling the gr.Series function, we can connect the inputs and outputs of the ML functions like LEGO blocks. The task is as simple as returning the values in the correct formats.

import gradio as gr

# ML functions above are defined here ...

series_1 = gr.Interface(
    fn=imageToText,
    inputs=gr.Image(type="pil"),
    outputs="text",
    examples=["beatles.png"],
)
series_2 = gr.Interface(fn=storyGeneratorGPT, inputs="text", outputs="text")
series_3 = gr.Interface(fn=generate, inputs="text", outputs="audio")
demo = gr.Series(series_1, series_2, series_3)
demo.launch()

After writing the code, I uploaded it to the Space. Then, after a while, you will find this fantastic interface. Please find the complete code here.

6. Final Result

I was not able to make the Space public due to the OpenAI API request limits. Nonetheless, I left the screen recordings and the output music below so you can get a flavour of the interface that was created.

Generated Music

I also included a link to a public Space where I put example prompts to Facebook's MusicGen (with optional melody input). Please note that it takes around 2-3 minutes to generate 5 seconds of music on the free-tier plan.

At the very bottom of the embedded interface, there is a "Use via API" button that you can click to see the exposed API endpoints.
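Those endpoints can be called programmatically as well. Below is a minimal sketch using the gradio_client package; the Space id, input value and api_name are placeholders, so check the "Use via API" panel for the exact signature.

from gradio_client import Client

# Placeholder Space id - replace with the Space shown above
client = Client("username/space-name")

# Arguments and api_name depend on the Space's exposed endpoints
result = client.predict("80s pop track with bassy drums and synth", api_name="/predict")
print(result)  # typically a local path to the generated audio file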