Alibaba Group Holding Ltd recently unveiled R1-Omni, an advanced AI model developed by its Tongyi lab that can interpret human emotions, marking significant progress in computer vision. In demonstrations, R1-Omni inferred the emotional state of people in videos with notable accuracy, while also giving detailed descriptions of their clothing and surroundings.
The model builds on HumanOmni, an open-source model, and adds the ability to read emotions, which places Alibaba a step ahead of other leaders in the AI field.
AI emotion recognition means identifying and interpreting human emotional states from observable cues such as facial expressions, body language, tone of voice, and physiological signals.

The table below reports performance on emotion recognition datasets. We use symbols to mark whether the data is in-distribution (⬤) or out-of-distribution (△).
| Method | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
| --- | --- | --- | --- | --- | --- | --- |
| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |
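
The article does not expand the WAR and UAR column labels. In emotion recognition benchmarks they usually stand for weighted average recall (per-class recall weighted by class size, which equals overall accuracy) and unweighted average recall (the plain mean of per-class recall). The sketch below assumes that reading; the function name `war_uar` and the toy labels are made up for illustration and are not taken from the R1-Omni evaluation code.

```python
from collections import Counter

def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as percentages for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    support = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # WAR: per-class recall weighted by class size, i.e. overall accuracy
    war = 100.0 * sum(hits.values()) / len(y_true)
    # UAR: plain mean of per-class recall, so every class counts equally
    uar = 100.0 * sum(hits[c] / n for c, n in support.items()) / len(support)
    return war, uar

# Toy, made-up labels: 8 "happy" clips and 2 "angry" clips
y_true = ["happy"] * 8 + ["angry"] * 2
y_pred = ["happy"] * 8 + ["happy", "angry"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.1f} UAR={uar:.1f}")  # WAR=90.0 UAR=75.0
```

On imbalanced data the two metrics diverge: in the toy example one missed "angry" clip barely moves WAR but cuts UAR to 75, which is why benchmark tables tend to report both.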
Emotion recognition technology, also known as “emotion AI” or “affective computing”, combines computer science and psychology to enable empathetic interactions between humans and computers. Emotion analysis relies on three main approaches:
Text analysis: processing written or transcribed speech to understand the sentiment it expresses, using natural language processing (NLP) and sentiment analysis algorithms.
Visual analysis: interpreting facial expressions and body language with computer vision algorithms.
Audio analysis: evaluating vocal characteristics such as tone, timbre, and rhythm to detect emotions.
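
One simple way to combine the three approaches listed above is late fusion: each modality produces its own probability distribution over emotions, and the distributions are averaged. The sketch below is a generic illustration under that assumption, not a description of how R1-Omni works internally; the label set, the function `late_fusion`, and the scores are all hypothetical.

```python
EMOTIONS = ["angry", "happy", "neutral", "sad", "surprise"]  # illustrative label set

def late_fusion(scores_by_modality, weights=None):
    """Weighted average of per-modality probability distributions."""
    weights = weights or {m: 1.0 for m in scores_by_modality}
    total = sum(weights[m] for m in scores_by_modality)
    fused = {e: 0.0 for e in EMOTIONS}
    for modality, scores in scores_by_modality.items():
        for emotion in EMOTIONS:
            # Labels a modality did not score are treated as probability 0
            fused[emotion] += weights[modality] * scores.get(emotion, 0.0) / total
    return fused

# Hypothetical outputs of three separate classifiers for one video clip
scores = {
    "text":   {"happy": 0.5, "neutral": 0.4, "surprise": 0.1},
    "visual": {"happy": 0.7, "surprise": 0.2, "neutral": 0.1},
    "audio":  {"happy": 0.6, "neutral": 0.3, "sad": 0.1},
}
fused = late_fusion(scores)
top = max(fused, key=fused.get)
print(top, round(fused[top], 2))
```

Passing a `weights` dictionary lets one modality count more than the others, for instance when the audio track is known to be noisy.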
Integrating these techniques gives a more complete picture of human emotions and improves human-machine interaction.
Using AI for emotion recognition, however, raises significant ethical and privacy concerns. For example, the EU AI Act lists among its prohibited practices the use of AI systems to infer emotions in workplaces and educational institutions, except for medical or safety reasons.
In conclusion, Alibaba's development of R1-Omni represents a significant step toward artificial general intelligence (AGI), bringing machines closer to a deeper understanding of human emotions.
User: As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags.
HumanOmni-0.5B: fear
EMER-SFT: <think>In the video, a man in a brown jacket stands in front of a vibrant mural, his face showing clear signs of anger. His furrowed brows and open mouth express his dissatisfaction. From his expressions and vocal traits, it can be inferred that he is experiencing intense emotional turmoil. The audio content mentions "fear" and "lowering one's voice," suggesting that he is facing an urgent situation or expressing strong emotions during an emotionally tense moment. Overall, the analysis indicates that the man is experiencing a deep sense of fear and anxiety.</think>
R1-Omni: <think>In the video, a man in a brown jacket stands in front of a vibrant mural. He is wearing a pink shirt underneath his brown jacket, and his hair is dark and curly. His facial expression is complex, with wide eyes, slightly open mouth, raised eyebrows, and furrowed brows, revealing surprise and anger. Speech recognition technology suggests that his voice contains words like "you", "lower your voice", "freaking out", indicating strong emotions and agitation. Overall, he displays an emotional state of confusion, anger, and excitement.</think>
HumanOmni-0.5B: happy
EMER-SFT: <think>In the video, we see a woman in an indoor setting that appears to be a personal space or a study room. She is wearing a red headband and a beige blazer over a white top, with her hair neatly tied back. Her facial expression shows slight confusion and surprise, as if she has encountered something unexpected. As time goes on, her emotions seem to calm down, but the specific reasons for this are not explained. In the audio, there is a long silence between "but" and "it was," indicating hesitation and uncertainty. In the text, the subtitle reads: "But it was interesting." This sentence may express the confusion and surprise of the person. Based on the video clues of the woman's slightly confused and surprised expression, as well as her overall appearance and facial expression showing minor confusion and surprise, we can infer that this sentence may have a hesitant or uncertain tone. Additionally, the audio clue describing a long silence between "but" and "it was" also indicates some hesitation and uncertainty. Therefore, this sentence expresses the woman's interest and curiosity in something, while also showing a certain level of hesitation and uncertainty.</think>
R1-Omni: <think>In the video, in the opening scene, we see a woman with her eyes slightly closed and mouth slowly opening as if she is laughing. Her facial expression appears somewhat joyful, which may indicate that she is experiencing some pleasant or amusing situation at that moment. In the audio, there are no pauses between sentences, they flow continuously, and the tone is light and cheerful. Combined with the text content, it can be felt that the character is in a very happy and positive emotional state. In the text, the subtitle reads: "It was interesting." This sentence may express the woman's satisfaction and curiosity towards something or someone. Based on the laughter and joyful emotion displayed by the woman in the video clues, as well as the description of continuous flowing, light and cheerful tone in the audio clues, we can infer that this sentence expresses the woman's interest and curiosity towards something. The joy in this sentence aligns with the overall display of happiness and positivity.</think>
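
The prompt in these examples asks the model to put its reasoning in `<think> </think>` and the final emotion in `<answer> </answer>`. A small parser for that format might look like the sketch below; the function `parse_response` and the sample string are made up for illustration and are not part of any R1-Omni tooling.

```python
import re

def parse_response(text):
    """Extract the <think> and <answer> sections; either may be missing."""
    def grab(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None
    return {"think": grab("think"), "answer": grab("answer")}

# Made-up response in the requested format
sample = "<think>Raised eyebrows and a loud voice.</think>\n<answer>angry</answer>"
parsed = parse_response(sample)
print(parsed["answer"])  # angry

# A bare label with no tags, like the HumanOmni-0.5B outputs quoted above
print(parse_response("fear"))  # {'think': None, 'answer': None}
```

Returning `None` for a missing tag, rather than raising, makes it easy to count how often a model ignores the requested format.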