The next World Cup may not have Ronaldo or Messi, but you will be able to watch 3D soccer matches live in AR

  • March 13, 2020, 05:48
  • Source: ARVR Tech Report


Football matches land on your table thanks to augmented reality

It's World Cup season, so that means that even articles about machine learning have to have a football angle. Today's concession to the beautiful game is a system that takes 2D videos of matches and recreates them in 3D so you can watch them on your coffee table (assuming you have some kind of augmented reality setup, which you almost certainly don't). It's not as good as being there, but it might be better than watching it on TV.



The "Soccer On Your Tabletop" system takes as its input a video of a match and watches it carefully, tracking each player and their movements individually. The images of the players are then mapped onto 3D models "extracted from soccer video games," and placed on a 3D representation of the field. Basically they cross FIFA 18 with real life and produce a sort of miniature hybrid.


Considering the source data — two-dimensional, low-resolution, and in motion — it's a pretty serious accomplishment to reliably reconstruct a realistic and reasonably accurate 3D pose for each player.


Now, it's far from perfect. One might even say it's a bit useless. The characters' positions are estimated, so they jump around a bit, and the ball doesn't really appear much, so everyone appears to just be dancing around on a field. (That's on the to-do list.)


But the idea is great, and this is a working if highly limited first shot at it. Assuming the system could ingest a whole game based on multiple angles (it could source the footage directly from the networks), you could have a 3D replay available just minutes after the actual match concluded.


Not only that, but wouldn't it be cool to be able to gather round a central location and watch the game from multiple angles on it? I've always thought one of the worst things about watching sports on TV is that everyone sits there staring in one direction, seeing the exact same thing. Letting people spread out, pick sides, see things from different angles to analyze strategies — that would be fantastic.


All we need is for someone to invent a perfect, affordable holographic display that works from all angles and we're set.


The research is being presented at the Computer Vision and Pattern Recognition conference in Salt Lake City, and it's a collaboration between Facebook, Google, and the University of Washington.


[Figure 1]

Soccer On Your Tabletop

We present a system that transforms a monocular video of a soccer game into a moving 3D reconstruction, in which the players and field can be rendered interactively with a 3D viewer or through an Augmented Reality device. At the heart of our paper is an approach to estimate the depth map of each player, using a CNN that is trained on 3D player data extracted from soccer video games. We compare with state of the art body pose and depth estimation techniques, and show results on both synthetic ground truth benchmarks, and real YouTube soccer footage.


1. Introduction

Imagine watching a 3D hologram of a live soccer game on your living room table; you can walk around with an Augmented Reality device, watch the players from different viewpoints, and lean in to see the action up close.


One way to create such an experience is to equip the soccer field with many cameras, synchronize the cameras, and then reconstruct the field and players in 3D using multiview geometry techniques. Approaches of that spirit were previously proposed in the literature [14, 13, 19] and even commercialized as Replay's FreeD, and others [1]. The results of multi-view methods are impressive; however, the requirement of physically instrumenting the field with many synchronized cameras limits their generality. What if, instead, we could reconstruct any soccer game just from a single YouTube video? This is the goal of this paper.


There are numerous challenges in monocular reconstruction of a soccer game. We must estimate the camera pose relative to the field, detect and track each of the players, reconstruct their body shapes and poses, and render the combined reconstruction.


[Figure 2]

We present the first end-to-end system (Fig. 2) that accomplishes this goal (short of reconstructing the ball, which remains future work). In addition to the system, a key technical contribution of our paper is a novel method for player body depth map estimation from a single frame. Our approach is trained on meshes extracted from FIFA video games. Based on this data, a neural network estimates per-pixel depth values of any new soccer player, comparing favorably to other state-of-the-art body depth and pose estimation techniques.


We present results on 10 YouTube games of different teams. Our results can be rendered using any 3D viewer, enabling free-viewpoint navigation from the side of the field recorded by the game camera. We also implemented “holographic” Augmented Reality viewing with HoloLens, projected onto a tabletop. See the supplementary material for the AR video results and the 3D model of the game.


2. Related Work

Sports Analysis Sports game analysis has been extensively investigated from the perspectives of image processing, computer vision, and computer graphics [32], both for academic research and for industry applications. Understanding a sports game involves several steps, from field localization to player detection, tracking, segmentation, etc. Most sports have a predefined area where the action is happening; therefore, it is essential to localize that area w.r.t. the camera. This can be done with manual correspondences and calibration based on, e.g., edges [5], or fully automatically [21]. In this work, we follow a field localization approach similar to [5].


Sports reconstruction can be achieved using multiple cameras or specialized equipment, an approach that has been applied to free viewpoint navigation and 3D replays of games. Products such as Intel FreeD [1] produce new viewing experiences by incorporating data from multiple cameras. Similarly, having a multi-camera setup allows multiview stereo methods [18, 19] for free viewpoint navigation [17, 47, 16], view interpolation based on player triangulation [14], or view interpolation by representing players as billboards [13]. In this paper, we show that reliable reconstruction from monocular video is now becoming possible due to recent advances in people detection [38, 7], tracking [31], pose estimation [49, 37], segmentation [20], and deep learning networks. In our framework, the input is broadcast video of a game, readily available on YouTube and other online media sites.


Human Analysis Recently, there has been enormous improvement in people analysis using deep learning. Person detection [38, 7] and pose estimation [49, 37] provide robust building blocks for further analysis of images and video. Similarly, semantic segmentation can provide pixel-level predictions for a large number of classes [51, 27]. In our work, we use such predictions (bounding boxes from [38], pose keypoints [49], and people segmentation [51]) as input steps towards a full system where the input is a single video sequence, and the output is a 3D model of the scene.


Analysis and reconstruction of people from depth sensors is an active area of research [44, 3], but the use of depth sensors in outdoor scenarios is limited because of the interference with abundant natural light. An alternative would be to use synthetic data [48, 22, 46, 43], but these virtual worlds are far from our soccer scenario. There is extensive work on depth estimation from images/videos of indoor [10] and road [15] scenes, but not explicitly for humans. Recently, the work of [48] proposes a human part and depth estimation method trained on synthetic data. They fit a parametric human model [29] to motion capture data and use cloth textures to model appearance variability for arbitrary subjects and poses when constructing their dataset. In contrast, our approach takes advantage of the restricted soccer scenario, for which we construct a dataset of depth map / image pairs of players in typical soccer clothing and body poses extracted from a high quality video game. Another approach that can indirectly infer depth for humans from 2D images is [4]. This work estimates the pose and shape parameters of a 3D parametric shape model in order to fit the observed 2D pose estimation. However, the method relies on robust 2D poses, and the reconstructed shape does not fit to the players' clothing. We compare to both of these methods in the Experiments section.


Multi-camera rigs are required for many motion capture and reconstruction methods [8, 45]. [33] uses a CNN person segmentation per camera and fuses the estimations in 3D. Body pose estimation from multiple cameras is used for outdoor motion capture in [40, 11]. In the case of a single camera, motion capture can be obtained using 3D pose estimators [35, 36, 30]. However, these methods provide the 3D position only for skeleton joints; estimating full human depth would require additional steps such as parametric shape fitting. We require only a single camera.


[Figure 3]

3. Soccer player depth map estimation

A key component of our system is a method for estimating a depth map for a soccer player given only a single image of the player. In this section, we describe how we train a deep network to perform this task.


3.1. Training data from FIFA video games

State-of-the-art datasets for human shape modeling mostly focus on general representation of human bodies and aim at diversity of body shape and clothing [29, 48]. Instead, to optimize for accuracy and performance in our problem, we want a training dataset that focuses solely on soccer, where clothing, players' poses, camera views, and positions on the field are very constrained. Since our goal is to estimate a depth map given a single photo of a soccer player, the ideal training data would be image and depth map pairs of soccer players in various body poses and clothing, viewed from a typical soccer game camera.


The question is: how do we acquire such ideal data? It turns out that while playing Electronic Arts FIFA games and intercepting the calls between the game engine and the GPU [42, 41], it is possible to extract depth maps from video game frames.


In particular, we use RenderDoc [2] to intercept the calls between the game engine and the GPU. FIFA, similar to most games, uses deferred shading during game play. Having access to the GPU calls enables capture of the depth and color buffers per frame. Once depth and color are captured for a given frame, we process it to extract the players.


The extracted color buffer is an RGB screen shot of the game, without the score and time counter overlays and the in-game indicators. The extracted depth buffer is in Normalized Device Coordinates (NDC), with values between 0 and 1. To get the world coordinates of the underlying scene we require the OpenGL camera matrices that were used for rendering. In our case, these matrices were not directly accessible in RenderDoc, so we estimated them (see Appendix A in supplementary material).


Given the game camera parameters, we can convert the z-buffer from the NDC to 3D points in world coordinates. The result is a point cloud that includes the players, the ground, and portions of the stadium when it is visible. The field lies in the plane y = 0. To keep only the players, we remove everything that is outside of the soccer field boundaries and all points on the field (i.e., points with y = 0). To separate the players from each other we use DBSCAN clustering [12] on their 3D locations. Finally, we project each player's 3D cluster to the image and recalculate the depth buffer with metric depth. Cropping the image and the depth buffer around the projected points gives us the image-depth pairs – we extracted 12000 of them – for training a depth estimation network (Fig. 3). Note that we use a player-centric depth estimation because we get more training data by breaking down each frame into 10-20 players, and it is easier for the network to learn individual player's configuration rather than whole-scene arrangements.
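The NDC-to-world conversion described above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes a standard OpenGL perspective projection with known near/far planes, and the helper names (`linearize_depth`, `unproject`) are hypothetical.

```python
import numpy as np

def linearize_depth(z_buffer, near, far):
    """Convert a [0, 1] depth-buffer value to metric eye-space depth.

    Assumes a standard OpenGL perspective projection, which maps
    eye-space depth nonlinearly into the z-buffer.
    """
    z_ndc = 2.0 * z_buffer - 1.0          # [0, 1] -> [-1, 1]
    return (2.0 * near * far) / (far + near - z_ndc * (far - near))

def unproject(z_buffer, inv_proj_view, width, height):
    """Lift every pixel of a depth buffer to a 3D world-space point.

    `inv_proj_view` is the inverse of the projection*view matrix
    estimated for the game camera; returns an (H, W, 3) point grid.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x_ndc = 2.0 * (xs + 0.5) / width - 1.0    # pixel centers -> NDC
    y_ndc = 1.0 - 2.0 * (ys + 0.5) / height   # image y is flipped
    z_ndc = 2.0 * z_buffer - 1.0
    ndc = np.stack([x_ndc, y_ndc, z_ndc, np.ones_like(z_ndc)], axis=-1)
    world = ndc @ inv_proj_view.T              # homogeneous coordinates
    return world[..., :3] / world[..., 3:4]    # perspective divide
```

Field points (y ≈ 0) and everything outside the field boundary would then be discarded before clustering the remaining points into players.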


3.2. Depth Estimation Neural Network

Given the depth-image pairs extracted from the video game, we train a neural network to estimate depth for any new image of a soccer player. Our approach follows the hourglass network model [34, 48]: the input is processed by a sequence of hourglass modules – a series of residual blocks that lower the input resolution and then upscale it – and the output is depth estimates.


Specifically, the input of the network is a 256×256 RGB image cropped around a player together with a segmentation mask for the player, resulting in a 4-channel input. We experimented with training on no masks, ground truth masks, and estimated masks. Using masks noticeably improved results. In addition, we found that using estimated masks yielded better results than ground truth masks. With estimated masks, the network learns the noise that occurs in player segmentation during testing, where no ground truth masks are available. To calculate the player's mask, we apply the person segmentation network of [51], refined with a CRF [25]. Note that our network is single-player-centric: if there are overlapping players in the input image, it will try to estimate the depth of the center one (that originally generated the cropped image) and assign the other players' pixels to the background.


The input is processed by a series of 8 hourglass modules and the output of the network is a 64×64×50 volume, representing 49 quantized depths (as discrete classes) and 1 background class. The network was trained with cross-entropy loss with batch size of 6 for 300 epochs with learning rate 0.0001 using the Adam [24] solver (see details of the architecture in supplementary material).


The depth parameterization is performed as follows: first, we estimate a virtual vertical plane passing through the middle of the player and calculate its depth w.r.t. the camera. Then, we find the distance in depth values between a player's point and the plane. The distance is quantized into 49 bins (1 bin at the plane, 24 bins in front, 24 bins behind) at a spacing of 0.02 meters, roughly covering 0.5 meters in front and in back of the plane (1 meter depth span). In this way, all of our training images have a common reference point. Later, during testing, we can apply these distance offsets to a player's bounding box after lifting it into 3D (see Sec. 4.4).
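The bin layout above (49 classes at 0.02 m spacing around the player plane) can be sketched as a small quantization helper. The function names and the clipping behavior at the bin extremes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

N_BINS = 49          # 24 in front, 1 at the plane, 24 behind
BIN_SIZE = 0.02      # meters; spans roughly +/- 0.5 m around the plane

def depth_to_class(depth, plane_depth):
    """Quantize a metric depth into one of 49 signed-offset classes."""
    offset = depth - plane_depth                      # signed distance to plane
    idx = np.round(offset / BIN_SIZE).astype(int) + N_BINS // 2
    return np.clip(idx, 0, N_BINS - 1)                # clamp to valid bins

def class_to_depth(idx, plane_depth):
    """Invert the quantization (up to half a bin of error)."""
    return plane_depth + (idx - N_BINS // 2) * BIN_SIZE
```

Class 24 corresponds to a point lying exactly on the virtual plane; the roundtrip error is bounded by half the 0.02 m bin size.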


4. Reconstructing the Game

In this section we describe our full pipeline for 3D reconstruction from a soccer video clip.


4.1. Camera Pose Estimation

The first step is to estimate the per-frame parameters of the real game camera. Because soccer fields have specific dimensions and structure according to the rules of FIFA, we can estimate the camera parameters by aligning the image with a synthetic planar field template.
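A minimal sketch of the template-to-image alignment, assuming point correspondences between the planar field template and the frame are already available (in practice they would come from field lines and markings). The direct linear transform (DLT) below is a standard stand-in, not the paper's implementation.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: homography mapping src -> dst.

    `src` and `dst` are (N, 2) arrays of corresponding points, N >= 4
    (e.g., field-template corners and their image locations).
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right null vector of the constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apply_homography(h, pts):
    """Project (N, 2) points through homography h."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return pts_h[:, :2] / pts_h[:, 2:3]
```

Given the known metric dimensions of the field, the fitted homography can then be decomposed into (or used to initialize) the full camera parameters.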


4.2. Player Detection and Tracking

The first step of the video analysis is to detect the players in every frame. While detecting soccer players may seem straightforward due to the relatively uniform background, most state-of-the-art person detectors still have difficulty when, e.g., players from the same team occlude each other or the players are too small.


We start with a set of bounding boxes obtained with [39]. Next, we refine the initial bounding boxes based on pose information using the detected keypoints/skeletons from [49]. We observed that the estimated poses can better separate the players than just the bounding boxes, and the pose keypoints can be effectively used for tracking the players across frames.


Finally, we generate tracks over the sequence based on the refined bounding boxes. Every track has a starting and ending location in the video sequence. The distance between two tracks A and B is defined as the 2D Euclidean distance between the ending location of track A and the starting location of track B, assuming track B starts at a later frame than track A and their frame difference is smaller than a threshold (detailed parameters are described in supplementary material). We follow a greedy merging strategy. We start by considering all detected neck keypoints (we found this keypoint to be the most reliable to associate with a particular player) from all frames as separate tracks and we calculate their pairwise distances. Two tracks are merged if their distance is below a threshold, and we continue until there are no tracks to merge. This step associates every player with a set of bounding boxes and poses across frames. This information is essential for the later processing of the players, namely the temporal segmentation, depth estimation, and better placement in 3D. Fig. 2 shows the steps of detection, pose estimation, and tracking.
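The greedy merging strategy can be sketched roughly as below. The thresholds and the track representation (a sorted list of `(frame, x, y)` neck-keypoint detections) are illustrative assumptions; the paper's exact parameters are in its supplementary material.

```python
def merge_tracks(tracks, max_dist=50.0, max_frame_gap=10):
    """Greedily merge track fragments by endpoint proximity.

    Repeatedly links the closest pair (end of A, start of B) whose
    frame gap is positive and small enough, until no merge is possible.
    """
    tracks = [list(t) for t in tracks]
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(tracks):
            best = None
            for j, b in enumerate(tracks):
                if i == j:
                    continue
                gap = b[0][0] - a[-1][0]            # frame difference
                if not 0 < gap <= max_frame_gap:
                    continue
                d = ((a[-1][1] - b[0][1]) ** 2 +
                     (a[-1][2] - b[0][2]) ** 2) ** 0.5
                if d <= max_dist and (best is None or d < best[0]):
                    best = (d, j)
            if best is not None:
                tracks[i] = a + tracks[best[1]]     # append fragment B
                del tracks[best[1]]
                merged = True
                break
    return tracks
```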


4.3. Temporal Instance Segmentation

For every tracked player we need to estimate its segmentation mask to be used in the depth estimation network. A straightforward approach is to apply at each frame a person segmentation method [51], refined with a dense CRF [25], as we did for training. This can work well for the unoccluded players, but in the case of overlap, the network estimates are confused. Although there are training samples with occlusion, their number is not sufficient for the network to estimate the depth of one player (e.g. the one closer to the center) and assign the rest to the background. For this reason, we “help” the depth estimation network by providing a segmentation mask where the tracked player is the foreground and the field, stadium, and other players are background (this is similar to the instance segmentation problem [20, 50], but in a 1-vs-all scenario).


[Figure 4]

4.4. Mesh Generation

The foreground mask from the previous step, together with the original cropped image, are fed to the network described in 3.2. The output of the network is per-pixel, quantized signed distances between the player's surface and a virtual plane w.r.t. the camera. To obtain a metric depth map we first lift the bounding box of the player into 3D, creating a billboard (we assume that the bottom pixel of the player lies on the ground). We then apply the distance offsets output by the network to the 3D billboard to obtain the desired depth map.


The depth map is then unprojected to world coordinates using the camera parameters, generating the player's point cloud in 3D. Each pixel corresponds to a 3D point and we use pixel connectivity to establish faces. We texture-map the mesh with the input image. Depending on the application, the mesh can be further simplified with mesh decimation to reduce the file size for deployment in an AR device.
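Establishing faces from pixel connectivity can be sketched as follows, assuming the per-pixel 3D points have already been unprojected. The two-triangles-per-quad scheme is one straightforward reading of the text, not necessarily the authors' exact meshing.

```python
import numpy as np

def depth_to_mesh(points, mask):
    """Build a triangle mesh from a per-pixel 3D point grid.

    `points` is an (H, W, 3) array of unprojected positions and `mask`
    an (H, W) boolean array marking player pixels; two triangles are
    emitted for every 2x2 block whose four pixels are all foreground.
    """
    h, w = mask.shape
    idx = np.arange(h * w).reshape(h, w)     # vertex index per pixel
    verts = points.reshape(-1, 3)
    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            if (mask[y, x] and mask[y, x + 1] and
                    mask[y + 1, x] and mask[y + 1, x + 1]):
                a, b = idx[y, x], idx[y, x + 1]
                c, d = idx[y + 1, x], idx[y + 1, x + 1]
                faces.append((a, b, c))          # upper-left triangle
                faces.append((b, d, c))          # lower-right triangle
    return verts, faces
```

Texture coordinates fall out for free: each vertex keeps the image coordinates of its source pixel.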


4.5. Trajectories in 3D

Due to imprecise camera calibration and bounding box localization, the 3D placement of players can “jitter” from frame to frame. To address this problem, we smooth the 3D trajectories of the players by minimizing an objective of the form Σ_t ‖X_t − D_t‖² + λ Σ_t ‖X_{t−1} − 2X_t + X_{t+1}‖², where D_t is the detected 3D position at frame t. The first term of the objective ensures that the estimated trajectory will be close to the original detections, and the second term encourages second-order temporal smoothness.
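The smoothing step can be sketched as a linear least-squares solve, assuming the objective combines a data term with a second-order smoothness term as the text describes; the weight `lam` is an illustrative choice.

```python
import numpy as np

def smooth_trajectory(detections, lam=5.0):
    """Smooth a (T, 3) trajectory with a second-order penalty.

    Minimizes sum_t ||x_t - d_t||^2 + lam * sum_t ||x_{t-1} - 2 x_t + x_{t+1}||^2
    (the assumed form of the Sec. 4.5 objective) by solving the normal
    equations (I + lam * D^T D) x = d, with D the second-difference operator.
    """
    d = np.asarray(detections, dtype=float)
    t = len(d)
    if t < 3:
        return d.copy()
    D = np.zeros((t - 2, t))                 # (T-2) x T second differences
    for i in range(t - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    A = np.eye(t) + lam * D.T @ D
    return np.linalg.solve(A, d)             # solved per coordinate
```

A perfectly linear trajectory has zero second difference, so it passes through unchanged; isolated detection jitter is pulled toward its neighbors.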


5. Experiments

All videos were processed on a single desktop with an i7 processor, 32 GB of RAM, and a GTX 1080 with 6 GB of memory. The full (unoptimized) pipeline takes approximately 15 seconds for a typical 4K frame with 15 players.


[Figure 5]

Synthetic Evaluation We quantitatively evaluate our approach and several others using a held-out dataset from FIFA video game captures. The dataset was created in the same way as the training data (Sec. 3) and contains 32 RGB-depth pairs of images, containing 450 players. We use the scale-invariant root mean square error (st-RMSE) [48, 10] to measure the deviation of the estimated depth values of foreground pixels from the ground truth. In this way we compensate for any scale/translation ambiguity along the camera's z-axis. We additionally report segmentation accuracy results using the intersection-over-union (IoU) metric.
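One common formulation of a scale/translation-invariant RMSE is sketched below: fit the best affine map of the prediction onto the ground truth before measuring the error. The exact definition used in [48, 10] may differ in detail (e.g., fitting in log-depth), so treat this as an assumption.

```python
import numpy as np

def st_rmse(pred, gt):
    """Scale/translation-invariant RMSE over foreground depth values.

    Fits a * pred + b to gt in the least-squares sense, removing the
    scale/translation ambiguity along the camera z-axis, then reports
    the RMSE of the residual.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)   # optimal scale/shift
    return float(np.sqrt(np.mean((a * pred + b - gt) ** 2)))
```

By construction, a prediction that differs from the ground truth only by a global scale and offset scores (near) zero.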


We compare with three different approaches: a) non-human-specific depth estimation [6], b) human-specific depth estimation [48], and c) fitting a parametric human shape model to 2D pose estimations [4]. For all of these methods, we use their publicly available code.


The input for all methods are cropped images containing soccer players. We apply the person detection and pose estimation steps, as described in Sec. 4, to the original video game images in order to find the same set of players for all methods (resulting in 432 player-depth pairs). For each detection, we crop the area around the player to use as a test image, and we get its corresponding ground truth depth for evaluation. In addition, we lift its bounding box in 3D to get the location of the player in the field and to use it for our depth estimation method (note that the bounding box is not always tight around the player, resulting in some displacement across the camera's z-axis).


The cropped images come from a larger frame with known camera parameters; therefore, the depth estimates can be placed back in the original camera's (initially empty) depth buffer. Since the depth estimates from the different methods depend on the camera settings that each method used during training, it is necessary to use scale/translation invariance metrics. In addition, we transform the output of [48] into world units by multiplying by their quantization factor (0.045 m). Note that our estimates are also in world units, since we use the exact dimensions of the field for camera calibration. For [4], we modify their code to use the same 2D pose estimates used in our pipeline [49] and we provide the camera parameters and the estimated 3D location of the player. Table 1 summarizes the quantitative results for depth estimation and player segmentation. Our method outperforms the alternatives both in terms of depth error and player coverage. This result highlights the benefit of having a training set tailored to a specific scenario.


The method of [48] assigned a large number of foreground pixels to the background. One reason is that their training data aims to capture general human appearance against cluttered backgrounds, unlike what is found in typical soccer images. Moreover, the parametric shape model [29] that is used in [48, 4] is based on scans of humans with shapes and poses not necessarily observed in soccer games. Trying to fit such a model to soccer data may result in shapes/poses that are not representative of soccer players. In addition, the parametric shape model is trained on subjects wearing little clothing, resulting in “naked” reconstructions.


YouTube videos We evaluate our approach on a collection of soccer videos downloaded from YouTube with 4K resolution. The initial sequences were trimmed to 10 video clips shot from the main game camera. Each sequence is 150-300 frames and contains various game highlights (e.g., passing, shooting, etc.) for different teams and with varying numbers of players per clip. The videos also contain typical imaging artifacts, such as chromatic aberration and motion blur, and compression artifacts.


[Figure 6]

Fig. 6 shows the depth maps of different methods on real examples. Similar to the synthetic experiment, the non-human and non-soccer methods perform poorly. The method of [4] correctly places the projections of the pose keypoints in 2D, but the estimated 3D pose and shape are often different from what is seen in the images. Moreover, the projected parametric shape does not always correctly cover the player pixels (also due to the lack of clothing), leading to incorrect texturing (Fig. 7). With our method, while we do not obtain full 3D models as in [4], the visible surfaces are modeled properly (e.g. the player's shorts). Also, after correctly texturing our 3D model, the quantization artifacts from the depth estimation are no longer evident. In principle, the full 3D models produced by [4] could enable viewing a player from a wide range of viewpoints (unlike our depth maps); however, they will lack correct texture for unseen portions in a given frame, a problem that would require substantial additional work to address.


[Figure 7]

Depth Estimation Consistency Our network is trained on players from individual frames without explicitly enforcing any temporal or viewpoint coherence. Ideally, the network should give compatible depth maps for a specific player seen at the same time from different viewpoints. In Fig. 8 we illustrate the estimated meshes on the KTH multiview soccer dataset [23], with a player captured from three different, synced cameras. Since we do not have the location of the player on the field, we use a mock-up camera to estimate the 3D bounding box of the player. The meshes were roughly aligned with manual correspondences.


[Figure 8]

In addition, for slight changes in body configuration from frame to frame, we expect the depth map to change accordingly. Fig. 9 shows reconstructed meshes for four consecutive frames, illustrating 3D temporal coherence despite frame-by-frame reconstruction.


[Figure 9]

Experiencing Soccer in 3D The textured meshes and field we reconstruct can be used to visualize soccer content in 3D. Fig. 10 illustrates novel views for three input YouTube frames, where the reconstructed players are placed in a virtual stadium. The 3D video content can also be viewed in an AR device such as a HoloLens (Fig. 1), enabling the experience of watching soccer on your tabletop.


[Figure 10]

See supplemental video.

Limitations Our pipeline consists of several steps and each one can introduce errors. Missed detections lead to players not appearing in the final reconstruction. Errors in the pose estimation can result in incorrect trajectories and segmentation masks (e.g. missing body parts). While our method can handle occlusions to a certain degree, in many cases the players overlap considerably, causing inaccurate depth estimations. We do not model jumping players since we assume that they always step on the ground. Finally, strong motion blur and low image quality can adversely affect the performance of the depth estimation network.


6. Discussion

We have presented a system to reconstruct a soccer game in 3D from a single YouTube video, and a deployment that enables viewing the game holographically on your tabletop using a HoloLens or other Augmented Reality device. The key contributions of the paper are the end-to-end system and a new state-of-the-art framework for player depth estimation from monocular video.


Going forward there are a number of important directions for future work. First, only a depth map is reconstructed per player currently, which provides a satisfactory viewing experience from only one side of the field. Further, occluded portions of players are not reconstructed. Hallucinating the opposite sides (geometry and texture) and occluded portions of players would enable viewing from any angle. Second, further improvements in player detection, tracking, and depth estimation will help reduce occasional artifacts, and reconstructing the ball in the field will enable a more satisfactory viewing of an entire game. In addition, video game data could provide additional information to learn from, e.g., temporal evolution of a player's mesh (if real-time capture is possible using a different capture engine) and jumping poses that could be detected from depth discontinuities between the player and the field.


Finally, to watch a full, live game in a HoloLens, we need both a real-time reconstruction method and a method for efficient data compression and streaming.

Acknowledgements This work is supported by NSF/Intel Visual and Experimental Computing Award #1538618 and the UW Reality Lab.



