8 EVALUATION
We evaluated the system’s accuracy and runtime efficiency through extensive experiments conducted on diverse edge devices and in real-world environments.
8.1 Testbed and Dataset
Omnidirectional 3D detection is crucial for various outdoor scenarios, such as robots navigating urban streets. However, the only publicly available datasets target autonomous driving. To collect new datasets across diverse environments, such as a campus and a city square, we used our mobile 360° camera testbed. The datasets are summarized in Table 3 and detailed below. For the public dataset, we used nuScenes (described in Section 3.1), which provides large-scale ground-truth annotations in urban driving scenes. We divided its 850 scenes into 600 for training, 100 for validation, and 150 for testing. Since the training and validation sets were used in our implementation, we conducted the experiments on the test set, which contains 6,019 labeled frames. We used six object types in the dataset (car, truck, bus, pedestrian, motorcycle, and bicycle), excluding less common types such as construction vehicles.
We built a mobile testbed based on a handheld 360° camera, as shown in Figure 9. The hardware comprises a 360° camera (Insta360 X3 [25]), a LiDAR (Velodyne VLP-16 [32]), and IMU sensors. We calibrated the sensors' extrinsics to enable coordinate transformation among them. The 360° camera generates a 5.7K-resolution equirectangular image from two fisheye lenses. Due to severe distortion and the rarity of objects in the upper and lower areas, we excluded these areas. To mitigate the inherent distortion, we split the 360° image into six regions using perspective projection [13]. We calibrated each region's intrinsics with the checkerboard method [57], treating each region as a separate virtual camera. We collected sensor data in various outdoor spaces, such as campuses and squares, and time-synchronized the data. Accurate sensor trajectories were obtained using LiDAR-IMU odometry [20] and were then used to re-train the camera motion network. On each LiDAR point cloud, we used an AI-assisted tool [48] to manually label ground-truth 3D bounding boxes for the object types selected from nuScenes. Objects containing few LiDAR points, typically due to occlusion or distance, were excluded from annotation. Due to the labor-intensive nature of the labeling process, we selectively annotated 3,000 frames.
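As a rough illustration of this preprocessing step, the sketch below renders perspective (virtual pinhole) views from an equirectangular frame. The field of view, output resolution, yaw spacing, and file name are our assumptions rather than the testbed's actual parameters.

```python
import cv2
import numpy as np

def equirect_to_perspective(equi, fov_deg, yaw_deg, pitch_deg, out_hw):
    """Sample a virtual pinhole view from an equirectangular panorama (sketch)."""
    h_out, w_out = out_hw
    h_eq, w_eq = equi.shape[:2]
    f = 0.5 * w_out / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels

    # Ray direction for every output pixel, in the virtual camera frame.
    xs, ys = np.meshgrid(np.arange(w_out), np.arange(h_out))
    dirs = np.stack([xs - w_out / 2.0, ys - h_out / 2.0,
                     np.full((h_out, w_out), f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by the view's pitch (around x) and yaw (around y).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    R_pitch = np.array([[1, 0, 0],
                        [0, np.cos(pitch), -np.sin(pitch)],
                        [0, np.sin(pitch), np.cos(pitch)]])
    R_yaw = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                      [0, 1, 0],
                      [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (R_yaw @ R_pitch).T

    # Convert directions to (longitude, latitude), then to panorama pixel coords.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    map_x = ((lon / np.pi + 1.0) / 2.0 * w_eq).astype(np.float32)
    map_y = ((lat / (np.pi / 2) + 1.0) / 2.0 * h_eq).astype(np.float32)
    return cv2.remap(equi, map_x, map_y, cv2.INTER_LINEAR)

# Six virtual cameras spaced 60 degrees apart cover the full horizontal field.
pano = cv2.imread("pano_equirect.jpg")  # hypothetical input frame
views = [equirect_to_perspective(pano, fov_deg=70, yaw_deg=60 * i,
                                 pitch_deg=0, out_hw=(704, 1024)) for i in range(6)]
```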
Figure 11: Performance comparison across scenes
8.2 Experiment Setup
Devices. We used three Jetson edge devices with varying capabilities, as summarized in Table 4. AGX Orin offers superior GPU/CPU compute power and memory capacity. Orin Nano has the least powerful GPU and the smallest memory, but features a stronger CPU than AGX Xavier [10] due to differences in hardware architecture.
Baselines. For the comparative analysis, we used the BEVDet variants described in our preliminary experiments as baselines. For a fair comparison, all baseline models were converted into FP16 TensorRT models, like Panopticus, and used batched inference for multi-view images. We also used a modified version of Panopticus as a baseline, named Panopticus-Frame, which switches the inference branch per frame rather than per camera view within each frame. Unlike Panopticus, this baseline simply selects the branch with the highest predicted accuracy that fits within the latency target, without an ILP solver. Note that even this per-frame adaptation of Panopticus-Frame has not been proposed in prior work.
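For concreteness, the per-frame selection rule of Panopticus-Frame can be sketched as follows. The branch identifiers and predictor outputs are hypothetical placeholders; the actual Panopticus scheduler instead assigns branches per camera view via an ILP solver (Section 6.2).

```python
def select_frame_branch(branches, predicted_ds, predicted_latency_ms, target_ms):
    """Panopticus-Frame baseline: pick one branch for all camera views in a frame.

    `branches` is a list of branch identifiers; `predicted_ds` and
    `predicted_latency_ms` map a branch to its predicted detection score and
    frame latency (stand-ins for the system's accuracy and latency predictors).
    """
    feasible = [b for b in branches if predicted_latency_ms[b] <= target_ms]
    if not feasible:
        # No branch fits the budget; fall back to the fastest one.
        return min(branches, key=lambda b: predicted_latency_ms[b])
    # Highest predicted accuracy among branches that meet the latency target.
    return max(feasible, key=lambda b: predicted_ds[b])
```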
Metrics. As detailed in Section 3.1, we used mAP and the TP errors (mATE and mAVE) to assess the detection performance of our system and the baselines. Prediction errors related to object size and orientation are not considered, as these errors remain consistently low across all evaluation targets. To evaluate our system and the baselines with a single metric, we combine mAP and the TP errors into a detection score (DS), similar to the nuScenes detection score. It is calculated as a weighted sum in which mAP is weighted by 6 and each TP score is weighted by 2, where each TP score is given by max(1 − TP error, 0). This weighting scheme emphasizes detection accuracy while still valuing the ability to predict objects' location and speed.
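As an illustration, DS can be computed as below. Normalizing by the total weight of 10 so that DS lies in [0, 1] is our assumption, by analogy with the nuScenes detection score.

```python
def detection_score(mAP, mATE, mAVE):
    """Detection score (DS): 6*mAP plus 2*each TP score, where a TP score is
    max(1 - TP error, 0). Division by the total weight (10) is an assumption."""
    tp_scores = [max(1.0 - err, 0.0) for err in (mATE, mAVE)]
    return (6.0 * mAP + 2.0 * sum(tp_scores)) / 10.0

# e.g., mAP = 0.40, mATE = 0.55 m, mAVE = 0.60 m/s  ->  DS = 0.41
print(detection_score(0.40, 0.55, 0.60))
```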
8.3 Performance
Overall comparison. We compared the DS and latency of Panopticus and the baselines across various devices and latency targets. For each experiment, we first launched the model adaptation process to accommodate the memory and latency constraints. For example, on Xavier, the R152 backbone and its corresponding DepthNets, among other modules loaded in memory, are removed due to memory limits. For a fair comparison with the baselines, we set the target latencies to the inference times of the baseline BEVDet models. We used the large-scale test set of nuScenes for the end-to-end performance comparison.
Figure 10 shows the overall DS and latency of our system and the baselines across all scenes in the test set. As shown in Figure 10a, on Orin, Panopticus achieved a DS improvement of 41% on average over baselines with similar latency. Notably, compared to the baseline model with a low inference time of 35ms, Panopticus improved DS by 79%. Moreover, Panopticus reduced latency by 2.6× on average, and by up to 5.2×, compared to the baseline models with similar DS. These results demonstrate that Panopticus efficiently utilizes edge computing resources by processing each camera view with an appropriate branch. Performance improvements were also observed on devices with more limited computing power, as shown in Figures 10b and 10c. Compared to the baseline models with similar latency profiles, Panopticus achieved average DS improvements of 51% and 32% on Xavier and Orin Nano, respectively. On the same devices, processing latencies decreased by 2× and 1.7× on average, respectively, compared to baselines with similar DS levels.
Panopticus-Frame also achieved a performance gain by adaptively selecting a branch for each video frame. However, compared to Panopticus, Panopticus-Frame has 7% to 28% lower DS across all devices, and its inference is 1.5× slower on average. These results show that coarse-grained branch selection leads to suboptimal performance: processing all camera views with a computationally heavy branch may exceed the latency limit, so a lower-end branch is frequently selected.
Comparison across different scenes. We evaluated performance on each scene category listed in Table 3, using AGX Orin for the experiments. We set a tight latency objective of 33ms, which corresponds to 30 frames per second (FPS), reflecting safety-critical application scenarios such as obstacle avoidance. For the baseline model, we used BEVDet with an R34 backbone, which has a latency profile of 33ms on Orin. Figure 11 compares DS and mAP for each scene type. In terms of average DS across all scenes, Panopticus outperformed the baseline model and Panopticus-Frame by 62% and 38%, respectively, under the 30 FPS constraint. The average mAP improvement was 36% over the baseline model and 18% over Panopticus-Frame. Due to the complex and dynamic nature of road environments, mAP is relatively lower in the driving scenes.
8.4 Robustness
Panopticus adjusts its operation based on the predicted spatial distribution, considering the diverse properties of surrounding objects. We analyzed the robustness of Panopticus under various circumstances. The experiments were conducted on Orin under a real-time condition of 30 FPS. As a baseline meeting this condition, we used BEVDet with the R34 backbone.
Impact of object distance and size. We analyzed how detection capability is influenced by objects' distance from the camera and their size. Experiments were conducted on our mobile 360° camera dataset, which has high diversity in object distance and size. For each video frame, we first obtained the distance (m) and size (m³) of all objects and computed the mean distance and mean size. We then defined the frame's scene complexity as the mean distance divided by the mean size, classified into one of three complexity levels. Figure 14a shows the change in mAP as the complexity level increases. The results show that mAP decreases as the level increases; the negative impact is significant when a large portion of objects are distant and small, and thus less discernible in images. Compared to the baseline model, which does not consider such spatial complexity, Panopticus improved accuracy at the High level by 56%. This effect is also observed in the prediction error of the objects' 3D locations, as shown in Figure 14b: Panopticus reduced mATE on High-complexity frames by 40% compared to the baseline model. Figure 12 displays an example frame from our mobile camera dataset, showing the branch selection of our scheduler and the resulting detection boxes. Panopticus allocated more resources to views with many distant or small objects, processing them with enhanced image feature extraction and depth estimation.
Impact of object distance and velocity. We analyzed the impact of object distance and velocity, particularly in dynamic driving scenes where objects often move quickly. We set each frame's scene complexity level based on the ratio of the objects' average velocity (m/s) to their average distance. We found that object speed barely impacts mAP. However, as shown in Figure 14c, the velocity prediction error, i.e., mAVE, increases greatly as more objects move fast at close distances. This is because fast, proximate objects exhibit large positional shifts between two video frames, as seen in Figure 13, making speed prediction challenging. Panopticus processes camera views containing such objects by fusing two consecutive BEV feature maps. As a result, Panopticus reduced mAVE by 2.2× on average compared to the baseline.
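The two complexity measures used above (mean distance over mean size, and mean velocity over mean distance) can be sketched as follows. The binning thresholds are hypothetical, since the paper does not state the level boundaries.

```python
import numpy as np

def complexity_levels(distances_m, sizes_m3, speeds_mps,
                      dist_size_bins=(2.0, 4.0), vel_dist_bins=(0.05, 0.15)):
    """Per-frame complexity levels as described in Section 8.4.

    distances_m: each object's distance from the camera (m)
    sizes_m3:    each object's 3D box volume (m^3)
    speeds_mps:  each object's speed (m/s)
    The bin thresholds are illustrative assumptions, not the paper's values.
    """
    mean_dist = float(np.mean(distances_m))
    mean_size = float(np.mean(sizes_m3))
    mean_speed = float(np.mean(speeds_mps))

    dist_over_size = mean_dist / mean_size   # high: mostly distant, small objects
    vel_over_dist = mean_speed / mean_dist   # high: fast objects at close range

    def to_level(value, bins):
        return "Low" if value < bins[0] else "Mid" if value < bins[1] else "High"

    return to_level(dist_over_size, dist_size_bins), to_level(vel_over_dist, vel_dist_bins)
```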
Impact of time of day. To compare detection performance across different times of day, we categorized all scenes from the datasets into two groups: daytime and nighttime. As shown in Figure 14d, mAP decreases at nighttime, which is attributed to motion blur in images captured under low lighting conditions. Notably, in nighttime scenes, Panopticus achieved a 63% improvement in mAP compared to the baseline, showing its robustness to varying lighting conditions.
8.5 Component Analysis
We evaluated the performance of each system component of Panopticus. Experiments were conducted using the nuScenes test set.
Accuracy predictor. Panopticus predicts the accuracy (i.e., DS) based on the expected spatial distribution to assign appropriate branches to different camera views. Figure 15a, showing the CDF of DS prediction errors, indicates that the 90th-percentile error is less than 0.06. Given that Panopticus selects optimal branches based on the relative difference in accuracy predictions across camera views, the negative impact of these errors on performance is negligible.
Latency predictor. Our scheduler selects optimal inference branches based on the predicted latencies of the system's modules, so accurate latency prediction is crucial to meet the latency constraints. Figure 15b shows how many frames were processed within the tight latency targets of 35ms, 70ms, and 80ms on Orin, Xavier, and Orin Nano, respectively. These targets correspond to the inference times of the lowest-performing baseline models shown in Figure 10. On average, Panopticus achieved 94% latency satisfaction across all devices, showing the robustness of the latency predictor. As shown in Figure 15c, the prediction error between the expected and actual latency varied across devices. The increased error on Xavier is due to the relatively large variation in the processing time of the tracker's state update, caused by its low-performance CPU, which increases the prediction error for state-update latencies. Across all devices, however, the actual latencies mostly fall below the target latencies, so the impact of latency prediction errors on meeting the constraints is minimal.
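The predictor metrics reported in Figure 15 can be summarized from logged per-frame traces with a helper like the one below; the function name and trace format are our own, not part of the system.

```python
import numpy as np

def summarize_predictors(pred_ds, true_ds, pred_lat_ms, true_lat_ms, target_ms):
    """Summarize accuracy- and latency-predictor quality from per-frame traces."""
    ds_err = np.abs(np.asarray(pred_ds) - np.asarray(true_ds))
    pred_lat = np.asarray(pred_lat_ms)
    true_lat = np.asarray(true_lat_ms)
    return {
        # 90th-percentile DS prediction error (reported as < 0.06 in Figure 15a).
        "ds_err_p90": float(np.percentile(ds_err, 90)),
        # Fraction of frames finishing within the latency target (Figure 15b).
        "latency_satisfaction": float(np.mean(true_lat <= target_ms)),
        # Mean absolute latency prediction error (Figure 15c).
        "latency_err_ms": float(np.mean(np.abs(true_lat - pred_lat))),
    }
```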
Gain with tracker's branch. Panopticus includes a lightweight branch that outputs the predicted states of tracked objects for the target camera view, thereby skipping detection. We conducted an ablation study to verify the gain of using the tracker's branch. As shown in Figure 15d, without (w/o) the tracker's branch, DS and latency-target satisfaction are reduced by 13% and 18%, respectively, compared to with (w/) the branch. This highlights the importance of the lightweight branch in effectively balancing detection accuracy and efficiency.
8.6 Overhead
We analyzed the memory and power consumption of Panopticus. For the experiment, we obtained runtime traces of memory and power consumption using tegrastats [4]. The traces were collected every 100ms while executing the system on Orin. We compared the overheads under two latency constraints, 33ms and 150ms. As shown in Table 5, Panopticus consumed 15.5 to 16.7 GB of memory, around half of Orin's memory capacity. A large portion of the memory overhead is due to the intermediate buffers used for inference acceleration with TensorRT [7]. Under the 150ms latency target, the average and peak GPU power consumption are 8.1W and 16.7W, respectively. The increased memory and power consumption under the 150ms target is due to the frequent use of powerful branches with large-scale DNN modules.
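A minimal way to collect such traces is sketched below. The tegrastats flags follow NVIDIA's utility but their availability and the log format may differ across JetPack versions; the log path and measurement window are placeholders.

```python
import subprocess
import time

LOG = "/tmp/panopticus_tegrastats.log"  # hypothetical trace location

# Start background sampling every 100 ms, matching the traces used in this section.
subprocess.run(["tegrastats", "--start", "--interval", "100", "--logfile", LOG],
               check=True)
try:
    time.sleep(60)  # run Panopticus concurrently for the measurement window
finally:
    subprocess.run(["tegrastats", "--stop"], check=True)

# The log is parsed offline for RAM usage and GPU power; field names in tegrastats
# output vary across Jetson models and JetPack versions, so parsing is device-specific.
```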