.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
HTML Rendering in Jupyter
=========================
When working in Jupyter notebooks or other environments that support rich HTML display,
DataFusion DataFrames render as formatted HTML tables. This functionality is provided by
the ``_repr_html_`` method, which Jupyter calls automatically to obtain a richer
visualization than plain text output.
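
The display protocol itself is independent of DataFusion: Jupyter looks for a ``_repr_html_`` method on the object a cell returns and renders the string that method produces. The following is a minimal sketch of that protocol. ``TinyTable`` is a hypothetical class used only for illustration and is not part of DataFusion.

```python
from html import escape


class TinyTable:
    """Hypothetical stand-in for an object with rich HTML display."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def _repr_html_(self):
        # Jupyter calls this method (if present) and renders the returned HTML.
        header = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>"


table = TinyTable(["id", "name"], [[1, "a<b"], [2, "c"]])
html_output = table._repr_html_()
```

Values are escaped before being placed in the markup so that cell content such as ``a<b`` cannot be interpreted as HTML.
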
Basic HTML Rendering
--------------------
In a Jupyter environment, simply displaying a DataFrame object will trigger HTML rendering:

.. code-block:: python

    # Will display as HTML table in Jupyter
    df

    # Explicit display also uses HTML rendering
    display(df)

Customizing HTML Rendering
---------------------------
DataFusion provides extensive customization options for HTML table rendering through the
``datafusion.html_formatter`` module.
Configuring the HTML Formatter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can customize how DataFrames are rendered by configuring the formatter:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    # Change the default styling
    configure_formatter(
        max_cell_length=25,            # Maximum characters in a cell before truncation
        max_width=1000,                # Maximum width in pixels
        max_height=300,                # Maximum height in pixels
        max_memory_bytes=2097152,      # Maximum memory for rendering (2MB)
        min_rows_display=20,           # Minimum number of rows to display
        repr_rows=10,                  # Number of rows to display in __repr__
        enable_cell_expansion=True,    # Allow expanding truncated cells
        custom_css=None,               # Additional custom CSS
        show_truncation_message=True,  # Show message when data is truncated
        style_provider=None,           # Custom styling provider
        use_shared_styles=True,        # Share styles across tables
    )

The formatter settings affect all DataFrames displayed after configuration.
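
To make the effect of ``max_cell_length`` concrete, here is a minimal sketch of cell truncation. The helper ``truncate_cell`` and the ``...`` marker are assumptions made for illustration; they do not describe DataFusion's internal implementation.

```python
def truncate_cell(value, max_cell_length=25):
    """Return (display_text, was_truncated) for a single cell value."""
    text = str(value)
    if len(text) <= max_cell_length:
        return text, False
    # Keep the first max_cell_length characters and mark the cut with an ellipsis.
    return text[:max_cell_length] + "...", True


short_text, short_cut = truncate_cell("hello")
long_text, long_cut = truncate_cell("x" * 40, max_cell_length=25)
```

A renderer with cell expansion enabled could use the ``was_truncated`` flag to decide which cells need an expand control.
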
Custom Style Providers
-----------------------
For advanced styling needs, you can create a custom style provider:

.. code-block:: python

    from datafusion.html_formatter import StyleProvider, configure_formatter

    class MyStyleProvider(StyleProvider):
        def get_table_styles(self):
            return {
                "table": "border-collapse: collapse; width: 100%;",
                "th": "background-color: #007bff; color: white; padding: 8px; text-align: left;",
                "td": "border: 1px solid #ddd; padding: 8px;",
                "tr:nth-child(even)": "background-color: #f2f2f2;",
            }

        def get_value_styles(self, dtype, value):
            """Return custom styles for specific values"""
            # Skip nulls so the comparison below cannot raise a TypeError
            if dtype == "float" and value is not None and value < 0:
                return "color: red;"
            return None

    # Apply the custom style provider
    configure_formatter(style_provider=MyStyleProvider())

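
One way to picture how a selector-to-declaration mapping like the one returned by ``get_table_styles`` could be applied is to turn it into a ``<style>`` block. This is only a sketch: the helper ``styles_to_css`` and the ``.my-table`` scope class are hypothetical, and this is not a description of how DataFusion consumes a style provider.

```python
def styles_to_css(styles, scope=".my-table"):
    """Render a {selector: declarations} mapping as scoped CSS rules."""
    rules = []
    for selector, declarations in styles.items():
        # The "table" key styles the scoped element itself; others are descendants.
        target = scope if selector == "table" else f"{scope} {selector}"
        rules.append(f"{target} {{ {declarations} }}")
    return "<style>\n" + "\n".join(rules) + "\n</style>"


css = styles_to_css(
    {
        "table": "border-collapse: collapse; width: 100%;",
        "td": "border: 1px solid #ddd; padding: 8px;",
    }
)
```

Scoping every rule under one class keeps the styles from leaking into other tables on the same page.
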
Performance Optimization with Shared Styles
--------------------------------------------
The ``use_shared_styles`` parameter (enabled by default) optimizes performance when displaying
multiple DataFrames in notebook environments:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    # Default: Use shared styles (recommended for notebooks)
    configure_formatter(use_shared_styles=True)

    # Disable shared styles (each DataFrame includes its own styles)
    configure_formatter(use_shared_styles=False)

When ``use_shared_styles=True``:

- CSS styles and JavaScript are included only once per notebook session.
- HTML output is smaller because styles are not duplicated in every table.
- Rendering is faster when many DataFrames are displayed.
- Styling is consistent across all DataFrames.

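
The idea of emitting styles once per session can be sketched as follows. ``SharedStyleRenderer`` and its ``STYLE_BLOCK`` are hypothetical and exist only to show the mechanism; the real formatter's bookkeeping may differ.

```python
class SharedStyleRenderer:
    """Hypothetical renderer that emits its <style> block only once."""

    STYLE_BLOCK = "<style>.df-table td { padding: 8px; }</style>"

    def __init__(self, use_shared_styles=True):
        self.use_shared_styles = use_shared_styles
        self._styles_emitted = False

    def render(self, table_html):
        # With shared styles, only the first rendered table carries the CSS.
        include_styles = not (self.use_shared_styles and self._styles_emitted)
        self._styles_emitted = True
        return (self.STYLE_BLOCK if include_styles else "") + table_html


shared = SharedStyleRenderer(use_shared_styles=True)
first = shared.render("<table class='df-table'></table>")
second = shared.render("<table class='df-table'></table>")

standalone = SharedStyleRenderer(use_shared_styles=False)
repeated = [standalone.render("<table class='df-table'></table>") for _ in range(2)]
```

In this sketch the second shared table is smaller than the first, while every standalone table repeats the full style block.
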
Creating a Custom Formatter
----------------------------
For complete control over rendering, you can implement a custom formatter:

.. code-block:: python

    from datafusion.html_formatter import Formatter, configure_formatter, get_formatter

    class MyFormatter(Formatter):
        def format_html(self, batches, schema, has_more=False, table_uuid=None):
            # Create your custom HTML here
            html = "<div class='my-custom-table'>"
            # ... formatting logic ...
            html += "</div>"
            return html

    # Set as the global formatter
    configure_formatter(formatter_class=MyFormatter)

    # Or use the formatter just for specific operations
    # (batches and schema are assumed to come from an existing DataFrame)
    formatter = get_formatter()
    custom_html = formatter.format_html(batches, schema)

Managing Formatters
-------------------
Reset to default formatting:

.. code-block:: python

    from datafusion.html_formatter import reset_formatter

    # Reset to default settings
    reset_formatter()

Get the current formatter settings:

.. code-block:: python

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()
    # Inspect settings using the same names passed to configure_formatter()
    print(formatter.max_cell_length)
    print(formatter.repr_rows)

Contextual Formatting
----------------------
You can also use a context manager to temporarily change formatting settings:

.. code-block:: python

    from datafusion.html_formatter import formatting_context

    # Default formatting
    df.show()

    # Temporarily use different formatting
    with formatting_context(max_rows=100, theme="dark"):
        df.show()  # Will use the temporary settings

    # Back to default formatting
    df.show()

Memory and Display Controls
---------------------------
You can control how much data is displayed and how much memory is used for rendering:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    configure_formatter(
        max_memory_bytes=4 * 1024 * 1024,  # 4MB maximum memory for display
        min_rows_display=50,               # Always show at least 50 rows
        repr_rows=20,                      # Show 20 rows in __repr__ output
    )

These parameters help balance comprehensive data display against performance considerations.
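
The sketch below shows one way a memory budget and a minimum row count might interact. The collection rule used here (stop once the next batch would exceed the budget, unless fewer than ``min_rows_display`` rows have been gathered) is an assumption made for illustration, and ``rows_to_display`` is a hypothetical helper rather than a DataFusion function.

```python
def rows_to_display(batches, max_memory_bytes, min_rows_display):
    """Count rows to show from (row_count, size_in_bytes) batch summaries."""
    rows = 0
    used = 0
    for row_count, size_in_bytes in batches:
        over_budget = used + size_in_bytes > max_memory_bytes
        # The minimum row count takes priority over the memory budget.
        if over_budget and rows >= min_rows_display:
            break
        rows += row_count
        used += size_in_bytes
    return rows


one_mb = 1024 * 1024
batches = [(30, 3 * one_mb), (30, 3 * one_mb), (30, 3 * one_mb)]
shown = rows_to_display(batches, max_memory_bytes=4 * one_mb, min_rows_display=50)
shown_small_min = rows_to_display(batches, max_memory_bytes=4 * one_mb, min_rows_display=20)
```

With a 4MB budget and 3MB batches, a minimum of 50 rows forces a second batch to be read (60 rows), while a minimum of 20 rows is already satisfied by the first batch (30 rows).
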
Best Practices
--------------
1. **Global Configuration**: Use ``configure_formatter()`` at the beginning of your notebook to set up consistent formatting for all DataFrames.
2. **Memory Management**: Set appropriate ``max_memory_bytes`` limits to prevent performance issues with large datasets.
3. **Shared Styles**: Keep ``use_shared_styles=True`` (default) for better performance in notebooks with multiple DataFrames.
4. **Reset When Needed**: Call ``reset_formatter()`` when you want to start fresh with default settings.
5. **Cell Expansion**: Use ``enable_cell_expansion=True`` when cells might contain longer content that users may want to see in full.
Additional Resources
--------------------
* :doc:`../dataframe/index` - Complete guide to using DataFrames
* :doc:`../io/index` - I/O Guide for reading data from various sources
* :doc:`../data-sources` - Comprehensive data sources guide
* :ref:`io_csv` - CSV file reading
* :ref:`io_parquet` - Parquet file reading
* :ref:`io_json` - JSON file reading
* :ref:`io_avro` - Avro file reading
* :ref:`io_custom_table_provider` - Custom table providers
* `API Reference <https://arrow.apache.org/datafusion-python/api/index.html>`_ - Full API reference