How to Download and Analyze the Air Passengers Dataset
If you are interested in learning how to perform time series analysis in Python, a good way to start is by working with a real-world dataset. In this article, we will show you how to download and analyze the air passengers dataset, which is a widely used dataset in the field of time series analysis. The dataset contains monthly airline passenger numbers from 1949 to 1960 and has been used in various studies to develop forecasting models and analyze the trends and seasonality of the data.
Introduction
What is the air passengers dataset and why use it?
The air passengers dataset consists of the number of passengers (in thousands) who traveled by air between 1949 and 1960. The dataset has 144 observations, with one observation for each month in the period. The data exhibits an upward trend, with a noticeable seasonality pattern. There is also some variability in the data.
The air passengers dataset is a good example of a time series, which is a sequence of observations recorded at regular intervals over time. Time series analysis is a branch of statistics that deals with analyzing and modeling time-dependent data. Time series analysis can be useful for understanding the behavior and dynamics of a system, identifying patterns and trends, forecasting future values, and testing hypotheses.
How to download the dataset from different sources
There are several online repositories that offer free and open datasets for data science projects. Some of them are:
Kaggle: A platform for data science competitions and learning resources. Kaggle hosts many datasets on various topics, including the air passengers dataset.
IATA: The International Air Transport Association provides passenger traffic and sales data for airlines, airports, and other organizations. IATA's data products offer granular and reliable passenger traffic numbers across all geographic regions.
R Datasets: A collection of datasets that are included with R packages or available online. R Datasets includes the air passengers dataset as part of the datasets package.
Towards Dev: A blog that covers topics related to data science, machine learning, and artificial intelligence. Towards Dev also provides tutorials and use cases for various datasets, including the air passengers dataset.
To download the dataset from any of these sources, you can follow these steps:
Navigate to the website where the dataset is stored.
Find the folder or category where the dataset is stored.
Select the dataset you need.
Click on the download icon or button.
Select the appropriate format (usually CSV) to start an immediate download for the full dataset.
How to import and visualize the dataset in Python
Once you have downloaded the dataset, you can import it into Python using pandas, which is a popular library for data manipulation and analysis. Pandas provides various functions to read different types of files, such as CSV, Excel, JSON, etc. For example, to read a CSV file containing the air passengers dataset, you can use the following code:
import pandas as pd
dataset = pd.read_csv("airline-passengers.csv")
This will create a pandas dataframe called dataset, which is a tabular data structure that can store and manipulate data. You can use the head() method to view the first few rows of the dataset:
dataset.head()
This will display something like this:
| Month | Passengers |
| --- | --- |
| 1949-01 | 112 |
| 1949-02 | 118 |
| 1949-03 | 132 |
| 1949-04 | 129 |
| 1949-05 | 121 |

You can see that the dataset has two columns: Month and Passengers. The Month column contains the date in the format YYYY-MM, and the Passengers column contains the number of passengers (in thousands) for that month. You can use the info() method to get more information about the dataset, such as the number of rows, columns, data types, and missing values:
dataset.info()
This will display something like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Month       144 non-null    object
 1   Passengers  144 non-null    int64
dtypes: int64(1), object(1)
memory usage: 2.4+ KB
You can see that the dataset has no missing values and that the Month column is of type object, which means it is stored as a string. However, since we want to perform time series analysis on the data, we need to convert the Month column into a datetime object, which is a special data type that can handle dates and times. To do this, we can use the pd.to_datetime() function and assign the result back to the Month column:
dataset['Month'] = pd.to_datetime(dataset['Month'])
Now, if we check the info() method again, we can see that the Month column is of type datetime64[ns], which means it is a datetime object:
dataset.info()
This will display something like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Month       144 non-null    datetime64[ns]
 1   Passengers  144 non-null    int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 2.4 KB
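As a small variation on the conversion above, the date format can be stated explicitly so that pandas does not have to infer it. The sketch below assumes the YYYY-MM strings shown earlier; the `sample` frame is a hypothetical stand-in for the downloaded file.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the downloaded CSV.
sample = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03"],
    "Passengers": [112, 118, 132],
})

# An explicit format avoids relying on pandas' date inference;
# errors="raise" (the default) makes malformed rows fail loudly.
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m", errors="raise")

print(sample.dtypes)
```

Each YYYY-MM string is mapped to the first day of that month, which is why the index later shows dates such as 1949-01-01.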
Next, we want to set the Month column as the index of the dataframe, which means it will be used as the row labels instead of the default numeric index. This will make it easier to access and manipulate the data based on time. To do this, we can use the set_index() method and pass the name of the column we want to use as the index:
dataset = dataset.set_index('Month')
Now, if we check the head() method again, we can see that the Month column is no longer a separate column, but rather the index of the dataframe:
dataset.head()
This will display something like this:
| Month | Passengers |
| --- | --- |
| 1949-01-01 | 112 |
| 1949-02-01 | 118 |
| 1949-03-01 | 132 |
| 1949-04-01 | 129 |
| 1949-05-01 | 121 |

Finally, we want to visualize the data using matplotlib, which is a library for creating plots and graphs in Python. Matplotlib provides various functions to create different types of charts, such as line plots, bar plots, scatter plots, etc. For example, to create a line plot of the Passengers column over time, we can use the following code:
import matplotlib.pyplot as plt
plt.plot(dataset['Passengers'])
plt.xlabel('Month')
plt.ylabel('Passengers')
plt.title('Airline Passenger Numbers from 1949 to 1960')
plt.show()
This will display a plot like this:
You can see that the plot shows a clear upward trend and a seasonal pattern in the data. The number of passengers increases over time, with peaks in the summer months and dips in the winter months. The plot also shows some fluctuations and variations in the data, which indicate some randomness and noise in the data.
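The import steps in this section (reading the file, converting Month, and setting it as the index) can also be condensed into a single call. The following is a minimal sketch, assuming a CSV with the Month and Passengers columns shown above; `csv_text` is a hypothetical excerpt used so the example runs without the downloaded file.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for the downloaded CSV file.
csv_text = """Month,Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
"""

# parse_dates converts the Month column while reading;
# index_col makes it the index in the same call.
series = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Month"],
    index_col="Month",
)["Passengers"]

# Declaring the month-start frequency tells later tools how the rows are spaced.
series = series.asfreq("MS")

print(series)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to the CSV. Setting the frequency is optional, but it can spare you frequency-related warnings from time series libraries later on.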
Time Series Analysis of the Air Passengers Dataset
Decomposing the time series into trend, seasonality, and residuals
One of the first steps in time series analysis is to decompose the time series into its components: trend, seasonality, and residuals. Trend is the long-term direction of the data, seasonality is the periodic variation of the data, and residuals are the random fluctuations of the data. Decomposing the time series can help us understand the underlying patterns and structure of the data, as well as identify any anomalies or outliers.
To decompose the time series, we can use the seasonal_decompose() function from the statsmodels library, which is a library for statistical modeling and analysis in Python. Statsmodels provides various tools and methods for time series analysis, such as smoothing, filtering, testing, modeling, and forecasting. The seasonal_decompose() function takes a time series as input and returns an object that contains the trend, seasonal, and residual components of the time series. For example, to decompose the Passengers column of our dataset, we can use the following code:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(dataset['Passengers'], model='additive')
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
This will create three new pandas Series: trend, seasonal, and residual, which contain the trend, seasonal, and residual components of the Passengers column, respectively. The model argument specifies whether the time series is additive or multiplicative. An additive time series is one where the components are added together to form the original time series, while a multiplicative time series is one where the components are multiplied together to form the original time series. In this case, we use an additive model to keep the example simple. Note, however, that the seasonal swings in this dataset grow as the overall level rises, which is the usual sign of multiplicative behavior, so model='multiplicative' (or an additive model on the log of the data) is often a better fit.
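Written out, the two forms look as follows. The symbols $T_t$, $S_t$, and $R_t$ for the trend, seasonal, and residual components are notation assumed here for illustration, not names taken from statsmodels:

$$y_t = T_t + S_t + R_t \quad \text{(additive)}$$

$$y_t = T_t \times S_t \times R_t \quad \text{(multiplicative)}$$

Taking logarithms turns the multiplicative form into an additive one, which is why a log transform is a common first step for series whose seasonal swings grow with the level:

$$\log y_t = \log T_t + \log S_t + \log R_t$$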
To visualize the components of the time series, we can use the plot() method of the decomposition object, which will create four subplots: one for the original time series, one for the trend component, one for the seasonal component, and one for the residual component. For example, to plot the decomposition of our dataset, we can use the following code:
decomposition.plot()
plt.show()
This will display a plot like this:
You can see that the plot shows how each component contributes to the original time series. The trend component shows a smooth curve that captures the long-term increase in the number of passengers over time. The seasonal component shows a repeating pattern that captures the monthly variation in the number of passengers. The residual component shows the random noise and fluctuations that are not explained by the trend or the seasonal components.
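To build intuition for what an additive decomposition does, here is a hand-rolled sketch of the classical moving-average approach using only pandas and numpy. It is an illustration under stated assumptions, not the statsmodels implementation: the series is synthetic (a linear trend plus a fixed yearly pattern), and all variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption for illustration, not the real data):
# a linear trend plus a fixed yearly pattern that sums to zero.
index = pd.date_range("1949-01-01", periods=48, freq="MS")
pattern = np.array([-10, -8, 0, 2, 3, 9, 15, 14, 5, -6, -14, -10], dtype=float)
y = pd.Series(100 + 2.0 * np.arange(48) + np.tile(pattern, 4), index=index)

# Trend: a centred 2x12 moving average, which averages out one full year.
trend = y.rolling(window=12).mean().rolling(window=2).mean().shift(-6)

# Seasonal: average the detrended values for each calendar month,
# then centre the twelve averages so they sum to zero.
detrended = y - trend
monthly_means = detrended.groupby(detrended.index.month).mean()
monthly_means = monthly_means - monthly_means.mean()
seasonal = pd.Series(monthly_means.loc[y.index.month].to_numpy(), index=y.index)

# Residual: whatever the trend and seasonal parts do not explain.
residual = y - trend - seasonal

print(seasonal.iloc[:12].round(2).tolist())
```

The trend is undefined (NaN) for the first and last six months because the centred window does not fit there; the trend returned by seasonal_decompose() has missing values at both ends for the same reason.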
Testing for stationarity and transforming the data
Another important step in time series analysis is to test for stationarity, which is a property of a time series that means it has a constant mean, variance, and autocorrelation over time. Stationarity is important because many statistical methods and models assume that the time series is stationary, or at least can be made stationary by applying some transformations. A non-stationary time series can have problems such as spurious correlations, unreliable estimates, and poor forecasts.
To test for stationarity, we can use various methods, such as plotting the rolling statistics, performing the Dickey-Fuller test, or using the KPSS test. In this article, we will use the augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that a time series has a unit root, meaning it is non-stationary. The test returns a test statistic and a p-value, which are used to reject or fail to reject the null hypothesis. A low p-value (typically less than 0.05) indicates that we can reject the null hypothesis and treat the time series as stationary. A high p-value (typically greater than 0.05) means that we cannot reject the null hypothesis; this does not prove that the series is non-stationary, but it means the data give us no evidence against a unit root.
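The rolling-statistics check mentioned above can be sketched with pandas alone. The example assumes a synthetic trending series in place of the real data; the 12-month window is a choice made here because it spans one seasonal cycle.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with an upward trend (illustrative assumption).
index = pd.date_range("1949-01-01", periods=144, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(100 + 2.0 * np.arange(144) + rng.normal(0, 5, 144), index=index)

# A 12-month window smooths over one seasonal cycle.
rolling_mean = y.rolling(window=12).mean()
rolling_std = y.rolling(window=12).std()

# For a stationary series both would stay roughly flat over time.
print(rolling_mean.dropna().iloc[[0, -1]])
print(rolling_std.dropna().iloc[[0, -1]])
```

A rolling mean that drifts steadily upward, as it does here, is an informal sign of non-stationarity; a formal test is still needed to back it up.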
To perform the Dickey-Fuller test on our dataset, we can use the adfuller() function from the statsmodels library, which takes a time series as input and returns the test statistic, p-value, and other information. For example, to perform the Dickey-Fuller test on our Passengers column, we can use the following code:
from statsmodels.tsa.stattools import adfuller
result = adfuller(dataset['Passengers'])
print('Test statistic:', result[0])
print('p-value:', result[1])
This will display something like this:
Test statistic: 0.8153688792060543
p-value: 0.9918802434376411
You can see that the p-value is very high (0.99), which means that we cannot reject the null hypothesis of a unit root. We therefore treat the Passengers column as non-stationary and apply some transformations to make it stationary.
One common transformation for making a time series stationary is differencing, which means subtracting the previous value from the current value. First-order differencing removes a (roughly linear) trend from the time series; removing seasonality usually requires an additional seasonal difference, for example at lag 12 for monthly data. To apply differencing to our dataset, we can use the diff() method of pandas, which takes an argument periods that specifies how many rows back to look when computing the difference. For example, to apply first-order differencing (periods=1) to our Passengers column, we can use the following code:
dataset['Passengers_diff'] = dataset['Passengers'].diff(1)
This will create a new column called Passengers_diff, which contains the first-order differences of the Passengers column. You can see that the first value of this column is NaN, because there is no previous value to subtract from it. To remove this NaN value, we can use the dropna() method of pandas, which removes any rows that contain NaN values. For example, to remove the NaN value from our Passengers_diff column, we can use the following code:
dataset = dataset.dropna()
Now, if we check the head() method again, we can see that the Passengers_diff column has no NaN values and contains the first-order differences of the Passengers column:
dataset.head()
This will display something like this:
| Month | Passengers | Passengers_diff |
| --- | --- | --- |
| 1949-02-01 | 118 | 6.0 |
| 1949-03-01 | 132 | 14.0 |
| 1949-04-01 | 129 | -3.0 |
| 1949-05-01 | 121 | -8.0 |
| 1949-06-01 | 135 | 14.0 |

To visualize the effect of differencing on the time series, we can plot the Passengers_diff column over time using matplotlib, as we did before for the Passengers column. For example, to plot the Passengers_diff column over time, we can use the following code:
plt.plot(dataset['Passengers_diff'])
plt.xlabel('Month')
plt.ylabel('Passengers_diff')
plt.title('First-order differences of airline passenger numbers')
plt.show()
This will display a plot like this:
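You can see that differencing removes most of the upward trend: the series now fluctuates around a roughly constant level. First-order differencing does not remove the yearly seasonal pattern, however, and in this dataset the size of the swings still tends to grow over time.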
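First-order differencing targets the trend. A common companion step, sketched here as an optional extension rather than part of the walkthrough, is to take logs and add a seasonal difference at lag 12. The series below is synthetic (exponential growth times a fixed yearly pattern), chosen so the effect is easy to verify.

```python
import numpy as np
import pandas as pd

# Synthetic series (illustrative assumption): exponential growth times a yearly pattern.
index = pd.date_range("1949-01-01", periods=60, freq="MS")
pattern = np.array([0.9, 0.88, 1.0, 0.98, 1.0, 1.1, 1.2, 1.2, 1.05, 0.95, 0.85, 0.9])
y = pd.Series(100 * np.exp(0.01 * np.arange(60)) * np.tile(pattern, 5), index=index)

# The log turns multiplicative seasonality into additive seasonality
# and stabilises the growing swings.
log_y = np.log(y)

# diff(12) compares each month with the same month a year earlier (removes seasonality);
# the following diff(1) removes what is left of the trend.
transformed = log_y.diff(12).diff(1).dropna()

print(transformed.abs().max())
```

On this noise-free example the transformed values are zero up to rounding error, because the trend and seasonal pattern are removed completely. On real data a residual series remains, and that series is what a stationarity test would be applied to.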
To confirm that the Passengers_diff column is stationary, we can perform the Dickey-Fuller test again, as we did before for the Passengers column. For example, to perform the Dickey-Fuller test on our Passengers_diff column, we can use the following code:
result = adfuller(dataset['Passengers_diff'])
print('Test statistic:', result[0])
print('p-value:', result[1])
This will display something like this:
Test statistic: -2.829266824170058
p-value: 0.054213290283824954
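You can see that the p-value is much lower (about 0.054) than before, but it is still slightly above 0.05. Strictly speaking, we can reject the null hypothesis at the 10% level but not at the 5% level, so the evidence that the Passengers_diff column is stationary is borderline. First-order differencing has moved the series much closer to stationarity; further steps such as a log transform or seasonal differencing are commonly used to finish the job. For this walkthrough, we continue with the first-order differences.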
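One practical consequence of differencing is that anything predicted on the differenced scale is a month-to-month change, not a passenger count. The sketch below shows how changes can be turned back into levels with a cumulative sum. The first six values come from the table above; the `forecast_changes` numbers are made up for illustration and are not model output.

```python
import pandas as pd

# First values of the Passengers column shown earlier in the article.
levels = pd.Series([112, 118, 132, 129, 121, 135], dtype=float)
changes = levels.diff(1).dropna()

# Rebuild the levels: start from the first observation and
# accumulate the month-to-month changes.
rebuilt = levels.iloc[0] + changes.cumsum()

# Hypothetical forecast of future changes (made-up numbers, not model output).
forecast_changes = pd.Series([10.0, -5.0, 8.0])
forecast_levels = levels.iloc[-1] + forecast_changes.cumsum()

print(forecast_levels.tolist())  # [145.0, 140.0, 148.0]
```

The same idea applies to any forecast produced from a first-differenced series: add the cumulative sum of the predicted changes to the last observed level.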
Fitting ARIMA models and forecasting future values
The final step in time series analysis is to fit a model to the data and use it to forecast future values. One of the most popular models for time series analysis is ARIMA, which stands for AutoRegressive Integrated Moving Average. ARIMA is a generalization of ARMA, which stands for AutoRegressive Moving Average. ARMA is a combination of two models: AR, which stands for AutoRegressive, and MA, which stands for Moving Average.
An AR model is a model that uses past values of the time series to predict future values. An AR model can be written as:
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$

where $y_t$ is the value of the time series at time $t$, $c$ is a constant term, $\phi_1, \phi_2, \dots, \phi_p$ are the coefficients of the model, $p$ is the order of the model, and $\epsilon_t$ is the error term at time $t$.

An MA model is a model that uses past errors of the time series to predict future values. An MA model can be written as:

$$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$

where $\theta_1, \theta_2, \dots, \theta_q$ are the coefficients of the model and $q$ is the order of the model; the other symbols are as above.

An ARMA model is a model that combines both AR and MA models. An ARMA model can be written as:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$

where $p$ is the order of the AR part and $q$ is the order of the MA part.

An ARIMA model is a model that extends ARMA by adding an integration component. Integration means differencing the time series to make it stationary, as we did before. An ARIMA model can be written as:

$$\Delta^d y_t = c + \phi_1 \Delta^d y_{t-1} + \phi_2 \Delta^d y_{t-2} + \dots + \phi_p \Delta^d y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$

where $\Delta^d y_t$ is the $d$th-order difference of the time series at time $t$, and the remaining symbols are as in the ARMA model.

An ARIMA model can be denoted by three parameters: $(p,d,q)$, where $p$ is the order of the AR model, $d$ is the degree of differencing, and $q$ is the order of the MA model. For example, an ARIMA(1,1,1) model means that we have an AR(1) model, a first-order difference ($\Delta y_t = y_t - y_{t-1}$), and an MA(1) model.

To fit an ARIMA model to our dataset, we can use the ARIMA class from statsmodels, which takes a time series and an order $(p,d,q)$ as input; calling its fit() method returns a results object that contains the fitted model and other information. Note that $d=1$ tells the model to difference its input once internally, so passing the already-differenced Passengers_diff column means the data is differenced twice overall. The more usual choices are to pass the original Passengers column with $d=1$, or the differenced column with $d=0$. For example, the following code fits an ARIMA(1,1,1) model to our Passengers_diff column:

from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(dataset['Passengers_diff'], order=(1,1,1))
model_fit = model.fit()
This will create an ARIMA object called model, which contains the parameters and data of the model, and a results object called model_fit, which contains the fitted model and other information. You can use the summary() method of the results object to get a summary of the model, such as the coefficients, standard errors, p-values, and diagnostic tests. For example, to get a summary of our model_fit object, we can use the following code:

model_fit.summary()
This will display something like this:

                               SARIMAX Results
==============================================================================
Dep. Variable:        Passengers_diff   No. Observations:                  143
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -688.747
Date:                Tue, 20 Jun 2023   AIC                           1385.494
Time:                        13:15:47   BIC                           1397.121
Sample:                    02-01-1949   HQIC                          1390.057
                         - 12-01-1960
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3086      0.088      3.497      0.000       0.136       0.482
ma.L1         -0.9999    560.890     -0.002      0.999   -1100.328    1098.328
sigma2       129.9048   7.29e+04      0.002      0.999   -1.43e+05    1.43e+05
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 5.40
Prob(Q):                              0.95   Prob(JB):                         0.07
Heteroskedasticity (H):               2.39   Skew:                             0.01
Prob(H) (two-sided):                  0.00   Kurtosis:                         3.74
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
You can see that the summary shows the coefficients and p-values of the AR and MA terms, as well as the variance of the error term (sigma2). You can also see some diagnostic tests, such as the Ljung-Box test for autocorrelation, the Jarque-Bera test for normality, and the heteroskedasticity test for constant variance. To use the fitted model to forecast future values, we can use the forecast() method of the results object, which takes an argument steps that specifies how many steps ahead to forecast. For example, to forecast the next 12 months using our model_fit object, we can use the following code:

forecast = model_fit.forecast(steps=12)
This will create a pandas Series called forecast, which contains the predicted values for the next 12 months based on our model_fit object. Because the model was fitted to the Passengers_diff column, these values are on the differenced scale (month-to-month changes) rather than passenger levels; to compare them directly with the Passengers column, they would need to be cumulatively summed and added to the last observed value. You can use the plot() function of matplotlib to plot the forecast values along with the original values, as we did before. For example, we can use the following code:

plt.plot(dataset['Passengers'])
plt.plot(forecast)
plt.xlabel('Month')
plt.ylabel('Passengers')
plt.title('Forecast of airline passenger numbers using ARIMA(1,1,1)')
plt.legend(['Original', 'Forecast'])
plt.show()
This will display a plot like this:
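You can see that the plot shows the original values in blue and the forecast values in orange, following matplotlib's default color order. When reading the forecast, keep in mind that a non-seasonal ARIMA(1,1,1) model has no seasonal terms, so it cannot reproduce the yearly summer peaks, and that a model fitted to the differenced column produces forecasts on that differenced scale.
Conclusion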
Summary of the main points
In this article, we have shown you how to download and analyze the air passengers dataset using Python and various libraries such as pandas, matplotlib, and statsmodels.
We have covered the following steps:
Downloading the dataset from different sources, such as Kaggle, IATA, R Datasets, and Towards Dev.
Importing and visualizing the dataset in Python using pandas and matplotlib.
Decomposing the time series into trend, seasonality, and residuals using statsmodels.
Testing for stationarity and transforming the data using differencing and the Dickey-Fuller test.
Fitting ARIMA models and forecasting future values using statsmodels.
We have also learned some basic concepts and techniques of time series analysis, such as time series components, stationarity, differencing, ARIMA models, and forecasting.
Recommendations for further analysis
While we have performed a simple time series analysis of the air passengers dataset, there are many more things that we can do to improve our analysis and results. Here are some recommendations for further analysis:
Explore different values of the parameters $(p,d,q)$ for the ARIMA model and compare their performance using metrics such as AIC, BIC, RMSE, MAE, etc.
Use cross-validation or train-test split to evaluate the accuracy of the model on unseen data.
Use grid search or other optimization methods to find the optimal values of the parameters $(p,d,q)$ for the ARIMA model.
Use other types of models, such as SARIMA, VAR, LSTM, etc., to capture the seasonality and other features of the data.
Use exogenous variables, such as economic indicators, weather data, etc., to improve the model and account for external factors that may affect the passenger numbers.
FAQs
What is a time series?
A time series is a sequence of observations recorded at regular intervals over time. Time series data can be found in various domains, such as economics, finance, health, engineering, etc. Time series analysis is a branch of statistics that deals with analyzing and modeling time-dependent data.
What are the advantages of using time series analysis?
Time series analysis can be useful for various purposes, such as:
Understanding the behavior and dynamics of a system over time.
Identifying patterns and trends in the data.
Forecasting future values based on past data.
Testing hypotheses and causal relationships between variables.
What are the challenges of working with time series data?
Time series data can pose some challenges for analysis, such as:
Non-stationarity: The mean, variance, and autocorrelation of the data may change over time.
Noise: The data may contain random fluctuations and outliers that obscure the underlying patterns.
Autocorrelation and multicollinearity: The data may be correlated with itself at different lags (autocorrelation) or with other explanatory variables (multicollinearity).
Complexity: The data may have multiple components, such as trend, seasonality, cycles, etc., that interact with each other.
What are some other sources of time series data?
Besides the air passengers dataset, there are many other sources of time series data that you can use for analysis and learning. Some examples are:
A website that shows how often a particular search term is entered relative to the total search volume across various regions and languages.
A website that provides various indicators and statistics on topics such as population, health, education, economy, environment, etc., for different countries and regions.
A website that offers financial and economic data from hundreds of sources, such as stock prices, exchange rates, interest rates, inflation rates, etc.
A website that hosts datasets from various domains that can be used for machine learning research and education. Some of the datasets are time series data.
How can I learn more about time series analysis in Python?
If you want to learn more about time series analysis in Python, there are many resources available online that can help you. Some examples are:
A book by Wes McKinney that covers various topics related to data analysis in Python using pandas and other libraries. The book includes a chapter on time series analysis.
A course by DataCamp that teaches the fundamentals of time series analysis in Python using pandas, statsmodels, and other libraries. The course includes interactive exercises and projects.
A book by Jason Brownlee that covers various topics related to time series forecasting in Python using various methods and models. The book includes code examples and tutorials.
A book by Jake VanderPlas that covers various topics related to data science in Python using numpy, pandas, matplotlib, scikit-learn, and other libraries. The book includes a chapter on working with time series data.
These are just some of the resources that you can use to learn more about time series analysis in Python. There are many more resources available online that can suit your needs and preferences.
I hope you enjoyed this article and learned something new. Thank you for reading!