Amazon Sales Analysis

This is a report built directly from the data of an Amazon Seller who primarily sells used books. The goal of this analysis is to gain insight into the behavior of the business and then create a web-based dashboard to display important KPIs.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import statsmodels.api as sm
import matplotlib.pyplot as plt
import re

We'll start with the listing data. First, we want to get familiar with the data set and clean it where needed.

Cleaning Listing Data

In [2]:
lists = pd.read_excel("listingdata.xlsx")
pd.set_option('display.max_columns',None)
lists.head()
Out[2]:
SKU Source Date Code ASIN Cost Per Item List Price Per Item Sales Rank Quantity Condition Notes Date Listed Total Ship Cost Ship Cost Per Item Total Cost Total List Price
0 440-29-10-2021-722K-0.06 Bulk NaN 877842892 0.06 20.00 722010.0 25 UsedGood Book has signs of use and shelf wear. Front an... 2021-10-18 9.00 0.36 1.50 10200.00
1 6/11/19-233-21.88 Online NaN 393614891 21.88 248.00 126131.0 4 UsedLikeNew Little to no signs of use. Front and back cove... 2019-06-03 3.04 0.76 87.52 992.00
2 197-21-07-2021-1.5M-0.00 Bulk NaN 875870708 0.00 506.38 1468886.0 1 UsedAcceptable Book may show heavy signs of wear along the fr... 2021-07-21 0.33 0.33 0.00 506.38
3 65P7-JT1Z Thrift NaN 133918920 0.00 400.00 NaN 1 Used Very Good Small amounts of shelf wear on the bottom of t... 2018-07-09 NaN NaN 0.00 400.00
4 399-17-02-2022-1.6M-0.06 Bulk NaN 1319217788 0.06 363.56 1562458.0 1 UsedGood Book has signs of use and shelf wear. Front an... 2021-12-16 0.33 0.33 0.06 363.56
In [3]:
print(lists.shape)
lists.isna().sum()
(8646, 15)
Out[3]:
SKU                       0
Source                    0
Date Code              8646
ASIN                      0
Cost Per Item           183
List Price Per Item       0
Sales Rank              418
Quantity                  0
Condition                 0
Notes                     0
Date Listed               0
Total Ship Cost         495
Ship Cost Per Item      495
Total Cost                0
Total List Price          0
dtype: int64

So we clearly have some NA values. Here's the approach we'll use for cleaning this data:

  • We'll delete the 'Date Code' column, as it's basically pointless: every one of the 8,646 rows is blank
  • We'll fill the 'Cost Per Item' column with the average cost per item
  • For both ship cost columns, we'll fill with the column averages

That leaves the Sales Rank. Amazon sales rank is a metric that describes how well a product is selling at a snapshot in time: the #1 best-selling book has a sales rank of 1, the 100th best-selling book has a sales rank of 100, and so on, as of the moment the data is pulled. Let's look into the sales rank a bit more before we choose how to handle it.

In [4]:
print(lists['Sales Rank'].describe())
salesbox = px.box(lists,x='Sales Rank')
salesbox.show()

Considering the max rank is ~22.8 million while the 75th percentile is only ~940,000, the median is the more accurate value to fill the blanks with. That 22-million-plus outlier heavily skews the mean, as the box plot shows, whereas the median is robust to it.
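To make the skew concrete, here's a quick numeric illustration (using made-up ranks, not the actual data) of how a single extreme value drags the mean while leaving the median untouched:

import numpy as np

# Hypothetical ranks: four typical values plus one extreme outlier
ranks = np.array([50_000, 200_000, 400_000, 900_000, 22_000_000])
print(ranks.mean())      # 4,710,000 -- pulled far upward by the outlier
print(np.median(ranks))  # 400,000 -- unaffected by the outlier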

In [5]:
lists['Sales Rank'] = lists['Sales Rank'].fillna(lists['Sales Rank'].median())
lists.isna().sum()
Out[5]:
SKU                       0
Source                    0
Date Code              8646
ASIN                      0
Cost Per Item           183
List Price Per Item       0
Sales Rank                0
Quantity                  0
Condition                 0
Notes                     0
Date Listed               0
Total Ship Cost         495
Ship Cost Per Item      495
Total Cost                0
Total List Price          0
dtype: int64
In [6]:
lists = lists.drop(columns='Date Code', axis=1)
print(lists.columns.values)
['SKU' 'Source' 'ASIN' 'Cost Per Item' 'List Price Per Item' 'Sales Rank'
 'Quantity' 'Condition' 'Notes' 'Date Listed' 'Total Ship Cost'
 'Ship Cost Per Item' 'Total Cost' 'Total List Price']
In [7]:
lists['Cost Per Item'] = lists['Cost Per Item'].fillna(lists['Cost Per Item'].mean())
lists.isna().sum()
Out[7]:
SKU                      0
Source                   0
ASIN                     0
Cost Per Item            0
List Price Per Item      0
Sales Rank               0
Quantity                 0
Condition                0
Notes                    0
Date Listed              0
Total Ship Cost        495
Ship Cost Per Item     495
Total Cost               0
Total List Price         0
dtype: int64
In [8]:
lists['Total Ship Cost'] = lists['Total Ship Cost'].fillna(lists['Total Ship Cost'].mean())
lists['Ship Cost Per Item'] = lists['Ship Cost Per Item'].fillna(lists['Ship Cost Per Item'].mean())
lists.isna().sum()
Out[8]:
SKU                    0
Source                 0
ASIN                   0
Cost Per Item          0
List Price Per Item    0
Sales Rank             0
Quantity               0
Condition              0
Notes                  0
Date Listed            0
Total Ship Cost        0
Ship Cost Per Item     0
Total Cost             0
Total List Price       0
dtype: int64

Exploring the Listing Data

We'll start with some basic analysis and exploration of the listing data.

In [9]:
lists.describe()
Out[9]:
Cost Per Item List Price Per Item Sales Rank Quantity Total Ship Cost Ship Cost Per Item Total Cost Total List Price
count 8646.000000 8646.000000 8.646000e+03 8646.000000 8646.000000 8646.000000 8646.000000 8646.000000
mean 2.354808 30.506833 6.676401e+05 1.015036 0.435269 0.427242 2.251007 32.325003
std 4.922704 24.575828 9.735394e+05 0.309862 0.267764 0.224111 5.520884 112.943751
min 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 16.550000 9.530275e+04 1.000000 0.300000 0.300000 0.000000 16.552500
50% 1.900000 28.000000 3.828970e+05 1.000000 0.360000 0.360000 1.400000 28.500000
75% 2.990000 38.000000 9.055845e+05 1.000000 0.540000 0.520000 2.930000 38.000000
max 79.360000 506.380000 2.280078e+07 25.000000 9.000000 1.150000 149.280000 10200.000000
In [10]:
# Total Listings by Condition
salescondition = lists.groupby('Condition').count().reset_index()
salesconditionfig = px.bar(salescondition,x='Condition',y='SKU')
salesconditionfig.show()

So there are some issues with duplicate condition labels (e.g. 'UsedGood' vs. 'Used Very Good'). To solve this, I'm going to remove the 'Used' prefix and add proper spacing to 'Like New' and 'Very Good'.

In [11]:
lists['Condition'] = lists['Condition'].str.replace('Used ', '')
lists['Condition'] = lists['Condition'].str.replace('Used', '')
lists['Condition'] = lists['Condition'].str.replace('LikeNew', 'Like New')
lists['Condition'] = lists['Condition'].str.replace('VeryGood', 'Very Good')
print(lists['Condition'].unique())

salescondition = lists.groupby('Condition').count().reset_index()
salesconditionfig = px.bar(salescondition,x='Condition',y='SKU',title='Total Listings by Condition',template='seaborn',labels={"SKU":"Listings"})
salesconditionfig.update_xaxes(categoryorder = 'array',categoryarray=['Acceptable','Good','Very Good','Like New','New'])
salesconditionfig.show()
print('Total Listings in "Good" Condition:')
print(lists['Condition'][lists['Condition'] == 'Good'].count())
print('or')
print(((lists['Condition'][lists['Condition'] == 'Good'].count())/lists['SKU'].count()*100),'%')
['Good' 'Like New' 'Acceptable' 'Very Good' 'New']
Total Listings in "Good" Condition:
5185
or
59.969928290538974 %
In [12]:
#Listings per week
listings = lists.copy()  # copy so the aggregations below don't mutate the original frame
listings['Date Listed'] = pd.to_datetime(listings['Date Listed'])
weeklylistings = listings.groupby(listings['Date Listed'].dt.to_period('W')).count()
weeklylistings = weeklylistings.resample('W').asfreq().fillna(0)
weeklylistingsfig = px.line(weeklylistings,x=weeklylistings.index.to_timestamp(),y='SKU', title='Weekly Listings', labels={"SKU":'Total Listings'},template='seaborn')
weeklylistingsfig.update_layout(xaxis_title=None)
weeklylistingsfig.show()
In [13]:
#Listing count and average list price per year
yearlylistings = listings.groupby(listings['Date Listed'].dt.to_period('Y')).count()
avglistpriceyearly = listings.groupby(listings['Date Listed'].dt.to_period('Y')).mean(numeric_only=True)  # numeric_only avoids errors on text columns
print(yearlylistings['SKU'])
print(avglistpriceyearly['List Price Per Item'])
Date Listed
2018     962
2019    3090
2020    2498
2021    2096
Freq: A-DEC, Name: SKU, dtype: int64
Date Listed
2018    32.988285
2019    34.964858
2020    29.662062
2021    23.802533
Freq: A-DEC, Name: List Price Per Item, dtype: float64

Looks like listings became more inconsistent week to week, but arrived in larger batches (more units per shipment).
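One way to check the "more units per shipment" reading is to average the listing quantity per year. This check isn't in the original notebook, so treat it as a sketch:

# Average units per listing by year -- a rough proxy for batch size
qty_by_year = listings.groupby(listings['Date Listed'].dt.to_period('Y'))['Quantity'].mean()
print(qty_by_year)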

In [14]:
#Sales rank histogram
srhist = px.histogram(lists,x='Sales Rank')
srhist.show()

Let's filter out any book with a sales rank over 3 million for this histogram.

I was getting errors when trying to filter the sales rank. Checking the dtype, 'Sales Rank' comes through as an object column. The error is shown below:

ValueError: invalid literal for int() with base 10: '100K'

So some sales ranks are stored as strings like '100K', '1M', etc. We need to convert those values to numbers before filtering.
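The conversion cell isn't shown here, so below is a minimal sketch of what it might look like. parse_rank is a hypothetical helper, and it assumes the suffixed strings always end in a single 'K' or 'M':

def parse_rank(value):
    # Hypothetical helper: convert strings like '100K' or '1.5M' to floats;
    # numeric values pass through unchanged.
    if isinstance(value, str):
        multipliers = {'K': 1_000, 'M': 1_000_000}
        suffix = value[-1].upper()
        if suffix in multipliers:
            return float(value[:-1]) * multipliers[suffix]
    return float(value)

lists['Sales Rank'] = lists['Sales Rank'].apply(parse_rank)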

In [15]:
lowsr = lists[lists['Sales Rank']<=3000000]
lowsrhist = px.histogram(lowsr,x='Sales Rank',nbins=30,template='seaborn',title='Sales Rank Histogram',labels={'count':'Listings'})
lowsrhist.show()

Taking note of what we know:

  • About 60% of the books listed are in 'Good' condition
  • Most of the books are listed with a sales rank below ~500k
  • Both the number of listings and the average list price have steadily decreased year over year

It will be interesting to look at the sales data. My hypothesis is that sales have been steadily declining since 2019: the lower average list price and lower listing volume year over year should translate into declining sales.

Sales Data

First, we want to get familiar with the dataset and see what we're working with.

In [16]:
sales = pd.read_excel("salesdata.xlsx")
pd.set_option('display.max_columns',None)
print(sales.shape)
sales.head()
(8132, 28)
Out[16]:
date/time settlement id type order id description quantity marketplace account type fulfillment order city order state order postal tax collection model product sales product cost product sales tax shipping credits shipping credits tax gift wrap credits giftwrap credits tax promotional rebates promotional rebates tax Marketplace Withheld Tax selling fees fba fees other tran fees other total
0 Jun 2, 2021 9:35:37 AM PDT 14323467191 Order 114-6318818-1993044 Algebra 2, Student Edition (MERRILL ALGEBRA 2) 1 amazon.com Standard Orders Amazon TUCSON AZ 85747-9073 MarketplaceFacilitator 19.1 -18.66 1.66 0.0 0.0 0.0 0.0 0.0 0.0 -1.66 -4.67 -6.58 0.0 0 -141.43
1 Jun 2, 2021 7:32:44 PM PDT 14323467191 Order 113-4128200-8189858 Algebra 2, Student Edition (MERRILL ALGEBRA 2) 1 amazon.com Standard Orders Amazon PENDLETON IN 46064-9216 MarketplaceFacilitator 19.1 -18.66 1.34 0.0 0.0 0.0 0.0 0.0 0.0 -1.34 -4.67 -6.58 0.0 0 -141.43
2 Jun 4, 2021 2:51:33 AM PDT 14323467191 Order 111-1473468-9293007 Algebra 2, Student Edition (MERRILL ALGEBRA 2) 1 amazon.com Standard Orders Amazon LOS ANGELES CA 90006-2420 MarketplaceFacilitator 19.1 -18.66 1.81 0.0 0.0 0.0 0.0 0.0 0.0 -1.81 -4.67 -6.58 0.0 0 -141.43
3 Jun 4, 2021 1:23:40 PM PDT 14323467191 Order 112-4297224-2108216 Algebra 2, Student Edition (MERRILL ALGEBRA 2) 1 amazon.com Standard Orders Amazon SIERRA MADRE CA 91024-1439 MarketplaceFacilitator 19.1 -18.66 1.96 0.0 0.0 0.0 0.0 0.0 0.0 -1.96 -4.67 -6.58 0.0 0 -141.43
4 Jun 6, 2021 7:55:05 AM PDT 14323467191 Order 113-1136356-2776221 Algebra 2, Student Edition (MERRILL ALGEBRA 2) 1 amazon.com Standard Orders Amazon ALHAMBRA CA 91801-1746 MarketplaceFacilitator 19.1 -18.66 1.96 0.0 0.0 0.0 0.0 0.0 0.0 -1.96 -4.67 -6.58 0.0 0 -141.43
In [17]:
sales.isna().sum()
Out[17]:
date/time                   0
settlement id               0
type                        0
order id                    0
description                 0
quantity                    0
marketplace                 0
account type                0
fulfillment                 0
order city                  0
order state                 0
order postal                2
tax collection model        0
product sales               0
product cost                0
product sales tax           0
shipping credits            0
shipping credits tax        0
gift wrap credits           0
giftwrap credits tax        0
promotional rebates         0
promotional rebates tax     0
Marketplace Withheld Tax    0
selling fees                0
fba fees                    0
other tran fees             0
other                       0
total                       0
dtype: int64
In [18]:
sales.describe()
Out[18]:
settlement id quantity product sales product cost product sales tax shipping credits shipping credits tax gift wrap credits giftwrap credits tax promotional rebates promotional rebates tax Marketplace Withheld Tax selling fees fba fees other tran fees other total
count 8.132000e+03 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.000000 8132.0 8132.000000
mean 1.298230e+10 1.000123 18.787409 -2.526578 0.790920 0.820627 0.017381 0.001227 0.000049 -0.194108 0.008328 -0.816677 -4.635160 -4.757322 -0.023253 0.0 7.344343
std 1.118912e+09 0.011089 18.313503 6.127053 1.219104 2.474425 0.105221 0.078251 0.004436 1.002575 0.178240 1.242031 2.766585 2.663814 0.149940 0.0 15.174279
min 1.059662e+10 1.000000 1.080000 -139.770000 0.000000 0.000000 0.000000 0.000000 0.000000 -9.640000 0.000000 -38.360000 -76.050000 -53.460000 -0.990000 0.0 -141.430000
25% 1.207199e+10 1.000000 9.060000 -2.930000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.060000 -4.862500 -5.260000 0.000000 0.0 1.017500
50% 1.280267e+10 1.000000 13.255000 -2.090000 0.630000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.640000 -3.810000 -4.760000 0.000000 0.0 3.490000
75% 1.374972e+10 1.000000 20.310000 0.000000 1.010000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -3.180000 -3.280000 0.000000 0.0 8.792500
max 1.533458e+10 2.000000 494.990000 0.000000 38.360000 49.980000 2.830000 4.990000 0.400000 0.000000 11.300000 0.000000 0.000000 0.000000 0.000000 0.0 413.990000

The dataset looks pretty good: no NaN values in the important columns, no extreme min or max values, and since all the counts are the same, there are no blanks either. I want to quickly rename the 'date/time' column to just 'date'.

We can also see that the mean sale price is about $18.79, with an average product cost of about $2.53 (recorded as a negative in the data).
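Those two figures can be pulled straight from the frame (note that product cost is recorded as a negative amount):

print(sales['product sales'].mean())  # ~18.79
print(sales['product cost'].mean())   # ~-2.53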

In [19]:
sales = sales.rename(columns={'date/time':'date'})
print(sales.columns.values)
['date' 'settlement id' 'type' 'order id' 'description' 'quantity'
 'marketplace' 'account type' 'fulfillment' 'order city' 'order state'
 'order postal' 'tax collection model' 'product sales' 'product cost'
 'product sales tax' 'shipping credits' 'shipping credits tax'
 'gift wrap credits' 'giftwrap credits tax' 'promotional rebates'
 'promotional rebates tax' 'Marketplace Withheld Tax' 'selling fees'
 'fba fees' 'other tran fees' 'other' 'total']

Analysis

In [20]:
salesgroup = sales.copy()  # copy so the aggregations below don't mutate the original frame
In [21]:
salesgroup['date'] = pd.to_datetime(salesgroup['date'])
weeklysales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).sum(numeric_only=True)
weeklysales = weeklysales.resample('W').asfreq().fillna(0)
weeklyfig = px.bar(weeklysales,x=weeklysales.index.to_timestamp(),y='product sales', title='Weekly Sales', labels={"product sales":'Total Sales'},template='seaborn')
weeklyfig.update_layout(xaxis_title=None)
weeklyfig.show()

First few things to note:

  • Overall sales are on a downward trend
  • There is a spike in sales every January and August, most likely because school semesters start around then and demand for books rises (a quick check follows this list)
  • The most significant drops in sales occur in May and October
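As a quick check on the seasonality claim, we can pool sales by calendar month. This aggregation isn't in the original notebook, but if the pattern holds, months 1 and 8 should stand out:

# Total sales by calendar month, pooled across all years
monthly = salesgroup.groupby(salesgroup['date'].dt.month)['product sales'].sum()
print(monthly)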

Let's see if we can look at average sales price over time as well as the amount of products sold over time.

In [22]:
# Average sales over time
avgsales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).mean(numeric_only=True)
avgsales = avgsales.resample('W').asfreq().fillna(0)
meanfig = px.bar(avgsales,x=avgsales.index.to_timestamp(),y='product sales', title='Average Sales Price', labels={"product sales":'Average Sales Price'},template='seaborn')
meanfig.update_layout(xaxis_title=None)
meanfig.show()

# Amount of sales over time
cntsales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).count()
cntsales = cntsales.resample('W').asfreq().fillna(0)
countfig = px.bar(cntsales,x=cntsales.index.to_timestamp(),y='order id', title='Items Sold', labels={"order id":'Items Sold'},template='seaborn')
countfig.update_layout(xaxis_title=None)
countfig.show()

# Average buy cost over time
avgsalescost = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).mean(numeric_only=True)
avgsalescost = avgsalescost.resample('W').asfreq().fillna(0)
costfig = px.bar(avgsalescost,x=avgsalescost.index.to_timestamp(),y='product cost', title='Product Cost', labels={"product cost":'Product Cost'},template='seaborn')
costfig.update_layout(xaxis_title=None)
costfig.update_yaxes(autorange="reversed")
costfig.show()

These three charts shine a lot of light on the situation. First, the January and August pattern shows up again in the items-sold chart. Second, it seems like a main driver of the total sales trend is the falling average sale price; the average product cost decreased significantly along with it, so perhaps a new low-cost, high-volume business model was attempted. Intuitively, though, total sales should correlate most strongly with the number of items sold. We can calculate that correlation.

In [23]:
corr_sales_count = cntsales['order id'].corr(weeklysales['product sales'])

print("Correlation:", corr_sales_count)
Correlation: 0.9045558679104653

We know sales are decreasing, but it would be nice to see how much they've decreased year over year.

It should be noted that the dataset includes only the first half of January 2022, so we should expect a large decrease from 2021 to 2022. Similarly, only the second half of 2018 is covered, so 2019 shows an outsized increase over 2018.
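To make the coverage explicit, here are the first and last transaction dates per year (a sanity check that isn't in the original notebook):

coverage = salesgroup.groupby(salesgroup['date'].dt.year)['date'].agg(['min', 'max'])
print(coverage)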

In [24]:
yearlysales = salesgroup.groupby(salesgroup['date'].dt.to_period('Y')).sum(numeric_only=True)
yearlysales = yearlysales.resample('Y').asfreq().fillna(0)
yoychange = (yearlysales.pct_change()*100)
print("Total sales and profit by year")
print(yearlysales['product sales'])
print(yearlysales['total'])
print("")
print("Year over year gross sales change:")
print(yoychange['product sales'])
print("")
print('Year over year profit change:')
print(yoychange['total'])
Total sales and profit by year
date
2018     8230.41
2019    65885.05
2020    44213.13
2021    32535.58
2022     1915.04
Freq: A-DEC, Name: product sales, dtype: float64
date
2018     4472.58
2019    26967.19
2020    16974.41
2021    10577.49
2022      732.53
Freq: A-DEC, Name: total, dtype: float64

Year over year gross sales change:
date
2018           NaN
2019    700.507508
2020    -32.893532
2021    -26.411950
2022    -94.114013
Freq: A-DEC, Name: product sales, dtype: float64

Year over year profit change:
date
2018           NaN
2019    502.944833
2020    -37.055325
2021    -37.685669
2022    -93.074633
Freq: A-DEC, Name: total, dtype: float64

We can clearly see that 2019 was the best-performing year. Each full year since then has decreased by an average of ~30% in gross sales and ~37% in profit.
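As a quick check on those averages, using only the full-year comparisons (2020 and 2021; the 2019 and 2022 figures compare partial years):

full_years = yoychange.loc['2020':'2021']
print(full_years['product sales'].mean())  # ~-29.7% average gross sales change
print(full_years['total'].mean())          # ~-37.4% average profit change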