Text Analysis

https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. It includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the following variables:

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.

Age: Positive Integer variable of the reviewer's age.

Title: String variable for the title of the review.

Review Text: String variable for the review body.

Rating: Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).

Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.

Division Name: Categorical name of the product's high-level division.

Department Name: Categorical name of the product's department.

Class Name: Categorical name of the product's class.

Q1 Perform:

a. Text extraction & creating a corpus

b. Text pre-processing

c. Create the DTM & TDM from the corpus

d. Exploratory text analysis

e. Feature extraction by removing sparsity

f. Build the classification models and compare Logistic Regression to Random Forest
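
Steps a through e can be sketched with the Python standard library alone. This is a minimal illustration, not a reference solution: the four reviews, the stop-word list, the helper names (preprocess, build_dtm, transpose, remove_sparse_terms) and the 0.5 sparsity threshold are all assumptions made for the example. In practice the corpus would come from the Review Text column of the dataset.

```python
import re
from collections import Counter

# Hypothetical reviews standing in for the "Review Text" column.
reviews = [
    "Love this dress! It fits perfectly and the fabric is soft.",
    "The dress runs small. I returned it.",
    "Soft fabric, great fit. Love it!",
    "Poor quality; the fabric ripped after one wash.",
]

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "after", "and", "i", "is", "it", "one", "the", "this"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_dtm(docs):
    """Return (vocabulary, DTM), where DTM[i][j] counts term j in document i."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    dtm = [[c[t] for t in vocab] for c in counts]
    return vocab, dtm


def transpose(matrix):
    """The TDM is the transpose of the DTM (terms as rows, documents as columns)."""
    return [list(col) for col in zip(*matrix)]


def remove_sparse_terms(vocab, dtm, max_sparsity):
    """Keep terms whose share of documents with a zero count is at most max_sparsity."""
    n_docs = len(dtm)
    keep = [
        j for j in range(len(vocab))
        if sum(1 for row in dtm if row[j] == 0) / n_docs <= max_sparsity
    ]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in dtm]


vocab, dtm = build_dtm(reviews)   # steps a-c: corpus, pre-processing, DTM
tdm = transpose(dtm)              # step c: TDM

# Step d: total frequency of each term across the corpus.
term_totals = dict(zip(vocab, (sum(row) for row in tdm)))

# Step e: drop terms that are absent from more than half of the documents.
dense_vocab, dense_dtm = remove_sparse_terms(vocab, dtm, max_sparsity=0.5)
print(dense_vocab)
```

The rows of dense_dtm could then serve as the feature matrix for step f, with Recommended IND as one possible target variable.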

https://medium.com/analytics-vidhya/customer-review-analytics-using-text-mining-cd1e17d6ee4e

To understand how hypothesis testing works, it is important to note that it is a fundamental procedure for evaluating two mutually exclusive statements about a given population in order to determine which statement is better supported by the sample data (Bonett & Wright, 2015). A confidence interval, in turn, is a range of values that is likely to contain an unknown population parameter. The two ideas can be used together, for example to examine the body temperature of a given population at a given time.
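
As a rough illustration of how the two ideas fit together, the sketch below runs a two-sided test and builds a 95% confidence interval for mean body temperature. The ten temperature readings are hypothetical, the null value of 98.6 degrees Fahrenheit is assumed for the example, and the normal approximation is a simplification (with a sample this small, a t-distribution would usually be preferred).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical body temperatures in degrees Fahrenheit (illustrative only).
temps = [98.2, 98.6, 97.9, 98.4, 98.0, 98.7, 98.1, 98.3, 97.8, 98.5]

mu_0 = 98.6   # null hypothesis: the population mean is 98.6 (assumed value)
alpha = 0.05  # significance level; the interval below has level 1 - alpha

n = len(temps)
x_bar = mean(temps)
se = stdev(temps) / sqrt(n)  # standard error of the sample mean

# Two-sided z-test of H0: mu = mu_0 against H1: mu != mu_0
# (normal approximation, a simplifying assumption).
z = (x_bar - mu_0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = p_value < alpha

# Confidence interval built from the same standard error.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ci = (x_bar - z_crit * se, x_bar + z_crit * se)

print(f"mean = {x_bar:.2f}, z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Under these assumptions the two results always agree: the test rejects the null hypothesis at level alpha exactly when the null value falls outside the corresponding confidence interval.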