This study presents a comprehensive evaluation of bias and fairness in Large Language Models (LLMs), including ChatGPT-4, Google Gemini, and Llama 2, using the Google BIG-Bench benchmark. Our analysis reveals varying degrees of bias across the models, with disparities especially pronounced along dimensions such as gender, race, and ethnicity. BIG-Bench proved instrumental in surfacing these biases, although its effectiveness is limited by the difficulty of capturing the subtler manifestations of bias that emerge in real-world contexts. Comparative analysis shows that while each model exhibits strengths in certain areas, no single model uniformly excels across all fairness and bias metrics. The study underscores the delicate balance between model performance, fairness, and efficiency, highlighting the need for continued research and development in AI ethics to mitigate bias effectively. These results argue for a multifaceted approach to AI development that integrates ethical considerations at every stage to ensure the equitable advancement of the technology, and they call for continued innovation in model training and benchmarking methodologies to improve the fairness and inclusivity of future LLMs.